March 31, 2005 at 2:27 pm
Is it possible to use Top N Percent with Group By? I have a table with approximately 50 types of datasets and I need to find the top 90% of EACH of those types (not top 90% of the combined total). Can SQL Server do this, or must I (gulp) run the query and store the results 50 times?
Thanks,
Heather
March 31, 2005 at 3:30 pm
A simple bulldozer approach might be to select each type into a #TempTable and then UNION ALL into a final #TempTable for output.
Otherwise, it sounds to me like you are going to need to isolate each type of dataset (SELECT DISTINCT) into a #TempTable and then loop through each of those types to get your TOP 90%. This may actually be a better option since if a new type is added, then this should handle that situation....
I wasn't born stupid - I had to study.
March 31, 2005 at 3:34 pm
Thanks, Farrell. I had a feeling that would be the answer!
March 31, 2005 at 3:34 pm
Assuming that your key field is not the top 10 percent field, I think it would be something like below. I maybe overly complicating it...but this is how I would do it. There may be a more efficient way. I'm sure someone will let us know.
SELECT groupfield1, groupfield2, ..... FROM tablename WHERE keyfield IN (SELECT keyfield FROM tablename WHERE numfield IN (SELECT TOP 10 PERCENT numfield FROM tablename)) GROUP BY groupfield1, groupfield2, .....
Again there may be a better way. I hope there is. And if your keyfield is what your trying to get then you should be able to get rid of the second subquery. Let us know.
----------------
Jim P.
A little bit of this and a little byte of that can cause bloatware.
March 31, 2005 at 3:41 pm
Hmm, interesting. To make it more clear, what I have is an enormous table of records (over 8 million to be exact) that contain records on every state in the US. So for instance, I may have a 30,000 records for Virginia and 50,000 for Texas.
I am doing a sort by a particular numerical value - let's say population, just as an example. So I need to find the top 90% of records for every state by population density.
The problem is that when I try to get all the records using group by State, I get the top 90% of the total - not the top 90% within each state. Unfortunately, it really doesn't look like I can tell SQL to perform the 90% calculation with the group by results, so I suppose I'll have to loop through them one by one.
March 31, 2005 at 3:59 pm
Don't give up. I'm sure there is a way to do this with a set based query, I just can't get my brain wrapped around it just now (too close to going home time ). I would imagine that someone will post a set based query, but if not, I'll look at it more tomorrow.
Steve
March 31, 2005 at 4:08 pm
Thanks, guys.
The good news is, I only have to do this once.
March 31, 2005 at 4:37 pm
You can't use GROUP BY unless all non grouped columns are agregate results. If you want the 90% rows actually individually reported you have to drop GROUP BY and go with just an ordered selection. Is this what you are after?
SELECT A.State, A.Region, A.PopulationDensity
FROM MyTable A
WHERE A.Region IN ( SELECT TOP 90 PERCENT B.Region
FROM MyTable B
WHERE B.State = A.State
ORDER BY B.PopulationDensity )
ORDER BY A.State, A.PopulationDensity
April 1, 2005 at 8:27 am
I know I can use a subquery to get the top 90% of records from a single state and then perform my state summary calculations from there. I was just hoping that I could get the top 90% of each state's records in a single dataset using a group by clause.
What I may end up doing at this point is just outputting the results I need one by one into a new table (unfortunately, with milions of records, a temp table would be tough to use) and then query those new tables respectively.
-heather
April 1, 2005 at 9:52 am
What about creating views (as your groups) and querying (UNION) the views?
-Isaiah
April 1, 2005 at 10:08 am
That's also a possibility. Because I have 8 million records and I need to grab different slices of percentiles per state, I'm just going to create a few additional tables to hold the state information I need.
For example:
--- must run for each state
SELECT TOP 10 PERCENT *
INTO tblStateCensus_top10percent
FROM tblStateCensus
WHERE state = 'AL'
ORDER BY population DESC
SELECT TOP 5 PERCENT *
INTO tblStateCensus_top5percent
FROM tblStateCensus
WHERE state = 'AL'
ORDER BY population DESC
--- run once after all states are run in above queries
SELECT *
INTO tblStateCensus_bottom90percent
FROM tblStateCensus
WHERE id not in (select id from tblStateCensus_top10percent)
SELECT *
INTO tblStateCensus_bottom95percent
FROM tblStateCensus
WHERE id not in (select id from tblStateCensus_top5percent)
Using the 'not in' for the additional percentages to avoid overlap of records.
I could write some code to loop through this, or I could just run the top half of that 50 or so times. Probably the latter will be faster.
As I said, this is one-time data analysis for journalism purposes so the queries don't need to be optimized for performance.
Thank you all for your help!
Heather
April 1, 2005 at 10:59 am
I've had to do something like this before. Using the derived table concept should simplify your life.
select top 90 percent
from
(select fld1,fld2 from MyTestTable group by fld1,fld2)
In this example, do your grouping sql statement, make that the derived table, then select 90 percent of those results.
Good Luck
--Todd
April 1, 2005 at 11:11 am
Unfortunately I can't do that in this case, because I need the top percent to be done within the groupings - so I need the top 90% of records for alabama, and of arkansas, and so on. That gives me the top 90% of the total record number.
It's okay, guys. I'm just outputting the records into a couple of different tables. I'll be finished by the time I figure out a better way to do it. Since it's for a journalism purpose, I don't have much time.
Thanks for all of your help!
Heather
April 1, 2005 at 11:28 am
Instead of tables I would do views. Less data space and once you create the first one in the QA such as:
CREATE VIEW Top_90_AK
AS
SELECT TOP 90 PERCENT *
FROM TableName
WHERE STATE = 'AK'
Then all you have to do is a search and replace for AK with AL and so on.
----------------
Jim P.
A little bit of this and a little byte of that can cause bloatware.
April 1, 2005 at 12:16 pm
Here's somthing to play with (I wouldn't recommend it on a 8 M lines table though).
SELECT O.XType
--, count(*) AS TotalHits_Found
, O.name
, (SELECT CEILING(COUNT(*) * 0.9) FROM dbo.SysObjects O4 WHERE O4.XType = O.XType) as [90%]
, (SELECT COUNT(*) FROM dbo.SysObjects O5 WHERE O5.XType = O.XType) AS [100%]
FROM dbo.SysObjects O
WHERE EXISTS (SELECT * FROM dbo.SysObjects O2 WHERE O2.XType = O.XType AND O2.id = O.id and O2.id IN (SELECT TOP 90 PERCENT id FROM dbo.SysObjects O3 WHERE O3.XType = O.XType ORDER BY O3.id))
--GROUP BY O.XType
ORDER BY O.XType
, O.Name
Viewing 15 posts - 1 through 15 (of 32 total)
You must be logged in to reply to this topic. Login to reply