top n percent WITH group by?

  • Is it possible to use Top N Percent with Group By? I have a table with approximately 50 types of datasets and I need to find the top 90% of EACH of those types (not top 90% of the combined total). Can SQL Server do this, or must I (gulp) run the query and store the results 50 times?

    Thanks,

    Heather

  • A simple bulldozer approach might be to select each type into a #TempTable and then UNION ALL into a final #TempTable for output. 

    Otherwise, it sounds to me like you are going to need to isolate each type of dataset (SELECT DISTINCT) into a #TempTable and then loop through each of those types to get your TOP 90%.  This may actually be a better option since if a new type is added, then this should handle that situation....

    I wasn't born stupid - I had to study.

  • Thanks, Farrell. I had a feeling that would be the answer!

  • Assuming that your key field is not the top 10 percent field, I think it would be something like below. I maybe overly complicating it...but this is how I would do it. There may be a more efficient way. I'm sure someone will let us know.

    SELECT groupfield1, groupfield2, .....
    FROM tablename
    WHERE keyfield IN (SELECT keyfield
                     FROM tablename
                     WHERE numfield 
                   IN (SELECT TOP 10 PERCENT numfield
                      FROM tablename))
    GROUP BY groupfield1, groupfield2, .....

    Again there may be a better way. I hope there is. And if your keyfield is what your trying to get then you should be able to get rid of the second subquery. Let us know.



    ----------------
    Jim P.

    A little bit of this and a little byte of that can cause bloatware.

  • Hmm, interesting. To make it more clear, what I have is an enormous table of records (over 8 million to be exact) that contain records on every state in the US. So for instance, I may have a 30,000 records for Virginia and 50,000 for Texas.

    I am doing a sort by a particular numerical value - let's say population, just as an example. So I need to find the top 90% of records for every state by population density.

    The problem is that when I try to get all the records using group by State, I get the top 90% of the total - not the top 90% within each state. Unfortunately, it really doesn't look like I can tell SQL to perform the 90% calculation with the group by results, so I suppose I'll have to loop through them one by one.

  • Don't give up.  I'm sure there is a way to do this with a set based query, I just can't get my brain wrapped around it just now (too close to going home time ).  I would imagine that someone will post a set based query, but if not, I'll look at it more tomorrow.

    Steve

  • Thanks, guys.

    The good news is, I only have to do this once.

  • You can't use GROUP BY unless all non grouped columns are agregate results. If you want the 90% rows actually individually reported you have to drop GROUP BY and go with just an ordered selection. Is this what you are after?

    SELECT A.State, A.Region, A.PopulationDensity

                FROM MyTable A

                WHERE A.Region IN ( SELECT TOP 90 PERCENT B.Region

                                                       FROM MyTable B

                                                       WHERE B.State = A.State

                                                       ORDER BY B.PopulationDensity )

               ORDER BY A.State, A.PopulationDensity

     

  • I know I can use a subquery to get the top 90% of records from a single state and then perform my state summary calculations from there. I was just hoping that I could get the top 90% of each state's records in a single dataset using a group by clause.

    What I may end up doing at this point is just outputting the results I need one by one into a new table (unfortunately, with milions of records, a temp table would be tough to use) and then query those new tables respectively.

    -heather

  • What about creating views (as your groups) and querying (UNION) the views?


    -Isaiah

  • That's also a possibility. Because I have 8 million records and I need to grab different slices of percentiles per state, I'm just going to create a few additional tables to hold the state information I need.

    For example:

    --- must run for each state

    SELECT TOP 10 PERCENT *

    INTO tblStateCensus_top10percent

    FROM tblStateCensus

    WHERE state = 'AL'

    ORDER BY population DESC

    SELECT TOP 5 PERCENT *

    INTO tblStateCensus_top5percent

    FROM tblStateCensus

    WHERE state = 'AL'

    ORDER BY population DESC

    --- run once after all states are run in above queries

    SELECT *

    INTO tblStateCensus_bottom90percent

    FROM tblStateCensus

    WHERE id not in (select id from tblStateCensus_top10percent)

    SELECT *

    INTO tblStateCensus_bottom95percent

    FROM tblStateCensus

    WHERE id not in (select id from tblStateCensus_top5percent)

    Using the 'not in' for the additional percentages to avoid overlap of records.

    I could write some code to loop through this, or I could just run the top half of that 50 or so times. Probably the latter will be faster.

    As I said, this is one-time data analysis for journalism purposes so the queries don't need to be optimized for performance.

    Thank you all for your help!

    Heather

  • I've had to do something like this before.  Using the derived table concept should simplify your life. 

    select top 90 percent

     from

     (select fld1,fld2 from MyTestTable group by fld1,fld2)

    In this example, do your grouping sql statement, make that the derived table, then select 90 percent of those results.

    Good Luck

    --Todd

  • Unfortunately I can't do that in this case, because I need the top percent to be done within the groupings - so I need the top 90% of records for alabama, and of arkansas, and so on. That gives me the top 90% of the total record number.

    It's okay, guys. I'm just outputting the records into a couple of different tables. I'll be finished by the time I figure out a better way to do it. Since it's for a journalism purpose, I don't have much time.

    Thanks for all of your help!

    Heather

  • Instead of tables I would do views. Less data space and once you create the first one in the QA such as:

    CREATE VIEW Top_90_AK

    AS

    SELECT TOP 90 PERCENT *

    FROM TableName

    WHERE STATE = 'AK'

    Then all you have to do is a search and replace for AK with AL and so on.



    ----------------
    Jim P.

    A little bit of this and a little byte of that can cause bloatware.

  • Here's somthing to play with (I wouldn't recommend it on a 8 M lines table though).

    SELECT O.XType

    --, count(*) AS TotalHits_Found

    , O.name

    , (SELECT CEILING(COUNT(*) * 0.9) FROM dbo.SysObjects O4 WHERE O4.XType = O.XType) as [90%]

    , (SELECT COUNT(*) FROM dbo.SysObjects O5 WHERE O5.XType = O.XType) AS [100%]

    FROM dbo.SysObjects O

    WHERE EXISTS (SELECT * FROM dbo.SysObjects O2 WHERE O2.XType = O.XType AND O2.id = O.id and O2.id IN (SELECT TOP 90 PERCENT id FROM dbo.SysObjects O3 WHERE O3.XType = O.XType ORDER BY O3.id))

    --GROUP BY O.XType

    ORDER BY O.XType

    , O.Name

Viewing 15 posts - 1 through 15 (of 32 total)

You must be logged in to reply to this topic. Login to reply