Need a query to select aggregate data by buckets

  • I have a table which has a unique integer column. It's not contiguous, it has some gaps.

    I need to find a count of records within amaginary buckets, for example [1-10],[11-20],[21-30].....

    Thanks

  • SQL Guy 1 (3/24/2014)


    I have a table which has a unique integer column. It's not contiguous, it has some gaps.

    I need to find a count of records within amaginary buckets, for example [1-10],[11-20],[21-30].....

    Thanks

    Would help if you provided more details on what you are trying to accomplish. Just based on the info provided I know there isn't much I can do to help.

  • That was rather simplified example. In reality I have a table with 150 mln records, and it is partitioned by integer column (it is some ID and clustered primary key) with buckets of 10 mln each. It also has some modify date. I need to find out min and max of that modify date by buckets of that 10 mln ID's. For example I need to find modifydate where ID in (1, 1000000). Next for ID in (1000001, 2000000), and so on. However, some ID's are skipped. so there is no record with ID=2000000. In such a case I need to find the closest one, like 1998412.

    I can write a query with multiple unions, but want to find a neater solution.

  • SQL Guy 1 (3/25/2014)


    That was rather simplified example. In reality I have a table with 150 mln records, and it is partitioned by integer column (it is some ID and clustered primary key) with buckets of 10 mln each. It also has some modify date. I need to find out min and max of that modify date by buckets of that 10 mln ID's. For example I need to find modifydate where ID in (1, 1000000). Next for ID in (1000001, 2000000), and so on. However, some ID's are skipped. so there is no record with ID=2000000. In such a case I need to find the closest one, like 1998412.

    I can write a query with multiple unions, but want to find a neater solution.

    If "it is partitioned by integer column", can't you simply aggregate by that column and grab the min and max date?

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • SQL Guy 1 (3/25/2014)


    That was rather simplified example. In reality I have a table with 150 mln records, and it is partitioned by integer column (it is some ID and clustered primary key) with buckets of 10 mln each. It also has some modify date. I need to find out min and max of that modify date by buckets of that 10 mln ID's. For example I need to find modifydate where ID in (1, 1000000). Next for ID in (1000001, 2000000), and so on. However, some ID's are skipped. so there is no record with ID=2000000. In such a case I need to find the closest one, like 1998412.

    I can write a query with multiple unions, but want to find a neater solution.

    Please take the time to read the following article and follow its instructions regarding what you should post and how to do it.

    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    Based on what you have posted so far, I really have no idea where to start. I can't see what you see. You need to help us help you.

  • SQL Guy 1 (3/24/2014)


    I have a table which has a unique integer column. It's not contiguous, it has some gaps.

    I need to find a count of records within amaginary buckets, for example [1-10],[11-20],[21-30].....

    Thanks

    I have two examples, first is where the 'bucket' size is constant.

    -- Count with a bucket size of a 1000

    -- CCCCCC is an integer column in the table TTTTTT

    -- DDDDDD can be any field for example it can be a datetime field.

    --

    select COUNT(*) as Hello, MIN(DDDDDD), MAX(DDDDDD)

    CCCCCC/1000 -- Line 1

    from TTTTTT

    GROUP BY

    CCCCCC/1000 -- Line 2

    ORDER BY

    CCCCCC/1000 -- Line 3

    -- The three marked lines each contain the same code.

    -- If one is replaced the others should be replaced with the same construction.

    --

    Example 2,

    Here I use a logaritmic scale, which is used on two times, the time difference is of importance. The buckets are logaritmic in scale for the time difference.

    --

    -- Difference in seconds between two events.

    -- Shown in bucket of 'ever' increasing sizes.

    --

    select top 30 count(*),

    round(4*log10((convert(float, nextevent)-convert(float, event))*24*60*60+.01),0)

    from TTTTTT

    group by

    round(4*log10((convert(float, nextevent)-convert(float, event))*24*60*60+.01),0)

    order by

    round(4*log10((convert(float, nextevent)-convert(float, event))*24*60*60+.01),0)

    The last example has a prevention against zero's and that is adding a .01.

    Now a logaritmic scale of difference is shown and the number of events in within this.

    Both examples show a line repeating three times. You can construct all sorts of 'grouping' replacing these three lines.

    It could be substring(CCCCCC, 17, 3)

    It could be replace(CCCCCC, 'x', 'y')

    etc.

    Please report if this is an anwser in the right direction.

    For the logaritm scale you 'probably' need another 'field' which marks for example the max of each bucket. (Or the min or both). This is left to your own fantasy.

    Ben

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply