Frequency Counts Huge Table

  • Hey guys,

    I have a table with about 80 columns and 400 millions records. Each columns has different responses that I need to get frequency for. I need to get counts for each response from all the columns... I have a query that does it, but it will run forever... what is the best way to do so?

    My starting query:

    select res, sum(cnt) from

    (

    select col1 res, count(*) as cnt from table1 with (nolock)

    group by col1

    union all

    select col2 res, count(*) as cnt from table1 with (nolock)

    group by col2

    ........................

    select col80 res, count(*) as cnt from table1 with (nolock)

    group by col80

    )a group by res

  • You might try breaking it up into smaller pieces.

    For example, create a temp table, and insert the results for column 1 into that. Then as a separate query, insert the results for column 2. And so on, through each of the columns. It's a bit heavy handed, but it might break up the job enough to give the server a chance to do it.

    Indexing the columns won't help much, since you need to count results from each one, and adding 80 indexes to a table will often just confuse the query optimizer. You might try it anyway, just to see if it works. You might be able to create a set of indexed views, one view for each column, and query from those instead of directly from the table, but that's a little bit of a stretch. Either of these will slow down inserts and updates and such, probably by a large margin.

    How often do you have to pull this data? How heavy is the update/delete/insert traffic on this table?

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Another thing you might try, if there's a lot of data being added daily, but if the older data doesn't change much, is set up a nightly batch to dump a summary of the prior day's data into a summary table. Then all you have to do is query the data so far today, and add that to the data from the prior night.

    That won't help much if the older data changes frequently, or if there's no datetime type field in the data indicating the time it was created or last updated. But, if there is such a field, and if the older data is pretty static, it might be a good solution.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Thanks for the response... this table is pretty busy but can afford to run the query for couple of hours... I need to do this as a one time job for now.

  • Ghanta:

    what is the data type and range of the response columns?

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • dataType is char(1) and the values can be a-z, 0, 1, blank, nulls. Thanks.

  • If the number of unique combinations of values is not large, you might start by creating an intermediate table with the rowcount for each unique combination of values across a range of columns. This may be faster, depending on the distribution of values, or it might not help at all or make it worse.

    -- Reduce to unique row combinations

    Insert into CountTable

    select

    col1 as x1,

    col2 as x2,

    col3 as x3,

    ...

    col10 as x10,

    RowCount = count(*)

    from

    MyTable

    group by

    col1,

    col2,

    col3,

    ...

    col10

    union all

    select

    col11 as x1,

    col12 as x2,

    col13 as x3,

    ...

    col20 as x10,

    RowCount = count(*)

    from

    MyTable

    group by

    col11,

    col12,

    col13,

    ...

    col20

    union all

    ... for rest of columns 21 to 80...

    -- Then union to sum rowcounts

    select

    x1 res,

    sum(RowCount) as cnt

    from

    CountTable

    group by

    x1

    union all

    select

    x2 res,

    sum(RowCount) as cnt

    from

    CountTable

    group by

    x2

    union all

    ...

    ...

    select

    x10 res,

    sum(RowCount) as cnt

    from

    CountTable

    group by

    x10

  • Michael Valentine Jones (3/12/2008)


    If the number of unique combinations of values is not large, you might start by creating an intermediate table with the rowcount for each unique combination of values across a range of columns. This may be faster, depending on the distribution of values, or it might not help at all or make it worse.

    This is the same idea that I have been working on, but I think that 5 columns is probably the safe limit and I would want to record them into an integer for better aggregate indexing. You will need a temp table to hold the intermediate stats and then re-collate that table as well

    Here is that solution, except that I only included the first 10 columns (2 groups of 5), you will have to add the rest yourself (not to hard with cut and paste):

    Select Res, sum(Cnt) as Cnt, ColA, ColB, ColC, ColD, ColE INTO #Stats

    FROM (

    Select

    CASE When Col1 between 'a' and 'z' then ascii(lower(Col1)) - 95-- lowercaser letter -> 2-27

    When Col1='0' or Col1='1' then ascii(Col1) - 48-- 0-1

    When Col1=' ' then 28

    When Col1 IS NULL then 29END

    +CASE When Col2 between 'a' and 'z' then ascii(lower(Col2)) - 95-- lowercaser letter -> 2-27

    When Col2='0' or Col2='1' then ascii(Col2) - 48-- 0-1

    When Col2=' ' then 28

    When Col2 IS NULL then 29END * 30

    +CASE When Col3 between 'a' and 'z' then ascii(lower(Col3)) - 95-- lowercaser letter -> 2-27

    When Col3='0' or Col3='1' then ascii(Col3) - 48-- 0-1

    When Col3=' ' then 28

    When Col3 IS NULL then 29END * 900

    +CASE When Col4 between 'a' and 'z' then ascii(lower(Col4)) - 95-- lowercaser letter -> 2-27

    When Col4='0' or Col4='1' then ascii(Col4) - 48-- 0-1

    When Col4=' ' then 28

    When Col4 IS NULL then 29END * 27000

    +CASE When Col5 between 'a' and 'z' then ascii(lower(Col5)) - 95-- lowercaser letter -> 2-27

    When Col5='0' or Col5='1' then ascii(Col5) - 48-- 0-1

    When Col5=' ' then 28

    When Col5 IS NULL then 29END * 810000

    AS res, count_Big(*) as Cnt

    , Min(Col1) as ColA, Min(Col2) as ColB, Min(Col3) as ColC, Min(Col4) as ColD, Min(Col5) as ColE

    From table1 with (nolock)

    Group by

    CASE When Col1 between 'a' and 'z' then ascii(lower(Col1)) - 95-- lowercaser letter -> 2-27

    When Col1='0' or Col1='1' then ascii(Col1) - 48-- 0-1

    When Col1=' ' then 28

    When Col1 IS NULL then 29END

    +CASE When Col2 between 'a' and 'z' then ascii(lower(Col2)) - 95-- lowercaser letter -> 2-27

    When Col2='0' or Col2='1' then ascii(Col2) - 48-- 0-1

    When Col2=' ' then 28

    When Col2 IS NULL then 29END * 30

    +CASE When Col3 between 'a' and 'z' then ascii(lower(Col3)) - 95-- lowercaser letter -> 2-27

    When Col3='0' or Col3='1' then ascii(Col3) - 48-- 0-1

    When Col3=' ' then 28

    When Col3 IS NULL then 29END * 900

    +CASE When Col4 between 'a' and 'z' then ascii(lower(Col4)) - 95-- lowercaser letter -> 2-27

    When Col4='0' or Col4='1' then ascii(Col4) - 48-- 0-1

    When Col4=' ' then 28

    When Col4 IS NULL then 29END * 27000

    +CASE When Col5 between 'a' and 'z' then ascii(lower(Col5)) - 95-- lowercaser letter -> 2-27

    When Col5='0' or Col5='1' then ascii(Col5) - 48-- 0-1

    When Col5=' ' then 28

    When Col5 IS NULL then 29END * 810000

    UNION ALL

    Select

    CASE When Col6 between 'a' and 'z' then ascii(lower(Col6)) - 95-- lowercaser letter -> 2-27

    When Col6='0' or Col6='1' then ascii(Col6) - 48-- 0-1

    When Col6=' ' then 28

    When Col6 IS NULL then 29END

    +CASE When Col7 between 'a' and 'z' then ascii(lower(Col7)) - 95-- lowercaser letter -> 2-27

    When Col7='0' or Col7='1' then ascii(Col7) - 48-- 0-1

    When Col7=' ' then 28

    When Col7 IS NULL then 29END * 30

    +CASE When Col8 between 'a' and 'z' then ascii(lower(Col8)) - 95-- lowercaser letter -> 2-27

    When Col8='0' or Col8='1' then ascii(Col8) - 48-- 0-1

    When Col8=' ' then 28

    When Col8 IS NULL then 29END * 900

    +CASE When Col9 between 'a' and 'z' then ascii(lower(Col9)) - 95-- lowercaser letter -> 2-27

    When Col9='0' or Col9='1' then ascii(Col9) - 48-- 0-1

    When Col9=' ' then 28

    When Col9 IS NULL then 29END * 27000

    +CASE When Col10 between 'a' and 'z' then ascii(lower(Col10)) - 95-- lowercaser letter -> 2-27

    When Col10='0' or Col10='1' then ascii(Col10) - 48-- 0-1

    When Col10=' ' then 28

    When Col10 IS NULL then 29END * 810000

    AS res, count_Big(*) as Cnt

    , Min(Col6) as ColA, Min(Col7) as ColB, Min(Col8) as ColC, Min(Col9) as ColD, Min(Col10) as ColE

    From table1 with (nolock)

    Group by

    CASE When Col6 between 'a' and 'z' then ascii(lower(Col6)) - 95-- lowercaser letter -> 2-27

    When Col6='0' or Col6='1' then ascii(Col6) - 48-- 0-1

    When Col6=' ' then 28

    When Col6 IS NULL then 29END

    +CASE When Col7 between 'a' and 'z' then ascii(lower(Col7)) - 95-- lowercaser letter -> 2-27

    When Col7='0' or Col7='1' then ascii(Col7) - 48-- 0-1

    When Col7=' ' then 28

    When Col7 IS NULL then 29END * 30

    +CASE When Col8 between 'a' and 'z' then ascii(lower(Col8)) - 95-- lowercaser letter -> 2-27

    When Col8='0' or Col8='1' then ascii(Col8) - 48-- 0-1

    When Col8=' ' then 28

    When Col8 IS NULL then 29END * 900

    +CASE When Col9 between 'a' and 'z' then ascii(lower(Col9)) - 95-- lowercaser letter -> 2-27

    When Col9='0' or Col9='1' then ascii(Col9) - 48-- 0-1

    When Col9=' ' then 28

    When Col9 IS NULL then 29END * 27000

    +CASE When Col10 between 'a' and 'z' then ascii(lower(Col10)) - 95-- lowercaser letter -> 2-27

    When Col10='0' or Col10='1' then ascii(Col10) - 48-- 0-1

    When Col10=' ' then 28

    When Col10 IS NULL then 29END * 810000

    UNION ALL

    ... --(fill out to all 40 col's; 8 Selects of 5 cols each)

    )

    Group by Res

    -- now separate & recollate the 5 columns

    Select Res, sum(Cnt) From (

    Select ColA as Res, sum(Cnt) as Cnt from #Stats with (nolock) Group by ColA

    UNION ALL Select ColB as Res, sum(Cnt) as Cnt from #Stats with (nolock) Group by ColB

    UNION ALL Select ColC as Res, sum(Cnt) as Cnt from #Stats with (nolock) Group by ColC

    UNION ALL Select ColD as Res, sum(Cnt) as Cnt from #Stats with (nolock) Group by ColD

    UNION ALL Select ColE as Res, sum(Cnt) as Cnt from #Stats with (nolock) Group by ColE

    )

    Group by Res

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply