what's the best practice to count distinct?

  • Good day,

    I have a task to count distinct records in a big table with roughly 30M records, performance is an important factor. Query is to be written to calculate weekly stats, weekly record number could be as high as 1M.

    The actual result is like:

    ID Policy

    350235744Credit Cards

    350235744PCI

    350235744PCI Audit

    So the final number for this particular Policy is 3

    I can write the query like:

    select count(distinct Incident_id) policy_name

    from Reporting_DailyDlpDetail

    Where (year(INSERT_DETECT_TS)=2015) and (month(INSERT_DETECT_TS) =6) and (day(INSERT_DETECT_TS) between 2 and 9)

    This returns 526254 and costs 11 seconds to complete

    or a query like:

    Select distinct Incident_id, policy_name

    from Reporting_DailyDlpDetail

    Where (year(INSERT_DETECT_TS)=2015) and (month(INSERT_DETECT_TS) =6) and (day(INSERT_DETECT_TS) between 2 and 9)

    This returns 749687 and costs roughly 1 minute to complete.

    Result is different from the two queries, I believe the later gives correct number.

    Questions here:

    How can I count the distinct based on a combo?

    Considering the size of data, what is the best and most efficient way to run the stats calculation against over 30 different scenarios (different policies and alert types) and not timeout?

    Any suggestion is welcome.

  • Edit: changed these comments:

    Most important is to change the WHERE to not use functions on the table column.

    You should also very strongly analyze clustering the table by INSERT_DETECT_TS (add Incident_Id to the clus key if you want), if it's not already.

    SELECT DISTINCT Incident_id, policy_name

    FROM Reporting_DailyDlpDetail

    WHERE INSERT_DETECT_TS >= '20150602' AND INSERT_DETECT_TS < '20150610'

    If you only need the total, you can add an outer query:

    SELECT COUNT(*) AS Distinct_Total

    FROM (

    SELECT DISTINCT Incident_id, policy_name

    FROM Reporting_DailyDlpDetail

    WHERE INSERT_DETECT_TS >= '20150602' AND INSERT_DETECT_TS < '20150610'

    ) AS derived

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • Thank you. Of the two queries, which returns the correct result?

  • halifaxdal (7/7/2015)


    Thank you. Of the two queries, which returns the correct result?

    You tell us.

    Try providing us with the DDL for the table(s), sample data for the table(s), and expected results based on the sample data you provide.

    We can't see what you see, and we can't answer a question that we don't know what it is you are trying to accomplish.

  • To add on to Lynn's comment: the two queries you posted aren't calculating the same thing. Let me give you a hint;; the FIRST query you posted counts the distinct number of INCIDENTS and calls the result policy name. The second one returns one row for each ..... (fill in the blanks and I think you will be able to ascertain which one is correct per your requirement).

    I'd venture to guess that scott's queries are consistent (i.e. the totals only queries returns the number of rows you'll get back in the detail query).

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • halifaxdal (7/7/2015)


    Thank you. Of the two queries, which returns the correct result?

    In some cases they give me same result, in some cases, different number. That's why I don't understand and ask here.

  • halifaxdal (7/7/2015)


    halifaxdal (7/7/2015)


    Thank you. Of the two queries, which returns the correct result?

    In some cases they give me same result, in some cases, different number. That's why I don't understand and ask here.

    You are the only one who can tell us which is correct. We can't see your data and you really haven't told us what the expected results should look like.

    You need to help us help you. DDL for the table(s) involved in the query, sample data for the table(s) involved in the query, and expected results based on the sample data provided.

  • Your first query was getting the count of distinct Incident_id, while your second query was getting distinct combinations of Incident_id and policy_name. Surely you can see those two are not the same. They will only return the same count if every Incident_id only has one policy_name associated with it.

    Now, which of the 2 cases are you looking for? I thought it was obvious you were looking for the second choice, but I'm not going to assume.



    Alvin Ramard
    Memphis PASS Chapter[/url]

    All my SSC forum answers come with a money back guarantee. If you didn't like the answer then I'll gladly refund what you paid for it.

    For best practices on asking questions, please read the following article: Forum Etiquette: How to post data/code on a forum to get the best help[/url]

  • You are the only one who can tell us which is correct. We can't see your data and you really haven't told us what the expected results should look like.

    You need to help us help you. DDL for the table(s) involved in the query, sample data for the table(s) involved in the query, and expected results based on the sample data provided.

    I cannot provide data here as that mostly likely violates our policy. :crying:

  • Alvin Ramard (7/7/2015)


    Your first query was getting the count of distinct Incident_id, while your second query was getting distinct combinations of Incident_id and policy_name. Surely you can see those two are not the same. They will only return the same count if every Incident_id only has one policy_name associated with it.

    Now, which of the 2 cases are you looking for? I thought it was obvious you were looking for the second choice, but I'm not going to assume.

    I am looking for the second scenarios so Scott's query works for it.

    Thanks

  • halifaxdal (7/7/2015)


    You are the only one who can tell us which is correct. We can't see your data and you really haven't told us what the expected results should look like.

    You need to help us help you. DDL for the table(s) involved in the query, sample data for the table(s) involved in the query, and expected results based on the sample data provided.

    I cannot provide data here as that mostly likely violates our policy. :crying:

    Sample data does not mean production data. You should be able to mock some data that is representative of your problem domain so that we can understand what you are trying to accomplish.

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply