July 7, 2015 at 9:03 am
Good day,
I have a task to count distinct records in a big table with roughly 30M records, performance is an important factor. Query is to be written to calculate weekly stats, weekly record number could be as high as 1M.
The actual result is like:
ID Policy
350235744Credit Cards
350235744PCI
350235744PCI Audit
So the final number for this particular Policy is 3
I can write the query like:
select count(distinct Incident_id) policy_name
from Reporting_DailyDlpDetail
Where (year(INSERT_DETECT_TS)=2015) and (month(INSERT_DETECT_TS) =6) and (day(INSERT_DETECT_TS) between 2 and 9)
This returns 526254 and costs 11 seconds to complete
or a query like:
Select distinct Incident_id, policy_name
from Reporting_DailyDlpDetail
Where (year(INSERT_DETECT_TS)=2015) and (month(INSERT_DETECT_TS) =6) and (day(INSERT_DETECT_TS) between 2 and 9)
This returns 749687 and costs roughly 1 minute to complete.
Result is different from the two queries, I believe the later gives correct number.
Questions here:
How can I count the distinct based on a combo?
Considering the size of data, what is the best and most efficient way to run the stats calculation against over 30 different scenarios (different policies and alert types) and not timeout?
Any suggestion is welcome.
July 7, 2015 at 9:36 am
Edit: changed these comments:
Most important is to change the WHERE to not use functions on the table column.
You should also very strongly analyze clustering the table by INSERT_DETECT_TS (add Incident_Id to the clus key if you want), if it's not already.
SELECT DISTINCT Incident_id, policy_name
FROM Reporting_DailyDlpDetail
WHERE INSERT_DETECT_TS >= '20150602' AND INSERT_DETECT_TS < '20150610'
If you only need the total, you can add an outer query:
SELECT COUNT(*) AS Distinct_Total
FROM (
SELECT DISTINCT Incident_id, policy_name
FROM Reporting_DailyDlpDetail
WHERE INSERT_DETECT_TS >= '20150602' AND INSERT_DETECT_TS < '20150610'
) AS derived
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
July 7, 2015 at 10:22 am
Thank you. Of the two queries, which returns the correct result?
July 7, 2015 at 11:02 am
halifaxdal (7/7/2015)
Thank you. Of the two queries, which returns the correct result?
You tell us.
Try providing us with the DDL for the table(s), sample data for the table(s), and expected results based on the sample data you provide.
We can't see what you see, and we can't answer a question that we don't know what it is you are trying to accomplish.
July 7, 2015 at 11:33 am
To add on to Lynn's comment: the two queries you posted aren't calculating the same thing. Let me give you a hint;; the FIRST query you posted counts the distinct number of INCIDENTS and calls the result policy name. The second one returns one row for each ..... (fill in the blanks and I think you will be able to ascertain which one is correct per your requirement).
I'd venture to guess that scott's queries are consistent (i.e. the totals only queries returns the number of rows you'll get back in the detail query).
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
July 7, 2015 at 11:42 am
halifaxdal (7/7/2015)
Thank you. Of the two queries, which returns the correct result?
In some cases they give me same result, in some cases, different number. That's why I don't understand and ask here.
July 7, 2015 at 11:45 am
halifaxdal (7/7/2015)
halifaxdal (7/7/2015)
Thank you. Of the two queries, which returns the correct result?In some cases they give me same result, in some cases, different number. That's why I don't understand and ask here.
You are the only one who can tell us which is correct. We can't see your data and you really haven't told us what the expected results should look like.
You need to help us help you. DDL for the table(s) involved in the query, sample data for the table(s) involved in the query, and expected results based on the sample data provided.
July 7, 2015 at 11:51 am
Your first query was getting the count of distinct Incident_id, while your second query was getting distinct combinations of Incident_id and policy_name. Surely you can see those two are not the same. They will only return the same count if every Incident_id only has one policy_name associated with it.
Now, which of the 2 cases are you looking for? I thought it was obvious you were looking for the second choice, but I'm not going to assume.
For best practices on asking questions, please read the following article: Forum Etiquette: How to post data/code on a forum to get the best help[/url]
July 7, 2015 at 11:53 am
You are the only one who can tell us which is correct. We can't see your data and you really haven't told us what the expected results should look like.
You need to help us help you. DDL for the table(s) involved in the query, sample data for the table(s) involved in the query, and expected results based on the sample data provided.
I cannot provide data here as that mostly likely violates our policy. :crying:
July 7, 2015 at 11:55 am
Alvin Ramard (7/7/2015)
Your first query was getting the count of distinct Incident_id, while your second query was getting distinct combinations of Incident_id and policy_name. Surely you can see those two are not the same. They will only return the same count if every Incident_id only has one policy_name associated with it.Now, which of the 2 cases are you looking for? I thought it was obvious you were looking for the second choice, but I'm not going to assume.
I am looking for the second scenarios so Scott's query works for it.
Thanks
July 7, 2015 at 12:21 pm
halifaxdal (7/7/2015)
You are the only one who can tell us which is correct. We can't see your data and you really haven't told us what the expected results should look like.
You need to help us help you. DDL for the table(s) involved in the query, sample data for the table(s) involved in the query, and expected results based on the sample data provided.
I cannot provide data here as that mostly likely violates our policy. :crying:
Sample data does not mean production data. You should be able to mock some data that is representative of your problem domain so that we can understand what you are trying to accomplish.
Viewing 11 posts - 1 through 10 (of 10 total)
You must be logged in to reply to this topic. Login to reply