Comparing Raw table against look up table to find missing values

Question

Comparing Raw table against look up table to find missing values

Prasune Verma

SSC-Addicted

Points: 453
More actions
May 16, 2017 at 11:56 pm

#402068

Hello Everyone,
I have a raw table with Billions of records. I have a look up table with around 1000 records. The look up table lists the parent diseases and their corresponding child diseases. So one parent disease can have many corresponding child diseases.
What I need to find out from my raw data is the list of child diseases which is not present for each corresponding parent disease as per the defined list by the look up table.
Suppose Diabetes has 10 child disease and my raw table is listing only 4 of them. So I need to a get a list of the remaining 6 child diseases.
I got my query done using 'innr join' and 'except' but given the billions of records in my raw table, my query takes a lot of time.
Please help.

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply

Thom A SSC Guru Points: 99127 More actions · Answer 1

The query is going to be pretty "simple", going with an inner join is the right way to go. You're probably looking at a query something along the lines of:
SELECT D.DiseaseName, cD.DiseaseName FROM Disease D JOIN Disease cD ON D.DiseaseID = cD.ParentDiseaseID WHERE cD.DiseaseID NOT IN (SELECT P.DiseaseID FROM Patient P);
Without DDL I've totally guessed your table design and Column names.

My thought for the reason it's slow though could be your indexes. Have you got any? If so, where are they? For something like this, at a minimum you'll want an index on your Disease and ParentDisease IDs in your Disease table, but also on your Patient table. Could you provide a copy of your query plan? That'll make things easier for users to see where the main cost of your query is, and how/where it might be improved.

Thom~

Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk

drew.allen SSC Guru Points: 76960 More actions · Answer 2

Thom A - Wednesday, May 17, 2017 2:07 AM
The query is going to be pretty "simple", going with an inner join is the right way to go. You're probably looking at a query something along the lines of:
SELECT D.DiseaseName, cD.DiseaseName FROM Disease D JOIN Disease cD ON D.DiseaseID = cD.ParentDiseaseID WHERE cD.DiseaseID NOT IN (SELECT P.DiseaseID FROM Patient P);
Without DDL I've totally guessed your table design and Column names.
My thought for the reason it's slow though could be your indexes. Have you got any? If so, where are they? For something like this, at a minimum you'll want an index on your Disease and ParentDisease IDs in your Disease table, but also on your Patient table. Could you provide a copy of your query plan? That'll make things easier for users to see where the main cost of your query is, and how/where it might be improved.

My first impression was that a LEFT OUTER JOIN was the right way to go. Of course, DDL and sample data would go a long way to resolving this.

Drew

J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA