Comparing Raw table against look up table to find missing values

  • Hello Everyone,

    I have a raw table with Billions of records. I have a look up table with around 1000 records. The look up table lists the parent diseases and their corresponding child diseases. So one parent disease can have many corresponding child diseases.
    What I need to find out from my raw data is the list of child diseases which is not present for each corresponding parent disease as per the defined list by the look up table.
    Suppose Diabetes has 10 child disease and my raw table is listing only 4 of them. So I need to a get a list of the remaining 6 child diseases.
    I got my query done using 'innr join' and 'except' but given the billions of records in my raw table, my query takes a lot of time.

    Please help.

  • The query is going to be pretty "simple", going with an inner join is the right way to go. You're probably looking at a query something along the lines of:
    SELECT D.DiseaseName, cD.DiseaseName
    FROM Disease D
        JOIN Disease cD ON D.DiseaseID = cD.ParentDiseaseID
    WHERE cD.DiseaseID NOT IN (SELECT P.DiseaseID
                               FROM Patient P);

    Without DDL I've totally guessed your table design and Column names.

    My thought for the reason it's slow though could be your indexes. Have you got any? If so, where are they? For something like this, at a minimum you'll want an index on your Disease and ParentDisease IDs in your Disease table, but also on your Patient table. Could you provide a copy of your query plan? That'll make things easier for users to see where the main cost of your query is, and how/where it might be improved.

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • Thom A - Wednesday, May 17, 2017 2:07 AM

    The query is going to be pretty "simple", going with an inner join is the right way to go. You're probably looking at a query something along the lines of:
    SELECT D.DiseaseName, cD.DiseaseName
    FROM Disease D
        JOIN Disease cD ON D.DiseaseID = cD.ParentDiseaseID
    WHERE cD.DiseaseID NOT IN (SELECT P.DiseaseID
                               FROM Patient P);

    Without DDL I've totally guessed your table design and Column names.

    My thought for the reason it's slow though could be your indexes. Have you got any? If so, where are they? For something like this, at a minimum you'll want an index on your Disease and ParentDisease IDs in your Disease table, but also on your Patient table. Could you provide a copy of your query plan? That'll make things easier for users to see where the main cost of your query is, and how/where it might be improved.

    My first impression was that a LEFT OUTER JOIN was the right way to go.  Of course, DDL and sample data would go a long way to resolving this.

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply