Cross Referencing Tables

  • Hi everyone,

    I am hoping that someone can help me with a query based on the below:

    I have two tables. One contains a list of potential duplicate records, grouped by a column, id_master

    The second is a list of pairs of records that should not be merged as duplicates.

    I am looking to delete that pairs of records in Table_1 where they match a pair in Table_2.

    Here are some sample records:

    Create table #duplicates

    (id int,

    First nvarchar (50),

    Last nvarchar (50),

    Person_id int,

    Id_master int)

    Create table #blacklist

    (id_1 int,

    Id_2 int)

    insert into #duplicates (id, first, last, person_id, id_master)

    select 1, 'Mark', 'Wonder', 10223, 7

    union all

    select 2, 'Simon', 'Waterhouse', 15226, 4

    union all

    select 3, 'Vanessa', 'Anderson',21321, 9

    union all

    select 4, 'Simon','Waterhouse', 45621, 4

    union all

    select 5, 'Raymond', 'Rooney', 63325, 10

    union all

    select 6, 'Vanessa', 'Anderson', 54123, 9

    union all

    select 7, 'Mark', 'Wonder', 15668, 7

    union all

    select 8, 'Layla', 'Danana', 42213, 11

    union all

    select 9, 'Vanessa', 'Anderson',75521, 9

    union all

    select 10, 'Raymond', 'Rooney', 63325, 10

    union all

    select 11, 'Layla', 'Danana', 96632, 11

    insert into #blacklist (id_1, id_2)

    select 21321, 75521

    union all

    select 96632, 42213

    union all

    select 15526, 63325

    select * from #duplicates

    select * from #blacklist

    From the above I would like to delete the following ids from the duplicates table:

    3 & 9, 8 & 11

    I would not expect 2 & 10 to be deleted because they don't share the same id_master, meaning that they are not a part of a group in the duplicates table

    I have puzzled over this and all I have managed to do so far is remove records if both pairs in the blacklist table EXIST in the duplicates table. I am not able to verify that they are both in a group of duplicates.

    I hope someone can educate and assist!

    Thanks,

    Paul

  • Two different solutions.

    -- query 1

    SELECT o.id, o.first, o.last, o.person_id, o.id_master

    FROM (

    SELECT d.id, d.first, d.last, d.person_id, d.id_master,

    x = COUNT(*) OVER (PARTITION BY d.person_id)

    FROM #duplicates d

    INNER JOIN #blacklist b

    ON b.id_1 = d.Person_id OR b.id_2 = d.Person_id

    ) o

    WHERE x = 1

    -- query 2

    SELECT a.*

    FROM #duplicates a

    INNER JOIN (

    SELECT id1 = d1.id, id2 = d2.id

    FROM #blacklist b

    LEFT JOIN #duplicates d1 ON d1.Person_id = b.id_1

    LEFT JOIN #duplicates d2 ON d2.Person_id = b.id_2

    WHERE d1.Person_id <> d2.Person_id

    ) d ON d.id1 = a.id OR d.id2 = a.id

    The results of either could be used in a DELETE FROM.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Thanks very much for your assistance, Chris. That works great.

    I prefer to use query 2, as I can understand it. I am not too sure on how partitioning works.

    I appreciate your help.

    Paul

  • Paul_Harvey (6/16/2010)


    Thanks very much for your assistance, Chris. That works great.

    I prefer to use query 2, as I can understand it. I am not too sure on how partitioning works.

    I appreciate your help.

    Paul

    You're welcome, Paul, thanks for the feedback.

    PARTITION BY is explained pretty well in BOL. A really cool use is

    COUNT(columnname or *) OVER (PARTITION BY columnname), which gives you the number of rows in a partition (a range of equal values in a column) without rolling them up - the same value appears in each row of a partition. SUM(columnname) works too.

    SELECT PartitionElements = COUNT(*) OVER (PARTITION BY [last]),

    UselessValue = SUM(id) OVER (PARTITION BY [last]),

    id, [first], [last], person_id, id_master

    FROM #duplicates

    ORDER BY [Last], [first]

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • I'll check it out.

    Thanks again.

    🙂

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply