Unique Data Generalization

  • I have three source tables with the same columns. Each table has approximately one billion rows.

    Table A (id, col1) [col1 is unique in Table A]

    Table B (id, col1) [col1 is unique in Table B]

    Table C (id, col1) [col1 is unique in Table C]

    The data in col1 is unique within its own table, but the same value may appear in more than one of Table A, Table B, and Table C.

    I want to combine these three tables into a single destination table, Table D (id, col1) [col1 is unique in Table D].

    In Table D, col1 will contain the unique values from Table A (col1), Table B (col1), and Table C (col1).

    What I am doing right now is pulling 5000 records each from Table A, Table B, and Table C into a temp table, selecting distinct values from that temp table, and inserting them into Table D after checking that each record does not already exist in Table D.

    I have tried various batch sizes when selecting from the three source tables.

    The problem is that this approach works fine while Table D holds relatively few records (say around 50 million), but as the number of records in Table D grows, it takes a long time.

    I was wondering if there is a more efficient approach to this problem.

    Any help appreciated.
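    For reference, the batching approach described above can be sketched roughly like this (a simplified sketch only, using the table names above; real code would also need to track which source rows have already been processed):

    -- One batch: stage rows from the sources, then insert only the new ones.
    SELECT TOP (5000) col1 INTO #tmp FROM TableA;  -- plus similar batches from TableB and TableC

    INSERT INTO TableD (col1)
    SELECT DISTINCT t.col1
    FROM #tmp AS t
    WHERE NOT EXISTS (SELECT 1 FROM TableD AS d WHERE d.col1 = t.col1);

    The NOT EXISTS probe against Table D is the part that slows down as Table D grows, which is the bottleneck discussed below.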

  • I don't have 3 billion record tables to test with but...

    Wouldn't something like this work?

    INSERT INTO TABLED
    SELECT main.col1
    FROM
    (
        SELECT col1 FROM TABLEA
        UNION
        SELECT col1 FROM TABLEB
        UNION
        SELECT col1 FROM TABLEC
    ) AS main
    GROUP BY main.col1



    --Mark Tassin
    MCITP - SQL Server DBA
    Proud member of the Anti-RBAR alliance.

  • That's close to what I am doing, and I am doing it in fixed-size batches from the three tables, but the problem is the sheer volume of data.

    The place where it takes time is checking whether the data already exists in Table D or not.

    Actually, since Books Online (BOL) says:

    "By default, the UNION operator removes duplicate rows from the result set. If you use ALL, all rows are included in the results and duplicates are not removed."

    Why not just:

    INSERT INTO TABLED
    SELECT col1 FROM TABLEA
    UNION
    SELECT col1 FROM TABLEB
    UNION
    SELECT col1 FROM TABLEC

    Three billion records are going to take time no matter how you cut it... once you get the data in there, you might want to add triggers to tables A, B, and C to insert into Table D as records are added...
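    A minimal sketch of one such trigger, assuming the table and column names above (the trigger name is made up for illustration, and concurrent inserts would still need a unique constraint on TableD.col1 as a backstop):

    -- Hypothetical AFTER INSERT trigger on TableA;
    -- repeat the same pattern for TableB and TableC.
    CREATE TRIGGER trgA_InsertToD ON TableA
    AFTER INSERT
    AS
    BEGIN
        SET NOCOUNT ON;
        INSERT INTO TableD (col1)
        SELECT i.col1
        FROM inserted AS i
        WHERE NOT EXISTS (SELECT 1 FROM TableD AS d WHERE d.col1 = i.col1);
    END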



    --Mark Tassin
    MCITP - SQL Server DBA
    Proud member of the Anti-RBAR alliance.

  • Having an index on col1 in Table D might help to an extent. You might already be doing that... if so, ignore this message.
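    A sketch of such an index, assuming the schema above. Since col1 is meant to be unique in Table D anyway, a unique index with SQL Server's IGNORE_DUP_KEY option would also let the engine discard duplicate inserts silently, instead of each batch doing its own existence check:

    -- Unique index so existence checks on TableD.col1 become index seeks.
    -- IGNORE_DUP_KEY = ON turns a duplicate INSERT into a warning, not an error.
    CREATE UNIQUE NONCLUSTERED INDEX IX_TableD_col1
        ON TableD (col1)
        WITH (IGNORE_DUP_KEY = ON);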

  • The "UNION" solution will work, but it requires a worktable whose size is the total number of rows from all three tables. Here are 3 SQL statements that can be used instead and that require less tempdb space. For best performance you will want a unique index on col1 in Tables A and B (the tables the NOT EXISTS checks probe), and no indexes on Table D.

    INSERT INTO TableD (col1)
    SELECT col1 FROM TableA;
    go

    -- add rows from B not in A
    INSERT INTO TableD (col1)
    SELECT col1 FROM TableB
    WHERE NOT EXISTS
        (SELECT 1
         FROM TableA
         WHERE TableA.col1 = TableB.col1);
    go

    -- add rows from C not in A and not in B
    INSERT INTO TableD (col1)
    SELECT col1 FROM TableC
    WHERE NOT EXISTS
        (SELECT 1
         FROM TableA
         WHERE TableA.col1 = TableC.col1)
    AND NOT EXISTS
        (SELECT 1
         FROM TableB
         WHERE TableB.col1 = TableC.col1);

    SQL = Scarcely Qualifies as a Language
