Unique Data Generalization

  • I have three source tables with the same columns. Each table has approximately one billion rows.

    Table A (id, col1) [col1 is unique in Table A]

    Table B (id, col1) [col1 is unique in Table B]

    Table C (id, col1) [col1 is unique in Table C]

    The data in col1 is unique within its own table, but the same value may appear in more than one of Table A, Table B, and Table C.

    I want to combine these three tables into a single destination table, Table D (id, col1) [col1 is unique in Table D].

    In Table D, col1 will contain the unique values from Table A (col1), Table B (col1), and Table C (col1).

    What I am doing right now is pulling 5000 records each from Table A, Table B, and Table C into a temp table, selecting distinct values from that temp table, and inserting them into Table D after checking that each record does not already exist in Table D.

    I have tried various batch sizes when selecting from the three source tables.

    The problem is that this approach works fine while Table D holds relatively few records (say around 50 million), but as the number of records in Table D grows, it takes a long time.

    I was wondering if there is a more efficient approach to this problem.

    Any help appreciated.
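    For reference, the batching approach described above can be sketched roughly like this (a simplified sketch only, using the table names above; real code would also need to track which source rows have already been processed):

    -- One batch: stage rows from the sources, then insert only the new ones.
    SELECT TOP (5000) col1 INTO #tmp FROM TableA;  -- plus similar batches from TableB and TableC

    INSERT INTO TableD (col1)
    SELECT DISTINCT t.col1
    FROM #tmp AS t
    WHERE NOT EXISTS (SELECT 1 FROM TableD AS d WHERE d.col1 = t.col1);

    The NOT EXISTS probe against Table D is the part that slows down as Table D grows, which is the bottleneck discussed below.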

  • I don't have 3 billion record tables to test with but...

    Wouldn't something like this work?

    INSERT INTO TABLED
    SELECT main.col1
    FROM
    (
        SELECT col1 FROM TABLEA
        UNION
        SELECT col1 FROM TABLEB
        UNION
        SELECT col1 FROM TABLEC
    ) AS main
    GROUP BY main.col1



    --Mark Tassin
    MCITP - SQL Server DBA
    Proud member of the Anti-RBAR alliance.

  • That's close to what I am doing, and I am doing it in fixed-size batches from the three tables, but the problem is the sheer volume of data.

    The place where it takes time is checking whether the data already exists in Table D or not.

    Actually, since Books Online (BOL) says:

    "By default, the UNION operator removes duplicate rows from the result set. If you use ALL, all rows are included in the results and duplicates are not removed."

    Why not just:

    INSERT INTO TABLED
    SELECT col1 FROM TABLEA
    UNION
    SELECT col1 FROM TABLEB
    UNION
    SELECT col1 FROM TABLEC

    Three billion records are going to take time no matter how you cut it... once you get the data in there, you might want to add triggers to tables A, B, and C to insert into Table D as records are added...
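    A minimal sketch of one such trigger, assuming the table and column names above (the trigger name is made up for illustration, and concurrent inserts would still need a unique constraint on TableD.col1 as a backstop):

    -- Hypothetical AFTER INSERT trigger on TableA;
    -- repeat the same pattern for TableB and TableC.
    CREATE TRIGGER trgA_InsertToD ON TableA
    AFTER INSERT
    AS
    BEGIN
        SET NOCOUNT ON;
        INSERT INTO TableD (col1)
        SELECT i.col1
        FROM inserted AS i
        WHERE NOT EXISTS (SELECT 1 FROM TableD AS d WHERE d.col1 = i.col1);
    END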



    --Mark Tassin
    MCITP - SQL Server DBA
    Proud member of the Anti-RBAR alliance.

  • Having an index on col1 in Table D might help to an extent. You might already be doing that... if so, ignore this message.
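    A sketch of such an index, assuming the schema above. Since col1 is meant to be unique in Table D anyway, a unique index with SQL Server's IGNORE_DUP_KEY option would also let the engine discard duplicate inserts silently, instead of each batch doing its own existence check:

    -- Unique index so existence checks on TableD.col1 become index seeks.
    -- IGNORE_DUP_KEY = ON turns a duplicate INSERT into a warning, not an error.
    CREATE UNIQUE NONCLUSTERED INDEX IX_TableD_col1
        ON TableD (col1)
        WITH (IGNORE_DUP_KEY = ON);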

  • The "UNION" solution will work, but it requires a worktable whose size is the total number of rows from all three tables. Here are 3 SQL statements that can be used instead and that require less tempdb space. For best performance you will want a unique index on col1 in Tables A and B (the tables the NOT EXISTS checks probe), and no indexes on Table D.

    INSERT INTO TableD (col1)
    SELECT col1 FROM TableA;
    go

    -- add rows from B not in A
    INSERT INTO TableD (col1)
    SELECT col1 FROM TableB
    WHERE NOT EXISTS
        (SELECT 1
         FROM TableA
         WHERE TableA.col1 = TableB.col1);
    go

    -- add rows from C not in A and not in B
    INSERT INTO TableD (col1)
    SELECT col1 FROM TableC
    WHERE NOT EXISTS
        (SELECT 1
         FROM TableA
         WHERE TableA.col1 = TableC.col1)
    AND NOT EXISTS
        (SELECT 1
         FROM TableB
         WHERE TableB.col1 = TableC.col1);

    SQL = Scarcely Qualifies as a Language
