July 23, 2008 at 10:18 am
I have three tables (source) with same col's. Each table have approx. one billion rows.
Table A (id, col1) [col1 is unique in Table A]
Table B (id, col1) [col1 is unique in Table B]
Table C (id, col1) [col1 is unique in Table C]
The data in col1 is unique for that respective table but could be duplicated across Table A, in Table B, in Table C
I want to combine these three tables in a single / destination table,Table D (id, col1) [col1 is unique in Table D].
In Table D, col1 will contain unique values from Table A (col1), Table B (col1), Table C (col1).
How i am doing right is getting 5000 records from Table A, Table B, and Table C in a temp table, selecting distinct against that temp table and inserting the records into Table D by checking if the record does not exist in Table D.
I have tried various combination of numbers to select from the three tables (source).
The problem is that this approach works fine until the time Table D has few number of records (say around 50 million), but when the records inside table Table D grows, then it takes lots of time.
I was wondering if there is more efficient approach to come over this problem.
Any help appreciated.
July 23, 2008 at 10:28 am
I don't have 3 billion record tables to test with but...
Wouldn't something like this work?
INSERT INTO TABLED
SELECT
main.col1
FROM
(
SELECT col1 FROM TABLEA
UNION
SELECT col1 FROM TABLEB
UNION
SELECT col1 FROM TABLEC
) AS main
GROUP BY main.col1
July 23, 2008 at 10:46 am
Thats something what I am doing, and I am doing it in some definite number of sets from the three tables, but the problem is that large set of data.
The place where it takes time is checking whether the data already exists in the table or not.
July 23, 2008 at 11:07 am
Actually
since from BOL I found
By default, the UNION operator removes duplicate rows from the result set. If you use ALL, all rows are included in the results and duplicates are not removed.
Why not just
INSERT INTO TABLED
SELECT col1 FROM TABLEA
UNION
SELECT col1 FROM TABLEB
UNION
SELECT col1 FROM TABLEC
3 Billion records is going to take time no matter how you cut it... once you get the data in there, you might want to add a trigger to table's A,B,C to insert to Table D as records are added...
July 23, 2008 at 11:40 am
Having an index on col1 in Table D might help to an extent. You might already be doing that...if so ignore this message
July 23, 2008 at 11:57 am
The "UNION" solution will work but will require a temporary table whose size will be the number of rows from all three tables. Here are 3 SQL Statements that can be used instead that will require less tempdb space. Best performance will need a unique index on col1 in Tables B and C and that there are no indicies on TableD.
INSERT INTO TableD (col1)
SELECT col1 FROM TableA;
go
-- add rows from B not in A
INSERT INTO TableD (col1)
SELECT col1 FROM TableB
where not exists
(select 1
from TableA
where TableA.col1 = TableB.col1
)
go
-- add rows from C not in A and not in B
INSERT INTO TableD (col1)
SELECT col1 FROM TableC
where not exists
(select 1
from TableA
where TableA.col1 = TableC.col1
)
AND not exists
(select 1
from TableB
where TableB.col1 = TableC.col1
)
SQL = Scarcely Qualifies as a Language
Viewing 6 posts - 1 through 5 (of 5 total)
You must be logged in to reply to this topic. Login to reply