February 26, 2003 at 9:36 pm
As a newbie SQL programmer I have been asked to add 7M records to an existing table without disturbing the existing records in the event of duplicate data. As a long-time programmer with non-record-set databases I need some guidance on how to accomplish this. Currently I have the data in two tables: A (original) and B (new data, some of it matching A). My idea is to delete any matches from table B and then add table B's records to table A. How can this be done via SQL?
I would also appreciate any suggestions on books or materials to rapidly ramp up on "real world" SQL joins and set-logic manipulations for solving problems. Thanks!
February 27, 2003 at 3:23 am
For this type of operation I use
INSERT INTO TableA
SELECT b.*
FROM TableB b
LEFT OUTER JOIN TableA a ON a.key = b.key
WHERE a.key IS NULL
But for 7M rows it could take a while, and performance, locking and logging would be concerns.
Maybe your way is better: delete the unwanted records from TableB first (after a backup!) to reduce the volume, then insert the data in blocks.
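Something like this, perhaps (untested; 50000 is just an illustrative batch size, and keycol stands in for the real key column):

SET ROWCOUNT 50000  -- cap each INSERT at one batch (SQL 2000 style)

WHILE 1 = 1
BEGIN
   INSERT INTO TableA
   SELECT b.*
   FROM TableB b
   LEFT OUTER JOIN TableA a ON a.keycol = b.keycol
   WHERE a.keycol IS NULL

   IF @@ROWCOUNT = 0 BREAK  -- nothing left to copy
END

SET ROWCOUNT 0  -- restore the default

Rows copied in one batch are matched by the join on the next pass, so nothing gets inserted twice.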
Far away is close at hand in the images of elsewhere.
Anon.
February 27, 2003 at 5:20 am
Also, to speed things up if your schedule allows, drop your indexes and add them back after you're done, plus disable triggers and constraints on TableA if TableB has already been validated in those regards.
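Something along these lines, roughly (all object names here are made-up examples, and note that CHECK CONSTRAINT ALL re-enables constraints without re-validating existing rows):

ALTER TABLE TableA DISABLE TRIGGER ALL      -- stop triggers firing during the load
ALTER TABLE TableA NOCHECK CONSTRAINT ALL   -- skip FK/CHECK validation
DROP INDEX TableA.IX_TableA_keycol          -- drop nonclustered indexes (SQL 2000 syntax)

-- ... bulk insert here ...

CREATE INDEX IX_TableA_keycol ON TableA (keycol)
ALTER TABLE TableA CHECK CONSTRAINT ALL
ALTER TABLE TableA ENABLE TRIGGER ALL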
February 27, 2003 at 5:14 pm
I use a similar methodology to David's. I've found that in most cases (there have been some exceptions) the left outer join insert runs with less overhead than a delete-where-exists followed by an insert. There is a major difference between these methods, though: the left outer join will leave you with the original values of duplicated rows, whereas the delete-then-insert will leave you with the new values of the duplicated rows. To compare them in a more even light, the left outer join scenario would need another update or delete to bring the duplicated rows' values up to the new ones.
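If you do want the new values, that extra step might look something like this (keycol and the column list are placeholders):

UPDATE a
SET a.col1 = b.col1,
    a.col2 = b.col2
FROM TableA a
INNER JOIN TableB b ON a.keycol = b.keycol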
The way I handle this is to verify an incrementing identity field or a timestamp-for-update/insert field. Most tables have one or the other or both, but if not, create a timestamp field. Now insert all of your records with one query, and then have a second query remove the oldest of the duplicates based on the timestamp or identity comparison. This has the side benefit of guaranteeing that only duplicate data can be deleted from your table, and you can never lose data from a failed insert that doesn't put the new values in after the old ones were deleted. By building a list of fields for the duplicated-row check (your true primary key) and eliminating the oldest row, you're left with the newer values.

This reduces the overhead of inserting the data into the table since there are no outer joins, just a straight write of everything, and the delete is now performed with an aggregate rather than a join as well. With highly selective indexes, perhaps covering ones, this can scream (aggregate functions see some of the biggest benefit from good indexes). Even so, the limit will always be whatever your disk IO allows, as you can't get around a basic read, write and delete of the data involved. And, as Antares mentioned, dropping the indexes for the insert can eliminate a large amount of disk IO, which will definitely make more of a difference if you're IO bound.
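As a rough sketch of that approach (id is a hypothetical IDENTITY column, and col1, col2 stand for the true-primary-key fields):

-- straight insert, no join; id values are assigned by the IDENTITY
INSERT INTO TableA (col1, col2)
SELECT col1, col2
FROM TableB

-- keep only the newest row per logical key; anything older is a duplicate
DELETE FROM TableA
WHERE id NOT IN (SELECT MAX(id)
                 FROM TableA
                 GROUP BY col1, col2)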
I think David's method is what is actually wanted here, though, and for the task described in this topic it would work best.
February 27, 2003 at 6:50 pm
Thank you all.
I deleted the duplicate records from table B and then added the remaining 6.6 million records in 43 minutes.
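In case it helps anyone else, the two steps looked roughly like this (keycol stands in for my real key column):

-- remove from TableB any rows already present in TableA
DELETE b
FROM TableB b
INNER JOIN TableA a ON a.keycol = b.keycol

-- everything left in TableB is new, so a straight insert is safe
INSERT INTO TableA
SELECT *
FROM TableB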
I have a question. I tried this code in a small sample database:
INSERT INTO TableA
SELECT b.*
FROM TableB b
LEFT OUTER JOIN TableA a ON a.key = b.key
WHERE a.key IS NULL
and it worked for the sample when both the file and field names were different. When I tried it on the actual data I had problems because the fields were named the same. I don't understand why it would work one way and not the other. I was modifying the same table name in a different database, say analytics.dbo.filename1 and aer.dbo.filename1, but both files' field names were exactly the same.
February 28, 2003 at 2:17 am
Curious. I don't have problems with code like this. You can have any combination of table and field names in the join (as long as they are of the same data type), but you need to alias all tables and fields.
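For example, with both tables aliased and the column list spelled out (the column names here are made up):

INSERT INTO analytics.dbo.filename1 (keycol, col1, col2)
SELECT b.keycol, b.col1, b.col2
FROM aer.dbo.filename1 b
LEFT OUTER JOIN analytics.dbo.filename1 a ON a.keycol = b.keycol
WHERE a.keycol IS NULL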
Far away is close at hand in the images of elsewhere.
Anon.