December 21, 2010 at 3:06 pm
I had a quick question (though I am sure there are no straight answers for this)... is there a huge performance problem with running a query such as:
INSERT INTO dbo.Table1 ---- 6,000,000 existing records
SELECT *
FROM dbo.Table1_Temp ---- 200,000 new records, some possible duplicates
EXCEPT
SELECT * FROM dbo.Table1 ---- Change this to use only the columns covered in the primary index?
We are creating a new software system to put data (nightly) into a large database. The new system uses temp tables (for other processing) before putting the data into the normalized table(s). The query above would be repeated 30 times for the various normalized table.
The alternative is to run a lot of queries to detect duplicates before the insert, but, due to the dataset and normalization, that would involve a lot of coding on each table to confirm and eliminate the duplicates.
At this point, this is a high-level, "strategy"/"best practices" type of question, not a "what exactly are you trying to do" question. What are are trying to do is almost exactly the code listed above... 🙂
Argue for your limitations, and sure enough they're yours (Richard Bach, Illusions)
December 22, 2010 at 5:02 am
You're right that there are no simple answers that are always applicable, but there are some guidelines.
First, using EXCEPT is almost never going to be the best way to accomplish what you want, though it would be improved by only comparing keys from a unique index (often the primary key).
In general, what you are doing is best done in SSIS - the tool designed for high-volume extract/transform/load scenarios. Do the de-duplication, validation and other cleansing in an SSIS package. You can use the Lookup or Merge transformations to determine if a row already exists in the destination table or not. This can be made to be extremely efficient and make effective use of parallelism.
Even more generally, the whole thing stands or falls depending on your specific data and requirements, and the design of your tables. Many high-performance data loading systems use partitioning, minimally-logged fast loads, and partition switching to add large numbers of new rows.
If you want to provide more details about the data to be loaded (and how it might be partitioned) you'll get a better answer than this one.
December 22, 2010 at 5:12 am
This was removed by the editor as SPAM
Viewing 3 posts - 1 through 2 (of 2 total)
You must be logged in to reply to this topic. Login to reply