July 14, 2008 at 7:35 am
Hi,
we have an ETL process that runs every night, and we get data (text files) of about 4 million rows. Only about 300 thousand of those rows have been updated, so we have to run a delete statement for the rest of the rows, which takes a while to execute. Does anyone know another way around this? We thought of cursors, but the overhead for cursors is high as well. A temp table has the same problem: tempdb would fill up and we'd have the same performance issue.
Thanks,
FK
July 14, 2008 at 7:45 am
Cursors = really, really bad idea
Could you describe what happens in a little more detail please?
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
July 14, 2008 at 7:49 am
Agreed, cursors are bad and we are staying away from them!!
The current delete statement takes about 872729 ms (roughly 14.5 minutes), which is kind of high. What the delete does is remove the unchanged data (3.5 million rows) and keep the changed ones (.5 million rows).
July 14, 2008 at 7:58 am
Could you describe the ETL process a bit more please?
You may find it more efficient to copy the rows you want to keep to a temp table, truncate the original table and copy the rows back.
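Very roughly, something like the sketch below. The table, column, and flag names are only placeholders, and it assumes nothing references the table with a foreign key (TRUNCATE won't run against a table that is referenced by one) and that there's no identity column to preserve.

SELECT *
INTO   #keep
FROM   dbo.BigTable              -- placeholder table name
WHERE  ChangedFlag = 'Y';        -- placeholder condition: the rows you want to keep

TRUNCATE TABLE dbo.BigTable;     -- minimally logged, far cheaper than the big DELETE

INSERT INTO dbo.BigTable
SELECT * FROM #keep;

DROP TABLE #keep;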
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
July 14, 2008 at 8:06 am
Well, I think that should work. Thanks.
Our ETL process does the following:
We have mainframe files that get loaded at night. The ETL loads each file into a table, and once it's there we use a slowly changing dimension to check various fields for whether they were updated or what needs to be updated. For the rows we don't want to do anything with, we delete based on a flag column, and then continue loading the rest into the appropriate fact tables. That delete step was taking a while; we created a non-clustered index on the flag column, but it still didn't help. I will try creating a temp table, or even a regular table, and doing the truncate. Makes sense. Thanks.
July 14, 2008 at 10:58 am
Which column did you create the index on? How big is the table? If the table is big and there are many indexes on it, the delete will take a while.
Did you mention whether you used a cursor, or just the 'DELETE' statement?
DELETE x
FROM Tablex x
INNER JOIN temptable t ON x.col1 = t.col1
WHERE x.Delcol = 'Y'
July 14, 2008 at 11:03 am
The table is about 4 million rows, very big, but the temp table idea seems to be working. Thanks.
July 14, 2008 at 11:47 pm
Generally, if you're deleting up to half the table (maybe even more), it's faster to save the rows you want and then truncate the table than it is to run the delete.
Less impact on the tran log too, since truncate is a minimally logged operation.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
July 15, 2008 at 3:09 pm
I was on a project where we had the same issue, but with Excel. What we did was move all the SQL validation code and operations we had in the ETL into stored procedures, and we cut the run time by almost 80%. It worked. Also, if the file contains information that isn't needed, it can be deleted before doing any operations.
July 16, 2008 at 7:12 am
The truncate solution seems best, but if you really get stuck having to delete millions of rows, one technique that has worked for me is setting a rowcount of something like 50000 or 100000 and looping through the delete statement until the row count comes back as 0. It keeps the tran log manageable. Each delete statement usually takes a few seconds, and even if you loop through 30-40 times, the total time may only be a few minutes.
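For example, something along these lines. The table and column names are just borrowed from the earlier example in this thread, so treat them as placeholders; on SQL 2005 and later, DELETE TOP (n) gives the same batching effect as SET ROWCOUNT.

DECLARE @batch INT;
SET @batch = 50000;              -- rows per batch

WHILE 1 = 1
BEGIN
    DELETE TOP (@batch)
    FROM dbo.Tablex              -- placeholder table
    WHERE Delcol = 'Y';          -- rows flagged for removal

    IF @@ROWCOUNT = 0
        BREAK;                   -- nothing left to delete, stop looping
END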
July 16, 2008 at 7:54 am
Be aware that the SCD transform in SSIS generally does not scale well on large data sets.