February 7, 2006 at 3:44 pm
I have a table Claimdata with about 19 million rows. This table is imported weekly from SAS. The import takes about an hour using the Transform Data Task, and then the following update is run from an Execute SQL Task, which takes 4-5 hours to complete.
UPDATE Claimdata
SET Claimdata.LagCategory = FinancialCategories.LagCategory
FROM Claimdata
INNER JOIN FinancialCategories
ON FinancialCategories.ClaimType = Claimdata.ClaimType
Here’s the FinancialCategories DDL
CREATE TABLE [FinancialCategories] (
    ClaimType varchar(50),
    LagCategory varchar(50),
    FinancialCategory varchar(50)
) ON [PRIMARY]
GO
There are no indexes on any of the fields involved in the query. Does anyone have any suggestions on how I could reduce the time it takes to UPDATE this field? I really appreciate any advice on this. Thanks !
February 7, 2006 at 4:01 pm
also, there are only about 15 rows in FinancialCategories. Maybe I should just put it in a function?
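For illustration only, such a function might look like this (the fn_GetLagCategory name is made up; bear in mind a scalar function would be called once per row, so against 19 million rows it may well be slower than the set-based join):
CREATE FUNCTION dbo.fn_GetLagCategory (@ClaimType varchar(50))
RETURNS varchar(50)
AS
BEGIN
    -- Look up the lag category for one claim type
    DECLARE @LagCategory varchar(50)
    SELECT @LagCategory = LagCategory
    FROM FinancialCategories
    WHERE ClaimType = @ClaimType
    RETURN @LagCategory
END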
February 7, 2006 at 4:27 pm
1. Since there is no clustered index, the table may be a highly fragmented heap, requiring more disk I/O than necessary
2. If the database is running in full recovery mode, the UPDATE will be generating large amounts of log activity
3. If the column Claimdata.LagCategory is initially loaded as NULL, and then set to a value, this prevents "Update in place", so each row UPDATE will actually be implemented as a DELETE followed by an INSERT behind the scenes. If #2 is true, then the combination of #2 & #3 is a real performance killer
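Both of the first two points are easy to check (assuming SQL Server 2000, which the DTS tasks suggest):
-- Point 1: show fragmentation on the heap
DBCC SHOWCONTIG ('Claimdata')
-- Point 2: show the current recovery model
SELECT DATABASEPROPERTYEX(DB_NAME(), 'Recovery')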
February 7, 2006 at 4:50 pm
Thanks for the reply PW.
1.) Do you mean a clustered index on FinancialCategories or Claimdata?
2.) It is running in Simple recovery mode.
3.) Interesting, there is no transformation mapped to that field, so you're right, it probably is putting NULL into that column. How can I get around this?
Thanks !
February 8, 2006 at 6:54 am
You could try mapping a transformation from any field into Claimdata.LagCategory, choose ActiveX script, and have it write a fixed literal as long as the longest value that will ever be placed in this field. Or you could use a DEFAULT on the column definition.
If you put a DEFAULT on, I would try to do it while Claimdata is empty (if possible).
If LagCategory will not change, then you could look it up through your join on FinancialCategories.
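A sketch of the DEFAULT approach, assuming the Transform Data Task leaves LagCategory out of its column list so the default actually fires (the constraint name and padding literal are made up; make the literal as wide as the widest real LagCategory so the later update does not have to grow the row):
ALTER TABLE Claimdata
    ADD CONSTRAINT DF_Claimdata_LagCategory
    DEFAULT 'XXXXXXXXXXXXXXXXXXXX' FOR LagCategory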
February 8, 2006 at 8:10 am
I would recommend starting by setting up an index on ClaimType on both tables. Depending on the data you get, it could be a clustered index. Having said that, if there is a date field on the Claimdata table, I would use that for the clustered index in combination with ClaimType and break the update up by date range.
The general rule when building an index is that your goal is to decrease the number of records SQL has to search through, ideally jumping straight to the record it needs. Breaking the update statement up by date will help too.
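For example (the index names are illustrative, and ClaimDate is only a guess at whatever date field Claimdata has):
CREATE CLUSTERED INDEX IX_Claimdata_ClaimDate_ClaimType
    ON Claimdata (ClaimDate, ClaimType)
CREATE INDEX IX_FinancialCategories_ClaimType
    ON FinancialCategories (ClaimType)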
Hope this helps.
Vadim.
February 8, 2006 at 1:28 pm
A clustered index on the large table should help the most, but any index on the referenced columns should give you better performance.
Also, adding this to the beginning
SET ROWCOUNT 100000
and this to the end
WHERE Claimdata.LagCategory IS NULL OR Claimdata.LagCategory != FinancialCategories.LagCategory
then either wrap it in a WHILE loop, or use SQL Agent to run 10 or 20 updates with no loop.
This should keep each transaction small and limit the amount of work per batch nicely. You should change the rowcount to fit your environment.
If you want it automated, I'd recommend the loop, once you've sized the rowcount accordingly.
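Put together, a sketch of the batched version, assuming SQL Server 2000 where SET ROWCOUNT still limits UPDATE statements (the IS NULL test matters because the freshly loaded NULLs never compare unequal to anything, so != alone would skip them):
SET ROWCOUNT 100000
WHILE 1 = 1
BEGIN
    UPDATE Claimdata
    SET Claimdata.LagCategory = FinancialCategories.LagCategory
    FROM Claimdata
    INNER JOIN FinancialCategories
    ON FinancialCategories.ClaimType = Claimdata.ClaimType
    WHERE Claimdata.LagCategory IS NULL
       OR Claimdata.LagCategory != FinancialCategories.LagCategory
    -- Stop when a pass finds nothing left to change
    IF @@ROWCOUNT = 0 BREAK
END
SET ROWCOUNT 0   -- always reset ROWCOUNT afterwards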
February 8, 2006 at 4:20 pm
What about not running the update at all?
What is the cost of modifying the views that reference the data, so that the join above is used there instead?
Not running your update would also leave your data normalised to some extent.
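For illustration only, something along these lines would give readers the LagCategory at query time without ever materialising it in Claimdata (the view name and column list are made up):
CREATE VIEW dbo.vClaimdataWithLagCategory
AS
SELECT c.ClaimType,
       -- ...other Claimdata columns as needed...
       f.LagCategory
FROM Claimdata AS c
INNER JOIN FinancialCategories AS f
    ON f.ClaimType = c.ClaimType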
February 9, 2006 at 11:51 am
Thanks for all the replies! The clustered index definitely speeds up the update, but it takes a long time to add the index to the table (all indexes are dropped before the load).
What about using a Lookup in the Transform Data Task? I've never done one before, so 1.) do you think this would be a good idea? and 2.) would this be the right way to go about doing it? Thanks!
Using the Lookup Query below:
SELECT FinancialCategories.LagCategory
FROM FinancialCategories INNER JOIN CLAIMDATASET ON FinancialCategories.ClaimType = CLAIMDATASET.ClaimType
WHERE (CLAIMDATASET.ClaimType = ?)
Function Main()
DTSDestination("LagCategory") = DTSLookups("lkp_LagCategory").Execute(DTSSource("ClaimType"))
Main = DTSTransformStat_OK
End Function
February 9, 2006 at 12:01 pm
Have you considered leaving the Clustered index in place and just keeping it at a low fill factor/Padding?
This way your data should load almost as fast, but you won't have to rebuild any indexes, just add the nonclustered indexes that support your update. When that is done, add the rest back.
19 million rows isn't very much. Maybe you need to look at your disk drive configuration and available memory too. Adding a one- or two-column clustered index to a standard table with 19 million rows should take 10-15 minutes max.
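In SQL 2000 syntax that would be something like this (the key column and index name are illustrative):
CREATE CLUSTERED INDEX IX_Claimdata_ClaimType
    ON Claimdata (ClaimType)
    WITH PAD_INDEX, FILLFACTOR = 70
A fill factor of around 70 leaves roughly 30% free space per page for the weekly load; tune it to how much the table actually grows.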
February 9, 2006 at 12:03 pm
Lookups can be useful if you need to look something up on a different data source.
With 2 tables in the same data source, a lookup will nearly always be slower. You can always benchmark it, just remember to set the cache size large enough.
>>
SELECT FinancialCategories.LagCategory
FROM FinancialCategories INNER JOIN CLAIMDATASET ON FinancialCategories.ClaimType = CLAIMDATASET.ClaimType
WHERE (CLAIMDATASET.ClaimType = ?)
This looks like different functionality than the original UPDATE.
Where did table CLAIMDATASET enter the picture, and what size is it ?
February 10, 2006 at 6:57 am
A couple of things that we have done with similar issues are:
1. Use WITH (TABLOCKX) to avoid the use of multiple page locks (if this is a problem you will see screenfuls of page locks building up in tempdb).
2. If possible, split the update into several updates - for example, write a cursor to pick up each category, then update the table for each category (see the sketch below). The aim is to split the 19 million rows into 19 x 1 million. This can work because you're less likely to be over-stretching the memory.
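A sketch of that per-category approach (the cursor and variable names are illustrative):
DECLARE @ClaimType varchar(50), @LagCategory varchar(50)
DECLARE category_cursor CURSOR FOR
    SELECT ClaimType, LagCategory FROM FinancialCategories
OPEN category_cursor
FETCH NEXT FROM category_cursor INTO @ClaimType, @LagCategory
WHILE @@FETCH_STATUS = 0
BEGIN
    -- One smaller update per category, with the table lock hint from point 1
    UPDATE Claimdata WITH (TABLOCKX)
    SET LagCategory = @LagCategory
    WHERE ClaimType = @ClaimType
    FETCH NEXT FROM category_cursor INTO @ClaimType, @LagCategory
END
CLOSE category_cursor
DEALLOCATE category_cursor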
February 11, 2006 at 11:22 pm
You could also try avoiding the update altogether by importing your SAS data into a staging table, then having a second transformation that inserts the joined data into Claimdata, i.e. with transformation source SQL such as:
SELECT StagingClaimdata.<attributes>, FinancialCategories.LagCategory
FROM StagingClaimdata
INNER JOIN FinancialCategories
ON FinancialCategories.ClaimType = StagingClaimdata.ClaimType
If the import from SAS is taking about an hour, this should take much the same time.