November 15, 2012 at 3:17 am
Hi guys,
I am looking after / building a solution that loads data from flat files using bulk insert.
The file contains a unique reference (PolicyID) which is in each file but the system architect has requested that each file load is uniquely identified. To do this we have added Validfrom and ValidTo columns to the table. The Bulk Insert now uses a format file to skip these columns as they are not in the source file.
The whole process is
Update the current records which have a dummy end date (31st Dec 2099) with today's date.
Bulk insert the file with the ValidFrom and ValidTo dates being NULL.
Update the new records with ValidFrom = todays date and ValidTo = dummy date.
I have an index built (Unique but not a PK and not clustered) PolicyID (ASC),ValidFrom(ASC),ValidTo(ASC)
The data load seems to be reasonably quick; But I think the index update is taking a huge performance hit. The file is about 750K records and the first update will be updating records at the end of the index pages, then the load will load with NULL values which I am assuming are going into the front of the index, then being rewritten to go to the end of the index. It this a reasonable assumption and if so,
how do I measure it whilst the script is running, and is it worth turning off the index whist doing the load and then rebuilding it. The data will be loaded on a weekly basis and I need to replicate the process across about 20 files of differing sizes.
I will normally only be reporting on records from the most current insert, looking for a specific policyID, so should I change the order or direction of the index fields: PolicyID and either ValidFrom or ValidTo should be enough to give me a unique key. Is it worth dropping either of the fields from the index.
November 15, 2012 at 4:04 am
It's quite common to drop or disable nonclustered indexes before a data load and rebuild or recreate them afterwards.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
November 15, 2012 at 4:33 am
Are the ValidFrom and ValidTo Date or DateTime, as you couldput default constraints on them.
That way you dont have to run the update after the load and only have to rebuild the index after the load is complete.
_________________________________________________________________________
SSC Guide to Posting and Best Practices
November 15, 2012 at 4:56 am
Thanks Jason, I am going to investigate that option a bit further. It seems to make a lot of sense as it involved only hitting the record once (when it is created)
My concern is that all 20 files need to load with the same FROM date even if the load runs over midnight (I have no idea when it will be run or how long it will take on the production server) and although I can use standard CASE WHEN THEN END logic to roll the date back if we go over midnight, I really want to be able to set the default from a variable value (which you obviously can't do because then it wouldnt be a default :w00t:)
The Bulk Insert script is called 20 times from an SSIS package and the From Date would be the package start date
November 15, 2012 at 5:06 am
If it can load over midnight then you might have a problem.
Just a thought but cant you alter the default constraint at the start of each load process so that rather than doing a GetDate() it uses a fixed date for that batch.
_________________________________________________________________________
SSC Guide to Posting and Best Practices
November 15, 2012 at 5:52 am
I don't think so, the default has to be set as part of the DDL for the table and you can't use variables in that situation.
The SSIS package calls a stored procedure which builds a dynamic Bulk Insert script which is then executed with exec sp_executesql() I do this because the date formats in the source file are DMY which you can set in t-SQL but not with the Bulk Insert Task in SSIS. I opted not to go down the Data Transfomation task route as I have already built the format files and the only transform I need to do is on the date; it didn't seem worth it to confuse future maintainers (I am a contrator) with having to remap the SSIS tasks each time the data changes - much easier to edit the BCP format files.
November 15, 2012 at 6:23 am
Fair point Aaron, it would mean dipping into Dynamic SQL to inject the value into the Alter table script.
_________________________________________________________________________
SSC Guide to Posting and Best Practices
Viewing 7 posts - 1 through 6 (of 6 total)
You must be logged in to reply to this topic. Login to reply