September 9, 2010 at 2:08 am
I have a database of source records amount to currently 91 million records and I expect a similar growth each month.
I am finding performance of my process disappointing and would be grateful if anyone can comment on whether I am missing something obvious.
My source data is loaded using SSIS from a tab separated file and one field contains categories - one or more of, comma separated, no maximum number - the load process works fine, it is the process for taking the string of categories and adding new ones to a lookup table and creating a mapping table to map the destination record id against category id that is bugging me.
So for instance my source record might be
Name<tab>Address<tab>Categories
where categories could be "CatA,CatB,CatC" or "CatA" etc and my destination tables are
Source : ID,Name,Address,Category_String (this is where I load the file into)
Destination : ID, Name, Address
Category : ID, Description
Mapping : Destination_ID, Category_ID
I've found that I actually only get 201 combinations of categories in the 91 million records so I go through those distinct combinations and check for new values for the lookups and replace the text strings with lookup IDs so instead of "CatA, CatB, CatC" I have "1,2,3"
I then have to go through each record in the source table (with a cursor) in the following way :
for each record
while category exists
get next category from source.category_string
insert Destination_ID, Category_ID into Mapping
until no more cateories in source.category_string
end
I know that's quite a simplification but essentially for each record I have to walk along the category string and then for each category insert a mapping record. I do that by using
if charindex( ',' , category_string) = 0 then on the last category else this_category = substring(category_string ...)
If you've read this far, thank you for bearing with me - I'm wondering if there is a more efficient way of handling the comma separated string than what I've already got - I guess part of the problem is that I don't know how many categories I will get (there are currrently over 100 distinct categories).
I've tinkered with various indexes and have decent improvements, I've broken the process down so I don't attempt to deal with all the records at once (currently I do batches and use transactions so the proces is restartable without having to reload the data from scratch), I also delete the source data as I finish with it (as part of the transaction) and all of these things are helping, I also realise I could change my SSIS package to process more than one batch at a time - so I'm not worried about the overall process it's just that mapping of a comma separated string in a single field to actual database content that is bothering me.
I just can't help but think that I've potentially created something more complicated than it should be!
Many thanks
September 9, 2010 at 2:46 am
Surely this would be tons easier and quicker if you were to normalise out the categories quite early on?
DROP TABLE #Temp
CREATE TABLE #Temp (ID INT IDENTITY(1,1), Name VARCHAR(30), [Address] VARCHAR(60), Category_String VARCHAR(2000))
INSERT INTO #Temp (Name, [Address], Category_String)
SELECT 'David Cameron', '10 Downing Street', 'CatA,CatB,CatC' UNION ALL
SELECT 'Sam Cameron', '10 Downing Street', 'CatA,CatC,CatF'
SELECT s.ID, Name, s.[Address], s.Category_String, iTVF.[Category]
FROM #Temp s
CROSS APPLY (
SELECT SUBSTRING(d.Category_String, d.b, d.e - d.b) AS [Category]
FROM (
SELECT s.Category_String,
n.n AS b, -- delimiter at beginning of string
ISNULL(NULLIF(CHARINDEX(',', s.Category_String, n.n+1), 0), LEN(s.Category_String)+1) AS e -- delimiter at end of string
FROM (SELECT TOP 200 [n] = ROW_NUMBER() OVER(ORDER BY [name]) FROM master.dbo.syscolumns) n -- sequential number/row generator
WHERE SUBSTRING(','+s.Category_String+',', n.n, 1) = ','
AND n.n <= LEN(s.Category_String)
) d
) iTVF
ORDER BY s.ID, iTVF.[Category]
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
September 9, 2010 at 3:01 am
Thanks - into a world I know nothing about so will be reading your articles shortly - will let you know how I get on.
September 10, 2010 at 7:46 am
I'd suggest reading through this article[/url] from Jeff Moden. He really lays down some great methods for going through comma seperated fields like that.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
September 10, 2010 at 7:54 am
Will do - I gave the other articles a good reading and the block of code from Chris is still looking a little like black magic to me to be honest, however, running it on my databse it clearly works and although I've not used it for that because there are a few other steps I go through I didn't mention (they didn't seem material at the time!) which kind of spoil things.
On the other hand, having looked at the code it has introduced partitioning to me which I was blissfully unaware of and it seems likely that is going to revolutionise the solution to my other problem which is brilliant.
September 10, 2010 at 8:11 am
Jeff's excellent article will explain string splitting - if you're unsure of anything, just ask.
This
(SELECT TOP 200 [n] = ROW_NUMBER() OVER(ORDER BY [name]) FROM master.dbo.syscolumns) n
is a mini tally table generated on-the-fly.
Don't forget to post table ddl and insert scripts to guarantee tested code.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
September 10, 2010 at 8:18 am
Thanks chap - code probably too embarassing to post!;-)
September 10, 2010 at 8:24 am
Balance (9/10/2010)
Thanks chap - code probably too embarassing to post!;-)
If you think there's room for improvement, then post it - don't be shy, we're all learners here.
Anyway, the coffee lands on this monitor, not yours 😀
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
Viewing 8 posts - 1 through 7 (of 7 total)
You must be logged in to reply to this topic. Login to reply