T-SQL code performance - am I missing something obvious

    I have a database of source records currently amounting to 91 million records, and I expect similar growth each month.

    I am finding the performance of my process disappointing and would be grateful if anyone could comment on whether I am missing something obvious.

    My source data is loaded using SSIS from a tab-separated file, and one field contains categories - one or more, comma separated, with no maximum number. The load process works fine; it is the process of taking the string of categories, adding new ones to a lookup table, and creating a mapping table to map the destination record ID against the category ID that is bugging me.

    So for instance my source record might be

    Name<tab>Address<tab>Categories

    where Categories could be "CatA,CatB,CatC" or just "CatA", etc., and my destination tables are

    Source : ID, Name, Address, Category_String (this is where I load the file into)

    Destination : ID, Name, Address

    Category : ID, Description

    Mapping : Destination_ID, Category_ID
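
    In simplified DDL terms (column types and lengths trimmed down for the example), the tables look roughly like this:

    -- Simplified versions of the real tables; types and lengths are abbreviated here.
    CREATE TABLE Source      (ID INT IDENTITY(1,1) PRIMARY KEY, Name VARCHAR(100), [Address] VARCHAR(200), Category_String VARCHAR(2000));
    CREATE TABLE Destination (ID INT PRIMARY KEY, Name VARCHAR(100), [Address] VARCHAR(200));
    CREATE TABLE Category    (ID INT IDENTITY(1,1) PRIMARY KEY, Description VARCHAR(100));
    CREATE TABLE Mapping     (Destination_ID INT NOT NULL, Category_ID INT NOT NULL);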

    I've found that there are actually only 201 distinct combinations of categories in the 91 million records, so I go through those distinct combinations, check for new values for the lookup table, and replace the text strings with lookup IDs - so instead of "CatA,CatB,CatC" I have "1,2,3".
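
    In outline, the new-category check is just an insert along these lines (simplified - #SplitNames here stands in for wherever the split-out category names end up):

    -- #SplitNames is a stand-in for the work table holding the split-out category names.
    INSERT INTO Category (Description)
    SELECT DISTINCT n.Description
    FROM   #SplitNames n
    WHERE  NOT EXISTS (SELECT 1 FROM Category c WHERE c.Description = n.Description);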

    I then have to go through each record in the source table (with a cursor) in the following way:

    for each record
        while a category exists
            get next category from source.category_string
            insert Destination_ID, Category_ID into Mapping
        until no more categories in source.category_string
    end

    I know that's quite a simplification, but essentially for each record I have to walk along the category string and then, for each category, insert a mapping record. I do that by using

    if charindex(',', category_string) = 0 then I'm on the last category, else this_category = substring(category_string ...)
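
    In code terms, the walk looks roughly like this (a simplified sketch - variable names changed, and Category_String already holds the numeric lookup IDs at this point):

    -- Simplified sketch of the per-record walk; in the real code the first two variables
    -- are set from the cursor fetch, and Category_String holds IDs like '1,2,3' by this stage.
    DECLARE @Destination_ID  INT,            -- fetched from the cursor
            @Category_String VARCHAR(2000),  -- fetched from the cursor
            @pos             INT,
            @this_category   VARCHAR(20);

    WHILE LEN(@Category_String) > 0
    BEGIN
        SET @pos = CHARINDEX(',', @Category_String);

        IF @pos = 0
        BEGIN
            -- no comma left, so this is the last (or only) category
            SET @this_category   = @Category_String;
            SET @Category_String = '';
        END
        ELSE
        BEGIN
            SET @this_category   = SUBSTRING(@Category_String, 1, @pos - 1);
            SET @Category_String = SUBSTRING(@Category_String, @pos + 1, LEN(@Category_String));
        END

        INSERT INTO Mapping (Destination_ID, Category_ID)
        VALUES (@Destination_ID, CAST(@this_category AS INT));
    END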

    If you've read this far, thank you for bearing with me. I'm wondering if there is a more efficient way of handling the comma-separated string than what I've already got - I guess part of the problem is that I don't know how many categories I will get (there are currently over 100 distinct categories).

    I've tinkered with various indexes and got decent improvements. I've broken the process down so I don't attempt to deal with all the records at once (currently I do batches and use transactions so the process is restartable without having to reload the data from scratch), and I delete the source data as I finish with it (as part of the transaction). All of these things are helping, and I realise I could change my SSIS package to process more than one batch at a time - so I'm not worried about the overall process; it's just the mapping of a comma-separated string in a single field to actual database content that is bothering me.
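
    For completeness, the batching is shaped something like this (the batch size and names are changed for the example):

    -- Outline of the restartable batching; batch size and names are illustrative only.
    DECLARE @BatchSize INT;
    SET @BatchSize = 50000;

    CREATE TABLE #Batch (ID INT PRIMARY KEY);

    WHILE EXISTS (SELECT 1 FROM Source)
    BEGIN
        BEGIN TRANSACTION;

        -- pick the next slice of work
        INSERT INTO #Batch (ID)
        SELECT TOP (@BatchSize) ID FROM Source ORDER BY ID;

        -- ... process the rows in #Batch into Destination and Mapping here ...

        -- delete the finished rows so a restart carries on where it left off
        DELETE s
        FROM   Source s
        INNER JOIN #Batch b ON b.ID = s.ID;

        COMMIT TRANSACTION;

        TRUNCATE TABLE #Batch;
    END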

    I just can't help but think that I've potentially created something more complicated than it should be!

    Many thanks

  • Surely this would be tons easier and quicker if you were to normalise out the categories quite early on?

    IF OBJECT_ID('tempdb..#Temp') IS NOT NULL DROP TABLE #Temp;

    CREATE TABLE #Temp (ID INT IDENTITY(1,1), Name VARCHAR(30), [Address] VARCHAR(60), Category_String VARCHAR(2000));

    INSERT INTO #Temp (Name, [Address], Category_String)
    SELECT 'David Cameron', '10 Downing Street', 'CatA,CatB,CatC' UNION ALL
    SELECT 'Sam Cameron', '10 Downing Street', 'CatA,CatC,CatF';

    SELECT s.ID, s.Name, s.[Address], s.Category_String, iTVF.[Category]
    FROM #Temp s
    CROSS APPLY (
        SELECT SUBSTRING(d.Category_String, d.b, d.e - d.b) AS [Category]
        FROM (
            SELECT s.Category_String,
                   n.n AS b, -- delimiter at beginning of string
                   ISNULL(NULLIF(CHARINDEX(',', s.Category_String, n.n+1), 0), LEN(s.Category_String)+1) AS e -- delimiter at end of string
            FROM (SELECT TOP 200 [n] = ROW_NUMBER() OVER(ORDER BY [name]) FROM master.dbo.syscolumns) n -- sequential number/row generator
            WHERE SUBSTRING(','+s.Category_String+',', n.n, 1) = ','
              AND n.n <= LEN(s.Category_String)
        ) d
    ) iTVF
    ORDER BY s.ID, iTVF.[Category];
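
    If you wrap the splitter above into an inline table-valued function - say dbo.SplitCategories, the name is just an example - then populating your Mapping table becomes a single set-based insert, something along these lines (untested against your schema, and it assumes the Destination row ID matches the Source row ID):

    -- Untested sketch: dbo.SplitCategories (hypothetical name) is the splitter above wrapped
    -- up as an inline table-valued function returning one [Category] value per row.
    INSERT INTO Mapping (Destination_ID, Category_ID)
    SELECT s.ID, c.ID
    FROM   Source s
    CROSS APPLY dbo.SplitCategories(s.Category_String) split
    INNER JOIN Category c ON c.Description = split.[Category];

    No cursor and no row-by-row string walking - the joins do all the work.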

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Thanks - into a world I know nothing about so will be reading your articles shortly - will let you know how I get on.

    I'd suggest reading through this article from Jeff Moden. He really lays down some great methods for going through comma separated fields like that.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

    Will do - I gave the other articles a good read, and the block of code from Chris still looks a little like black magic to me, to be honest. However, running it on my database it clearly works, although I've not used it for that purpose because there are a few other steps I go through that I didn't mention (they didn't seem material at the time!) which kind of spoil things.

    On the other hand, looking at the code has introduced me to partitioning, which I was blissfully unaware of, and it seems likely that is going to revolutionise the solution to my other problem - which is brilliant.

  • Jeff's excellent article will explain string splitting - if you're unsure of anything, just ask.

    This:

    (SELECT TOP 200 [n] = ROW_NUMBER() OVER(ORDER BY [name]) FROM master.dbo.syscolumns) n

    is a mini tally table generated on-the-fly.
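
    If you ever need more than a couple of hundred numbers - for longer strings, say - a permanent tally table along the lines of Jeff's article is worth keeping around. A rough one-off build (the table and constraint names are just suggestions):

    -- One-off build of a permanent tally table; table and constraint names are suggestions.
    SELECT TOP 1000000
           IDENTITY(INT, 1, 1) AS N
    INTO   dbo.Tally
    FROM   master.dbo.syscolumns a
    CROSS JOIN master.dbo.syscolumns b;

    ALTER TABLE dbo.Tally ADD CONSTRAINT PK_Tally_N PRIMARY KEY CLUSTERED (N);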

    Don't forget to post table DDL and insert scripts to guarantee tested code.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

    Thanks chap - code probably too embarrassing to post! ;-)

  • Balance (9/10/2010)

    Thanks chap - code probably too embarrassing to post! ;-)

    If you think there's room for improvement, then post it - don't be shy, we're all learners here.

    Anyway, the coffee lands on this monitor, not yours 😀

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
