Best way to normalize

  • The SELECT query below returns the desired resultset but it has to "look back" at the data to find the Category for each row. I could have done an explicit loop, but figured I should at least be considerate enough to hide my RBAR 😀 I am confident this technique will slow down significantly as the data set grows. Is there a "better" way to write this query? I have a tally table and am on SQL 2008.

    DECLARE @Elements TABLE
        (
        Number INT,
        Value  VARCHAR(MAX)
        );

    INSERT INTO @Elements
            (Number, Value)
    SELECT  1 AS [Number], N'Category A' AS [Value] UNION ALL
    SELECT  2, N'281,286'    UNION ALL
    SELECT  3, N'281,283'    UNION ALL
    SELECT  4, N'281,282'    UNION ALL
    SELECT  5, N'281,282'    UNION ALL
    SELECT  6, N'280,283'    UNION ALL
    SELECT  7, N'Category B' UNION ALL
    SELECT  8, N'250,257'    UNION ALL
    SELECT  9, N'265,256'    UNION ALL
    SELECT 10, N'234,255'    UNION ALL
    SELECT 11, N'249,254'    UNION ALL
    SELECT 12, N'250,254'    UNION ALL
    SELECT 13, N'Category C' UNION ALL
    SELECT 14, N'281,311'    UNION ALL
    SELECT 15, N'218,220'    UNION ALL
    SELECT 16, N'218,218'    UNION ALL
    SELECT 17, N'219,221'    UNION ALL
    SELECT 18, N'218,219';

    SELECT  ( SELECT Value
                FROM @Elements
               WHERE Number = ( SELECT MAX(Number)
                                  FROM @Elements
                                 WHERE Number < e.Number
                                   AND Value LIKE 'Category%' )
            ) AS Category,
            SUBSTRING(e.Value, 1, CHARINDEX(',', e.Value, 1) - 1) AS Value1,
            SUBSTRING(e.Value, CHARINDEX(',', e.Value, 1) + 1, 100) AS Value2
      FROM @Elements e
     WHERE e.Value NOT LIKE 'Category%';

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • I'm not sure what the data represents, but it appears that you'd be better off changing the schema rather than working on the query. Again, without knowing what the data represents, I'd suggest a Category table and then relate the Element table to the category table.
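    As a rough sketch of the kind of schema being suggested here (the table and column names below are illustrative assumptions only, not anything defined in this thread):

    -- Hypothetical normalized shape: categories live in their own table and
    -- each measurement row points back at its category.
    CREATE TABLE dbo.Category
        (
        CategoryID   INT IDENTITY(1,1) PRIMARY KEY,
        CategoryName VARCHAR(100) NOT NULL
        );

    CREATE TABLE dbo.Element
        (
        ElementID            INT IDENTITY(1,1) PRIMARY KEY,
        CategoryID           INT NOT NULL
                                 REFERENCES dbo.Category (CategoryID),
        SomeMeasurement      INT,
        SomeOtherMeasurement INT
        );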

  • Lamprey13 (4/25/2011)


    I'm not sure what the data represents, but it appears that you'd be better off changing the schema rather than working on the query. Again, without knowing what the data represents, I'd suggest a Category table and then relate the Element table to the category table.

    The end goal is to put the data into a table that looks like this, but first I need to relate the category from a preceding row to the rows that follow it:

    CREATE TABLE dbo.CategoryInfoAboutSpecificMeasurements
        (
        CategoryID           INT,
        SomeMeasurement      INT,
        SomeOtherMeasurement INT
        );

    The data in my sample comes from a file and using SSIS is not an option. I have to flatten it so I can normalize it. Also worth mentioning: there can be a different number of data rows between categories, i.e. unlike my test data the Category Name will not be in every 6th row of the data file.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • The query creates a couple of Triangular joins where one internal leg consists of 270 rows and the other consists of 135 rows. Please see the following article for why that's a bad thing.

    http://www.sqlservercentral.com/articles/T-SQL/61539/
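    To put rough numbers on the shape being described: a "look back" predicate like the one in the original query,

        WHERE Number < e.Number
              AND Value LIKE 'Category%'

    forces each detail row to re-examine the rows that precede it, so the total work grows roughly with N squared over 2 rather than with N. That's nothing for an 18-row sample, but for a million rows it is on the order of 5 * 10^11 row comparisons, which is why this kind of query falls apart as the data grows. (This is only back-of-the-envelope arithmetic to illustrate the pattern the article describes; the exact counts depend on the predicates and indexing involved.)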

    How many rows does this "staging" table normally contain?

    And, since it is just a staging table, can we add an extra column to it?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Alright... I'll just do like I normally do and test this type of thing on a million rows. Here's the code to make a million row test table like your original table. The key here is that this table is supposed to be a staging table of some sort. If that's true, then you should be able to add the 3 columns I made to speed things up and to allow for verification compared to the original data.

    Not to worry... this whole thing, including building the Clustered Index takes something less than 10 seconds on my 9 year old desktop. It should fly on a more modern machine. As usual, the details are in the comments.

    --========================================================================================
    --      Build the test data. Only the extra columns we added are part of the solution.
    --========================================================================================
    --===== Conditionally drop the test table to make reruns easier.
         IF OBJECT_ID('tempdb..#Elements','U') IS NOT NULL
            DROP TABLE #Elements
    ;
    --===== Declare how many rows we want in the test table.
    DECLARE @RowsToTest INT;
     SELECT @RowsToTest = 1000000;

    --===== Now, build the test table and populate it on the fly.
         -- Not to worry... this takes less than 5 seconds.
       WITH
    cteGenRows AS
    (
     SELECT TOP (@RowsToTest)
            Number  = CAST(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS INT),
            [Value] = CAST(ABS(CHECKSUM(NEWID()))%1000 AS VARCHAR(3)) + ','
                    + CAST(ABS(CHECKSUM(NEWID()))%1000 AS VARCHAR(3))
       FROM sys.all_columns ac1
      CROSS JOIN sys.all_columns ac2
    )
     SELECT Number  = ISNULL(Number,0), --The ISNULL makes this column NOT NULL
            [Value] = CASE
                         WHEN (Number-1)%(@RowsToTest/25) <> 0
                         THEN [Value]
                         ELSE 'Category ' + CHAR((Number-1)/(@RowsToTest/25)+65)
                      END,
            Category             = CAST(NULL AS VARCHAR(100)), --This creates an empty column
            SomeMeasurement      = CAST(NULL AS INT),           --This creates an empty column
            SomeOtherMeasurement = CAST(NULL AS INT)            --This creates an empty column
       INTO #Elements
       FROM cteGenRows
    ;
    --===== Add the quintessential Clustered Index.
         -- This adds about 4 seconds to the job.
      ALTER TABLE #Elements
        ADD PRIMARY KEY CLUSTERED (Number) WITH FILLFACTOR = 100
    ;

    I tried your code on the million row table just to see if the Triangular joins changed up. They didn't and I stopped your code after 6 minutes.

    The following code updates the original staging table with all the information you asked for in less than 11 seconds. It uses a special update called a "Protected Quirky Update". I'm in the process of updating the article with the "Protected" part but here's the article about how it works (pay no attention to the explanation about the execution plan... it's wrong and will be corrected as well). Please notice that you have to follow the rules given or this method will burn you.

    http://www.sqlservercentral.com/articles/T-SQL/68467/

    Part of the rules will include the new safety feature that Paul White and Tom Thomson contributed. Gotta say it again... if you don't follow the rules in the article along with this new safety feature, don't use this method. Use a Cursor, instead. It'll still be a lot faster than the Triangular Joins.

    --========================================================================================
    --      "Smear" the data down using a protected Quirky Update.
    --      This is the solution to the problem.
    --========================================================================================
    --===== Declare some obviously named variables
    DECLARE @PrevCategory VARCHAR(100),
            @Anchor       INT,
            @Safety       INT
    ;
     SELECT @Safety = 1
    ;
    --===== Now, "Smear" the category names "downward".
         -- This also takes care of the split on measurements.
         -- This whole thing takes less than 12 seconds.
     UPDATE tgt
        SET @PrevCategory = Category
                          = CASE
                               WHEN Number = @Safety
                               THEN CASE
                                       WHEN [Value] LIKE 'Category%'
                                       THEN [Value]
                                       ELSE @PrevCategory
                                    END
                               ELSE CAST(1/0 AS VARCHAR(100)) --Forces an error if the Number and the Safety get out of sync
                            END,
            @Anchor = Number,
            SomeMeasurement      = CASE
                                      WHEN [Value] NOT LIKE 'Category%'
                                      THEN SUBSTRING([Value],1,CHARINDEX(',',[Value])-1)
                                   END,
            SomeOtherMeasurement = CASE
                                      WHEN [Value] NOT LIKE 'Category%'
                                      THEN SUBSTRING([Value],CHARINDEX(',',[Value])+1,8000)
                                   END,
            @Safety = @Safety + 1
       FROM #Elements tgt WITH (TABLOCKX)
     OPTION (MAXDOP 1)
    ;

    Then, just select whatever you need in the order you need it.
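    For example, something along these lines should do it (just a sketch; it assumes the update above has already populated the three added columns):

    --===== Pull the flattened detail rows back out, skipping the 'Category' header rows.
     SELECT Category, SomeMeasurement, SomeOtherMeasurement
       FROM #Elements
      WHERE [Value] NOT LIKE 'Category%'
      ORDER BY Number
    ;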

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • For what it is worth, here is an alternative to the Quirky Update also using Jeff's data (thanks for the data Jeff!). First here is a plain select:

    ;WITH cte AS
    (
        SELECT Number, Value, ROW_NUMBER() OVER (ORDER BY Number) AS RowNum
        FROM #Elements
        WHERE Value LIKE 'Category%'
    )
    SELECT
        T.Value,
        CASE
            WHEN e.Value NOT LIKE 'Category%'
            THEN SUBSTRING(e.Value, 1, CHARINDEX(',', e.Value) - 1)
        END AS Value1,
        CASE
            WHEN e.Value NOT LIKE 'Category%'
            THEN SUBSTRING(e.Value, CHARINDEX(',', e.Value) + 1, 8000)
        END AS Value2
    FROM
        #Elements AS e
        INNER JOIN
        (
            SELECT
                A.Value,
                A.Number AS NumberStart,
                COALESCE(B.Number, 2147483647) AS NumberEnd
            FROM
                cte AS A
                LEFT OUTER JOIN cte AS B
                    ON A.RowNum = B.RowNum - 1
        ) AS T
            ON  e.Number > T.NumberStart
            AND e.Number < T.NumberEnd;

    Next, here is an update that should produce the same results as Jeff's update, but without using the Quirky method:

    WITH cte AS
    (
        SELECT Number, Value, ROW_NUMBER() OVER (ORDER BY Number) AS RowNum
        FROM #Elements
        WHERE Value LIKE 'Category%'
    )
    UPDATE e
    SET
        SomeMeasurement =
            CASE
                WHEN e.Value NOT LIKE 'Category%'
                THEN SUBSTRING(e.Value, 1, CHARINDEX(',', e.Value) - 1)
            END,
        SomeOtherMeasurement =
            CASE
                WHEN e.Value NOT LIKE 'Category%'
                THEN SUBSTRING(e.Value, CHARINDEX(',', e.Value) + 1, 8000)
            END
    FROM
        #Elements AS e
        INNER JOIN
        (
            SELECT
                A.Value,
                A.Number AS NumberStart,
                COALESCE(B.Number, 2147483647) AS NumberEnd
            FROM
                cte AS A
                LEFT OUTER JOIN cte AS B
                    ON A.RowNum = B.RowNum - 1
        ) AS T
            ON  e.Number > T.NumberStart
            AND e.Number < T.NumberEnd;

    On my laptop both updates run in the same amount of time.

  • Lamprey13, thanks for your take. Your solution took quite a bit longer than Jeff's solution on my machine, so I timed both to get an idea of the differences, with SET STATISTICS IO, TIME ON:

    Jeff's:

    (1000000 row(s) affected)
    Table '#Elements_____________________000000000008'. Scan count 1, logical reads 43336, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
     SQL Server Execution Times:
       CPU time = 4407 ms,  elapsed time = 6022 ms.

    (1000000 row(s) affected)
     SQL Server Execution Times:
       CPU time = 0 ms,  elapsed time = 0 ms.

    Yours:

    (1000000 row(s) affected)
    Table '#Elements_____________________000000000007'. Scan count 27, logical reads 3121652, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 24, logical reads 37156642, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
     SQL Server Execution Times:
       CPU time = 70656 ms,  elapsed time = 72764 ms.

    (999975 row(s) affected)
     SQL Server Execution Times:
       CPU time = 0 ms,  elapsed time = 0 ms.

    I think this constitutes a triangular JOIN:

    ON A.RowNum = B.RowNum - 1

    My query does something similar, looking back at "previous rows", creating a nested loop and a lot of additional IO; still, your query provides a vast improvement over my initial attempt.

    Thanks for your posts, Jeff. I figured I might be looking at a QU as the fastest way possible but wasn't 100% sure, which is why I posted 😀

    ...wow, there are a lot of rules! I had previously read both of the articles you linked to, but have resisted adopting the QU technique because of 1) the number of rules (did I mention that already?) and 2) the reliance on undocumented behavior (sorry to say it... again... I know you know), which leaves the chance that a CU, SP, or upgrade will "break" the code. I do like the new 1/0 addition. That said, I need to take another look at the QU. I look forward to your upcoming article too. Thanks!

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • opc.three (4/26/2011)


    I think this constitutes a triangular JOIN:

    ON A.RowNum = B.RowNum - 1

    I didn't go through everything, but that's not a triangular join. This is:

    A.RowNum < B.RowNum

    A common running totals workaround.

    EDIT: That's what I get for not reading everything. Here's the triangle you spotted:

    ON e.Number > T.NumberStart

    AND e.Number < T.NumberEnd


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • I don't think that's a triangle. It's basically covering the range of values between two categories.

    I tried a similar approach in parallel, but Lamprey13 posted his solution earlier. And I can confirm the performance metrics opc.three mentioned. Almost exactly, actually.

    So the question comes down to this: does being 15 times faster than the ROW_NUMBER approach justify the extra rules that have to be followed to keep the fast solution robust?

    I'd say: it depends, with a trend towards "yes". It depends on the actual number of rows, how often the query is called, and the skill level available to maintain the code.



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question
    How to post performance related questions
    Links for Tally Table, Cross Tabs and Dynamic Cross Tabs, Delimited Split Function

  • If you really aren't comfortable with the QU (I don't have a problem with it myself, but have yet to need it...), then a simpler pre-select and index gets the job done (building a new table, that is) in a total duration of about 2 secs using Jeff's sample data on my desktop.

    SELECT Number, Value
      INTO #categories
      FROM #Elements
     WHERE Value LIKE 'Cat%';

    CREATE UNIQUE CLUSTERED INDEX ix1 ON #categories (Number);

    SELECT cat.Category,
           SUBSTRING(e.Value, 1, CHARINDEX(',', e.Value, 1) - 1) AS Value1,
           SUBSTRING(e.Value, CHARINDEX(',', e.Value, 1) + 1, 100) AS Value2
      INTO #Results
      FROM #Elements AS e
     CROSS APPLY (
                  SELECT TOP 1 Value
                    FROM #categories AS categories
                   WHERE categories.Number < e.Number
                   ORDER BY categories.Number DESC
                 ) AS cat (Category)
     WHERE e.Value LIKE '[0-9]%';
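    If you have also run Jeff's Quirky Update against the same #Elements table, a quick sanity check along these lines (just a sketch, assuming both result sets came from the same data) should come back empty, since both approaches ought to assign the same category to the same detail rows:

    -- Compare per-category detail-row counts from the pre-select approach
    -- against the smeared staging table; expect zero rows back if they agree.
    SELECT Category, COUNT(*) AS DetailRows
      FROM #Results
     GROUP BY Category
    EXCEPT
    SELECT Category, COUNT(*)
      FROM #Elements
     WHERE Value NOT LIKE 'Category%'
     GROUP BY Category;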

    MM




  • Forum Etiquette: How to post Reporting Services problems
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • How to Post Performance Problems - by Gail Shaw

  • Awesome thread all 🙂

    Craig Farrell (4/26/2011)


    opc.three (4/26/2011)


    I think this constitutes a triangular JOIN:

    ON A.RowNum = B.RowNum - 1

    I didn't go through everything, but that's not a triangular join. This is:

    A.RowNum < B.RowNum

    A common running totals workaround.

    EDIT: That's what I get for not reading everything. Here's the triangle you spotted:

    ON e.Number > T.NumberStart

    AND e.Number < T.NumberEnd

    Hey Craig, I could see that something bad was going on in the query... but looking at it closer and comparing it to the examples in Jeff's article, I'm not sure it is triangular... it's neat though... I don't know what to call it... a STRIPE JOIN? :hehe:

    LutzM (4/26/2011)


    I don't think that's a triangle. It's basically covering the range of values between two categories.

    I agree that it may not be purely triangular, but it reads a ton more data than the QU method, though less than a full triangle since it limits the scan to the range between two categories, as you said.

    LutzM (4/26/2011)


    So the question comes down to this: does being 15 times faster than the ROW_NUMBER approach justify the extra rules that have to be followed to keep the fast solution robust?

    I put this question out there with the hunch that QU would be what I got back as the "best performing" option. Like you, MM, I have not been pressed to use the QU method... till now... but unlike you I have taken issue with it due to the two points I previously made. I will need to decide in my case whether it's justified based on all the factors. Lutz, your points are well taken, especially "skill level available to maintain the code"... I venture to say that QU is not for the SQL novice. Big surprise though: to QU or not to QU depends on the situation.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato
