Best way to normalize

  • The SELECT query below returns the desired resultset but it has to "look back" at the data to find the Category for each row. I could have done an explicit loop, but figured I should at least be considerate enough to hide my RBAR 😀 I am confident this technique will slow down significantly as the data set grows. Is there a "better" way to write this query? I have a tally table and am on SQL 2008.

    DECLARE @Elements TABLE
        (
        Number INT,
        Value  VARCHAR(MAX)
        );

    INSERT INTO @Elements
            (Number, Value)
    SELECT  1 AS [Number], N'Category A' AS [Value] UNION ALL
    SELECT  2, N'281,286'    UNION ALL
    SELECT  3, N'281,283'    UNION ALL
    SELECT  4, N'281,282'    UNION ALL
    SELECT  5, N'281,282'    UNION ALL
    SELECT  6, N'280,283'    UNION ALL
    SELECT  7, N'Category B' UNION ALL
    SELECT  8, N'250,257'    UNION ALL
    SELECT  9, N'265,256'    UNION ALL
    SELECT 10, N'234,255'    UNION ALL
    SELECT 11, N'249,254'    UNION ALL
    SELECT 12, N'250,254'    UNION ALL
    SELECT 13, N'Category C' UNION ALL
    SELECT 14, N'281,311'    UNION ALL
    SELECT 15, N'218,220'    UNION ALL
    SELECT 16, N'218,218'    UNION ALL
    SELECT 17, N'219,221'    UNION ALL
    SELECT 18, N'218,219';

    SELECT  ( SELECT Value
                FROM @Elements
               WHERE Number = ( SELECT MAX(Number)
                                  FROM @Elements
                                 WHERE Number < e.Number
                                   AND Value LIKE 'Category%' )
            ) AS Category,
            SUBSTRING(e.Value, 1, CHARINDEX(',', e.Value, 1) - 1) AS Value1,
            SUBSTRING(e.Value, CHARINDEX(',', e.Value, 1) + 1, 100) AS Value2
      FROM @Elements e
     WHERE e.Value NOT LIKE 'Category%';

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • I'm not sure what the data represents, but it appears that you'd be better off changing the schema rather than working on the query. Again, without knowing what the data represents, I'd suggest a Category table and then relate the Element table to the category table.
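    As a rough sketch of the kind of schema being suggested here (the table and column names below are illustrative assumptions only, not anything defined in this thread):

    -- Hypothetical normalized shape: categories live in their own table and
    -- each measurement row points back at its category.
    CREATE TABLE dbo.Category
        (
        CategoryID   INT IDENTITY(1,1) PRIMARY KEY,
        CategoryName VARCHAR(100) NOT NULL
        );

    CREATE TABLE dbo.Element
        (
        ElementID            INT IDENTITY(1,1) PRIMARY KEY,
        CategoryID           INT NOT NULL
                                 REFERENCES dbo.Category (CategoryID),
        SomeMeasurement      INT,
        SomeOtherMeasurement INT
        );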

  • Lamprey13 (4/25/2011)


    I'm not sure what the data represents, but it appears that you'd be better off changing the schema rather than working on the query. Again, without knowing what the data represents, I'd suggest a Category table and then relate the Element table to the category table.

    The end goal is to put the data into a table that looks like this, but first I need to relate the category from a preceding row to the rows that follow it:

    CREATE TABLE dbo.CategoryInfoAboutSpecificMeasurements
        (
        CategoryID           INT,
        SomeMeasurement      INT,
        SomeOtherMeasurement INT
        );

    The data in my sample comes from a file and using SSIS is not an option. I have to flatten it so I can normalize it. Also worth mentioning: there can be a different number of data rows between categories, i.e. unlike my test data the Category Name will not be in every 6th row of the data file.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • The query creates a couple of Triangular joins where one internal leg consists of 270 rows and the other consists of 135 rows. Please see the following article for why that's a bad thing.

    http://www.sqlservercentral.com/articles/T-SQL/61539/
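    To put rough numbers on the shape being described: a "look back" predicate like the one in the original query,

        WHERE Number < e.Number
              AND Value LIKE 'Category%'

    forces each detail row to re-examine the rows that precede it, so the total work grows roughly with N squared over 2 rather than with N. That's nothing for an 18-row sample, but for a million rows it is on the order of 5 * 10^11 row comparisons, which is why this kind of query falls apart as the data grows. (This is only back-of-the-envelope arithmetic to illustrate the pattern the article describes; the exact counts depend on the predicates and indexing involved.)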

    How many rows does this "staging" table normally contain?

    And, since it is just a staging table, can we add an extra column to it?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Alright... I'll just do like I normally do and test this type of thing on a million rows. Here's the code to make a million row test table like your original table. The key here is that this table is supposed to be a staging table of some sort. If that's true, then you should be able to add the 3 columns I made to speed things up and to allow for verification compared to the original data.

    Not to worry... this whole thing, including building the Clustered Index takes something less than 10 seconds on my 9 year old desktop. It should fly on a more modern machine. As usual, the details are in the comments.

    --========================================================================================
    --      Build the test data. Only the extra columns we added are part of the solution.
    --========================================================================================
    --===== Conditionally drop the test table to make reruns easier.
         IF OBJECT_ID('tempdb..#Elements','U') IS NOT NULL
            DROP TABLE #Elements
    ;
    --===== Declare how many rows we want in the test table.
    DECLARE @RowsToTest INT;
     SELECT @RowsToTest = 1000000;

    --===== Now, build the test table and populate it on the fly.
         -- Not to worry... this takes less than 5 seconds.
       WITH
    cteGenRows AS
    (
     SELECT TOP (@RowsToTest)
            Number  = CAST(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS INT),
            [Value] = CAST(ABS(CHECKSUM(NEWID()))%1000 AS VARCHAR(3)) + ','
                    + CAST(ABS(CHECKSUM(NEWID()))%1000 AS VARCHAR(3))
       FROM sys.all_columns ac1
      CROSS JOIN sys.all_columns ac2
    )
     SELECT Number  = ISNULL(Number,0), --The ISNULL makes this column NOT NULL
            [Value] = CASE
                         WHEN (Number-1)%(@RowsToTest/25) <> 0
                         THEN [Value]
                         ELSE 'Category ' + CHAR((Number-1)/(@RowsToTest/25)+65)
                      END,
            Category             = CAST(NULL AS VARCHAR(100)), --This creates an empty column
            SomeMeasurement      = CAST(NULL AS INT),           --This creates an empty column
            SomeOtherMeasurement = CAST(NULL AS INT)            --This creates an empty column
       INTO #Elements
       FROM cteGenRows
    ;
    --===== Add the quintessential Clustered Index.
         -- This adds about 4 seconds to the job.
      ALTER TABLE #Elements
        ADD PRIMARY KEY CLUSTERED (Number) WITH FILLFACTOR = 100
    ;

    I tried your code on the million row table just to see if the Triangular joins changed up. They didn't and I stopped your code after 6 minutes.

    The following code updates the original staging table with all the information you asked for in less than 11 seconds. It uses a special update called a "Protected Quirky Update". I'm in the process of updating the article with the "Protected" part but here's the article about how it works (pay no attention to the explanation about the execution plan... it's wrong and will be corrected as well). Please notice that you have to follow the rules given or this method will burn you.

    http://www.sqlservercentral.com/articles/T-SQL/68467/

    Part of the rules will include the new safety feature that Paul White and Tom Thomson contributed. Gotta say it again... if you don't follow the rules in the article along with this new safety feature, don't use this method. Use a Cursor, instead. It'll still be a lot faster than the Triangular Joins.

    --========================================================================================
    --      "Smear" the data down using a protected Quirky Update.
    --      This is the solution to the problem.
    --========================================================================================
    --===== Declare some obviously named variables
    DECLARE @PrevCategory VARCHAR(100),
            @Anchor       INT,
            @Safety       INT
    ;
     SELECT @Safety = 1
    ;
    --===== Now, "Smear" the category names "downward".
         -- This also takes care of the split on measurements.
         -- This whole thing takes less than 12 seconds.
     UPDATE tgt
        SET @PrevCategory = Category
                          = CASE
                               WHEN Number = @Safety
                               THEN CASE
                                       WHEN [Value] LIKE 'Category%'
                                       THEN [Value]
                                       ELSE @PrevCategory
                                    END
                               ELSE CAST(1/0 AS VARCHAR(100)) --Forces an error if the Number and the Safety get out of sync
                            END,
            @Anchor = Number,
            SomeMeasurement      = CASE
                                      WHEN [Value] NOT LIKE 'Category%'
                                      THEN SUBSTRING([Value],1,CHARINDEX(',',[Value])-1)
                                   END,
            SomeOtherMeasurement = CASE
                                      WHEN [Value] NOT LIKE 'Category%'
                                      THEN SUBSTRING([Value],CHARINDEX(',',[Value])+1,8000)
                                   END,
            @Safety = @Safety + 1
       FROM #Elements tgt WITH (TABLOCKX)
     OPTION (MAXDOP 1)
    ;

    Then, just select whatever you need in the order you need it.
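    For example, something along these lines should do it (just a sketch; it assumes the update above has already populated the three added columns):

    --===== Pull the flattened detail rows back out, skipping the 'Category' header rows.
     SELECT Category, SomeMeasurement, SomeOtherMeasurement
       FROM #Elements
      WHERE [Value] NOT LIKE 'Category%'
      ORDER BY Number
    ;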

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • For what it is worth, here is an alternative to the Quirky Update also using Jeff's data (thanks for the data Jeff!). First here is a plain select:

    ;WITH cte AS
    (
        SELECT Number, Value, ROW_NUMBER() OVER (ORDER BY Number) AS RowNum
        FROM #Elements
        WHERE Value LIKE 'Category%'
    )
    SELECT
        T.Value,
        CASE
            WHEN e.Value NOT LIKE 'Category%'
            THEN SUBSTRING(e.Value, 1, CHARINDEX(',', e.Value) - 1)
        END AS Value1,
        CASE
            WHEN e.Value NOT LIKE 'Category%'
            THEN SUBSTRING(e.Value, CHARINDEX(',', e.Value) + 1, 8000)
        END AS Value2
    FROM
        #Elements AS e
        INNER JOIN
        (
            SELECT
                A.Value,
                A.Number AS NumberStart,
                COALESCE(B.Number, 2147483647) AS NumberEnd
            FROM
                cte AS A
                LEFT OUTER JOIN cte AS B
                    ON A.RowNum = B.RowNum - 1
        ) AS T
            ON  e.Number > T.NumberStart
            AND e.Number < T.NumberEnd;

    Next, here is an update that should produce the same results as Jeff's update, but without using the Quirky method:

    WITH cte AS
    (
        SELECT Number, Value, ROW_NUMBER() OVER (ORDER BY Number) AS RowNum
        FROM #Elements
        WHERE Value LIKE 'Category%'
    )
    UPDATE e
    SET
        SomeMeasurement =
            CASE
                WHEN e.Value NOT LIKE 'Category%'
                THEN SUBSTRING(e.Value, 1, CHARINDEX(',', e.Value) - 1)
            END,
        SomeOtherMeasurement =
            CASE
                WHEN e.Value NOT LIKE 'Category%'
                THEN SUBSTRING(e.Value, CHARINDEX(',', e.Value) + 1, 8000)
            END
    FROM
        #Elements AS e
        INNER JOIN
        (
            SELECT
                A.Value,
                A.Number AS NumberStart,
                COALESCE(B.Number, 2147483647) AS NumberEnd
            FROM
                cte AS A
                LEFT OUTER JOIN cte AS B
                    ON A.RowNum = B.RowNum - 1
        ) AS T
            ON  e.Number > T.NumberStart
            AND e.Number < T.NumberEnd;

    On my laptop both updates run in the same amount of time.

  • Lamprey13, thanks for your take. Your solution took quite a bit longer than Jeff's solution on my machine, so I timed both to get an idea of the differences, with SET STATISTICS IO, TIME ON:

    Jeff's:

    (1000000 row(s) affected)
    Table '#Elements_____________________000000000008'. Scan count 1, logical reads 43336, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
     SQL Server Execution Times:
       CPU time = 4407 ms,  elapsed time = 6022 ms.

    (1000000 row(s) affected)
     SQL Server Execution Times:
       CPU time = 0 ms,  elapsed time = 0 ms.

    Yours:

    (1000000 row(s) affected)
    Table '#Elements_____________________000000000007'. Scan count 27, logical reads 3121652, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 24, logical reads 37156642, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
     SQL Server Execution Times:
       CPU time = 70656 ms,  elapsed time = 72764 ms.

    (999975 row(s) affected)
     SQL Server Execution Times:
       CPU time = 0 ms,  elapsed time = 0 ms.

    I think this constitutes a triangular JOIN:

    ON A.RowNum = B.RowNum - 1

    My query does something similar, looking back at "previous rows", creating a nested loop and a lot of additional IO; still, your query provides a vast improvement over my initial attempt.

    Thanks for your posts, Jeff. I figured I might be looking at a QU as the fastest way possible but wasn't 100% sure, which is why I posted 😀

    ...wow, there are a lot of rules! I had previously read both of the articles you linked to, but have resisted adopting the QU technique because of 1) the number of rules (did I mention that already?) and 2) the reliance on undocumented behavior (sorry to say it... again... I know you know), which leaves the chance that a CU, SP, or upgrade will "break" the code. I do like the new 1/0 addition. That said, I need to take another look at the QU. I look forward to your upcoming article too. Thanks!

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • opc.three (4/26/2011)


    I think this constitutes a triangular JOIN:

    ON A.RowNum = B.RowNum - 1

    I didn't go through everything, but that's not a triangular join. This is:

    A.RowNum < B.RowNum

    A common running totals workaround.

    EDIT: That's what I get for not reading everything. Here's the triangle you spotted:

    ON e.Number > T.NumberStart

    AND e.Number < T.NumberEnd


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • I don't think that's a triangle. It's basically covering the range of values between two categories.

    I tried a similar approach in parallel, but Lamprey13 posted his solution earlier. And I can confirm the performance metrics opc.three mentioned. Almost exactly, actually.

    So the question comes down to this: does being 15 times faster than the ROW_NUMBER approach justify the extra rules that have to be followed to keep the fast solution robust?

    I'd say: it depends, with a trend towards "yes". It depends on the actual number of rows, how often the query is called, and the skill level available to maintain the code.



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question
    How to post performance related questions
    Links for Tally Table, Cross Tabs and Dynamic Cross Tabs, Delimited Split Function

  • If you really aren't comfortable with the QU (I don't have a problem with it myself, but have yet to need it...), then a simpler pre-select and index gets the job done (building a new table, that is) in a total duration of about 2 secs using Jeff's sample data on my desktop.

    SELECT Number, Value
      INTO #categories
      FROM #Elements
     WHERE Value LIKE 'Cat%';

    CREATE UNIQUE CLUSTERED INDEX ix1 ON #categories (Number);

    SELECT cat.Category,
           SUBSTRING(e.Value, 1, CHARINDEX(',', e.Value, 1) - 1) AS Value1,
           SUBSTRING(e.Value, CHARINDEX(',', e.Value, 1) + 1, 100) AS Value2
      INTO #Results
      FROM #Elements AS e
     CROSS APPLY (
                  SELECT TOP 1 Value
                    FROM #categories AS categories
                   WHERE categories.Number < e.Number
                   ORDER BY categories.Number DESC
                 ) AS cat (Category)
     WHERE e.Value LIKE '[0-9]%';
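    If you have also run Jeff's Quirky Update against the same #Elements table, a quick sanity check along these lines (just a sketch, assuming both result sets came from the same data) should come back empty, since both approaches ought to assign the same category to the same detail rows:

    -- Compare per-category detail-row counts from the pre-select approach
    -- against the smeared staging table; expect zero rows back if they agree.
    SELECT Category, COUNT(*) AS DetailRows
      FROM #Results
     GROUP BY Category
    EXCEPT
    SELECT Category, COUNT(*)
      FROM #Elements
     WHERE Value NOT LIKE 'Category%'
     GROUP BY Category;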

    MM




  • Forum Etiquette: How to post Reporting Services problems
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • How to Post Performance Problems - by Gail Shaw

  • Awesome thread all 🙂

    Craig Farrell (4/26/2011)


    opc.three (4/26/2011)


    I think this constitutes a triangular JOIN:

    ON A.RowNum = B.RowNum - 1

    I didn't go through everything, but that's not a triangular join. This is:

    A.RowNum < B.RowNum

    A common running totals workaround.

    EDIT: That's what I get for not reading everything. Here's the triangle you spotted:

    ON e.Number > T.NumberStart

    AND e.Number < T.NumberEnd

    Hey Craig, I could see that something bad was going on in the query... but looking at it closer and comparing it to the examples in Jeff's article, I'm not sure it is triangular... it's neat though... I don't know what to call it... a STRIPE JOIN? :hehe:

    LutzM (4/26/2011)


    I don't think that's a triangle. It's basically covering the range of values between two categories.

    I agree that it may not be purely triangular, but it reads a ton more data than the QU method, though less than a full triangle since it limits the scan to the range between two categories, as you said.

    LutzM (4/26/2011)


    So the question comes down to this: does being 15 times faster than the ROW_NUMBER approach justify the extra rules that have to be followed to keep the fast solution robust?

    I put this question out there with the hunch that QU would be what I got back as the "best performing" option. Like you, MM, I have not been pressed to use the QU method... till now... but unlike you I have taken issue with it due to the two points I previously made. I will need to decide in my case whether it's justified based on all the factors. Lutz, your points are well taken, especially "skill level available to maintain the code"... I venture to say that QU is not for the SQL novice. Big surprise though: to QU or not to QU depends on the situation.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato
