April 25, 2011 at 6:05 am
The SELECT query below returns the desired resultset but it has to "look back" at the data to find the Category for each row. I could have done an explicit loop, but figured I should at least be considerate enough to hide my RBAR 😀 I am confident this technique will slow down significantly as the data set grows. Is there a "better" way to write this query? I have a tally table and am on SQL 2008.
DECLARE @Elements TABLE
(
Number INT ,
Value VARCHAR(MAX)
) ;
INSERT INTO @Elements
( Number ,
Value
)
SELECT 1 AS [Number] ,
N'Category A' AS [Value]
UNION ALL
SELECT 2 AS [Number] ,
N'281,286' AS [Value]
UNION ALL
SELECT 3 AS [Number] ,
N'281,283' AS [Value]
UNION ALL
SELECT 4 AS [Number] ,
N'281,282' AS [Value]
UNION ALL
SELECT 5 AS [Number] ,
N'281,282' AS [Value]
UNION ALL
SELECT 6 AS [Number] ,
N'280,283' AS [Value]
UNION ALL
SELECT 7 AS [Number] ,
N'Category B' AS [Value]
UNION ALL
SELECT 8 AS [Number] ,
N'250,257' AS [Value]
UNION ALL
SELECT 9 AS [Number] ,
N'265,256' AS [Value]
UNION ALL
SELECT 10 AS [Number] ,
N'234,255' AS [Value]
UNION ALL
SELECT 11 AS [Number] ,
N'249,254' AS [Value]
UNION ALL
SELECT 12 AS [Number] ,
N'250,254' AS [Value]
UNION ALL
SELECT 13 AS [Number] ,
N'Category C' AS [Value]
UNION ALL
SELECT 14 AS [Number] ,
N'281,311' AS [Value]
UNION ALL
SELECT 15 AS [Number] ,
N'218,220' AS [Value]
UNION ALL
SELECT 16 AS [Number] ,
N'218,218' AS [Value]
UNION ALL
SELECT 17 AS [Number] ,
N'219,221' AS [Value]
UNION ALL
SELECT 18 AS [Number] ,
N'218,219' AS [Value] ;
SELECT ( SELECT Value
FROM @Elements
WHERE Number = ( SELECT MAX(Number)
FROM @Elements
WHERE Number < e.Number
AND Value LIKE 'Category%'
)
) AS Category ,
SUBSTRING(e.Value, 1, CHARINDEX(',', e.Value, 1) - 1) AS Value1 ,
SUBSTRING(e.Value, CHARINDEX(',', e.Value, 1) + 1, 100) AS Value2
FROM @Elements e
WHERE e.VALUE NOT LIKE 'Category%' ;
There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato
April 25, 2011 at 12:14 pm
I'm not sure what the data represents, but it appears that you'd be better off changing the schema rather than working on the query. Again, without knowing what the data represents, I'd suggest a Category table and then relate the Element table to the category table.
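Just to sketch the idea (the table and column names below are only placeholders, since I don't know what the values represent, and I'm assuming the comma pair gets split into two columns):
CREATE TABLE dbo.Category
(
    CategoryID   INT IDENTITY(1,1) PRIMARY KEY,
    CategoryName VARCHAR(100) NOT NULL
);
CREATE TABLE dbo.Element
(
    ElementID  INT IDENTITY(1,1) PRIMARY KEY,
    CategoryID INT NOT NULL REFERENCES dbo.Category (CategoryID), --Relate each element back to its category
    Value1     INT NOT NULL,
    Value2     INT NOT NULL
);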
April 25, 2011 at 12:55 pm
Lamprey13 (4/25/2011)
I'm not sure what the data represents, but it appears that you'd be better off changing the schema rather than working on the query. Again, without knowing what the data represents, I'd suggest a Category table and then relate the Element table to the category table.
The end goal is to put the data into a table that looks like this, but first I need to relate the category from a preceding row to the rows that follow it:
CREATE TABLE dbo.CategoryInfoAboutSpecificMeasurements
(
CategoryID INT,
SomeMeasurement INT,
SomeOtherMeasurement INT
) ;
The data in my sample comes from a file, and using SSIS is not an option. I have to flatten it so I can normalize it. Also worth mentioning: there can be a different number of data rows between categories, i.e., unlike my test data, the category name will not be in every 6th row of the data file.
There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato
April 25, 2011 at 7:01 pm
The query creates a couple of Triangular joins where one internal leg consists of 270 rows and the other consists of 135 rows. Please see the following article for why that's a bad thing.
http://www.sqlservercentral.com/articles/T-SQL/61539/
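Just so we're talking about the same thing, a triangular join is a self-join on an inequality, so each row gets compared to all of the rows that come "before" it. Here's a stripped-down sketch of the shape (not your exact query, just the pattern):
--===== Each row joins to every prior row, so N rows cost roughly N*N/2 internal comparisons.
SELECT e1.Number,
       COUNT(*) AS PriorRowsTouched
  FROM @Elements e1
  JOIN @Elements e2
    ON e2.Number < e1.Number --The inequality is what makes the join "triangular"
 GROUP BY e1.Number;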
How many rows does this "staging" table normally contain?
And, since it is just a staging table, can we add an extra column to it?
--Jeff Moden
Change is inevitable... Change for the better is not.
April 25, 2011 at 9:12 pm
Alright... I'll just do like I normally do and test this type of thing on a million rows. Here's the code to make a million-row test table like your original table. The key here is that this table is supposed to be a staging table of some sort. If that's true, then you should be able to add the 3 columns I made to speed things up and to allow for verification against the original data.
Not to worry... this whole thing, including building the Clustered Index, takes something less than 10 seconds on my 9-year-old desktop. It should fly on a more modern machine. As usual, the details are in the comments.
--========================================================================================
-- Build the test data. Only the extra column we added is a part of the solution.
--========================================================================================
--===== Conditionally drop the test table to make reruns easier.
IF OBJECT_ID('tempdb..#Elements','U') IS NOT NULL
DROP TABLE #Elements
;
--===== Declare how many rows we want in the test table.
DECLARE @RowsToTest INT;
SELECT @RowsToTest = 1000000;
--===== Now, build the test table and populate it on the fly.
-- Not to worry... this takes less than 5 seconds.
WITH
cteGenRows AS
(
SELECT TOP (@RowsToTest)
Number = CAST(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS INT),
[Value] = CAST(ABS(CHECKSUM(NEWID()))%1000 AS VARCHAR(3)) + ','
+ CAST(ABS(CHECKSUM(NEWID()))%1000 AS VARCHAR(3))
FROM sys.all_columns ac1
CROSS JOIN sys.all_columns ac2
)
SELECT Number = ISNULL(Number,0), --The ISNULL makes this column NOT NULL
[Value] = CASE
WHEN (Number-1)%(@RowsToTest/25) <> 0
THEN [Value]
ELSE 'Category '+CHAR((Number-1)/(@RowsToTest/25)+65)
END,
Category = CAST(NULL AS VARCHAR(100)), --This creates an empty column
SomeMeasurement = CAST(NULL AS INT), --This creates an empty column
SomeOtherMeasurement = CAST(NULL AS INT) --This creates an empty column
INTO #Elements
FROM cteGenRows
;
--===== Add the quintessential Clustered Index
-- This adds about 4 seconds to the job.
ALTER TABLE #Elements
ADD PRIMARY KEY CLUSTERED (Number) WITH FILLFACTOR = 100
;
I tried your code on the million row table just to see if the Triangular joins changed up. They didn't and I stopped your code after 6 minutes.
The following code updates the original staging table with all the information you asked for in less than 11 seconds. It uses a special update called a "Protected Quirky Update". I'm in the process of updating the article with the "Protected" part but here's the article about how it works (pay no attention to the explanation about the execution plan... it's wrong and will be corrected as well). Please notice that you have to follow the rules given or this method will burn you.
http://www.sqlservercentral.com/articles/T-SQL/68467/
Part of the rules will include the new safety feature that Paul White and Tom Thomson contributed. Gotta say it again... if you don't follow the rules in the article along with this new safety feature, don't use this method. Use a Cursor, instead. It'll still be a lot faster than the Triangular Joins.
--========================================================================================
-- "Smear" the data down using a protected Quirky Update
-- This is the solution to the problem
--========================================================================================
--===== Declare some obviously named variables
DECLARE @PrevCategory VARCHAR(100),
@Anchor INT,
@Safety INT
;
SELECT @Safety = 1
;
--===== Now, "Smear" the category names "downward".
-- This also takes care of the split on measurements.
-- This whole thing takes less than 12 seconds.
UPDATE tgt
SET @PrevCategory = Category
= CASE
WHEN Number = @Safety
THEN CASE
WHEN [Value] LIKE 'Category%'
THEN [Value]
ELSE @PrevCategory
END
ELSE CAST(1/0 AS VARCHAR(100)) --This forces an error if the Number and the Safety get out of sync
END,
@Anchor = Number,
SomeMeasurement = CASE
WHEN [Value] NOT LIKE 'Category%'
THEN SUBSTRING([Value],1,CHARINDEX(',',[Value])-1)
END,
SomeOtherMeasurement = CASE
WHEN [Value] NOT LIKE 'Category%'
THEN SUBSTRING([Value],CHARINDEX(',',[Value])+1,8000)
END,
@Safety = @Safety + 1
FROM #Elements tgt WITH (TABLOCKX)
OPTION (MAXDOP 1)
;
Then, just select whatever you need in the order you need it.
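For example, something like this pulls just the measurement rows back out with their "smeared" categories (only a sketch... swap in whatever columns and ordering you actually need):
SELECT Category, SomeMeasurement, SomeOtherMeasurement
  FROM #Elements
 WHERE [Value] NOT LIKE 'Category%' --Skip the category "header" rows
 ORDER BY Number;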
--Jeff Moden
Change is inevitable... Change for the better is not.
April 26, 2011 at 9:49 am
For what it is worth, here is an alternative to the Quirky Update, also using Jeff's data (thanks for the data, Jeff!). First, here is a plain select:
;WITH cte
AS
(
SELECT Number, Value, ROW_NUMBER() OVER (ORDER BY Number) AS RowNum
FROM #Elements
WHERE Value LIKE 'Category%'
)
SELECT
T.Value,
CASE
WHEN e.Value NOT LIKE 'Category%'
THEN SUBSTRING(e.Value, 1, CHARINDEX(',', e.Value) - 1)
END AS Value1,
CASE
WHEN e.Value NOT LIKE 'Category%'
THEN SUBSTRING(e.Value, CHARINDEX(',', e.Value) + 1, 8000)
END AS Value2
FROM
#Elements AS e
INNER JOIN
(
SELECT
A.Value,
A.Number AS NumberStart,
COALESCE(B.Number, 2147483647) AS NumberEnd
FROM
cte AS A
LEFT OUTER JOIN
cte AS B
ON A.RowNum = B.RowNum - 1
) AS T
ON e.Number > T.NumberStart
AND e.Number < T.NumberEnd
Next, here is an update that should produce the same results as Jeff's update, but without using the Quirky method:
WITH cte
AS
(
SELECT Number, Value, ROW_NUMBER() OVER (ORDER BY Number) AS RowNum
FROM #Elements
WHERE Value LIKE 'Category%'
)
UPDATE
e
SET
SomeMeasurement =
CASE
WHEN e.Value NOT LIKE 'Category%'
THEN SUBSTRING(e.Value, 1, CHARINDEX(',', e.Value) - 1)
END,
SomeOtherMeasurement =
CASE
WHEN e.Value NOT LIKE 'Category%'
THEN SUBSTRING(e.Value, CHARINDEX(',', e.Value) + 1, 8000)
END
FROM
#Elements AS e
INNER JOIN
(
SELECT
A.Value,
A.Number AS NumberStart,
COALESCE(B.Number, 2147483647) AS NumberEnd
FROM
cte AS A
LEFT OUTER JOIN
cte AS B
ON A.RowNum = B.RowNum - 1
) AS T
ON e.Number > T.NumberStart
AND e.Number < T.NumberEnd
On my laptop, both updates run in the same amount of time.
April 26, 2011 at 4:04 pm
Lamprey13, thanks for your take. Your solution took quite a bit longer than Jeff's solution on my machine, so I timed both to get an idea of the differences, with STATISTICS IO and STATISTICS TIME ON:
Jeff's:
(1000000 row(s) affected)
Table '#Elements_____________________000000000008'. Scan count 1, logical reads 43336, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 4407 ms, elapsed time = 6022 ms.
(1000000 row(s) affected)
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
Yours:
(1000000 row(s) affected)
Table '#Elements_____________________000000000007'. Scan count 27, logical reads 3121652, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 24, logical reads 37156642, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 70656 ms, elapsed time = 72764 ms.
(999975 row(s) affected)
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
I think this constitutes a triangular JOIN:
ON A.RowNum = B.RowNum - 1
My query does something similar, looking back at "previous rows", which creates a nested loop and lots of additional IO; still, your query provides a vast improvement over my initial attempt.
Thanks for your posts, Jeff. I figured a QU might be the fastest way possible but wasn't 100% sure, which is why I posted 😀
...wow, there are a lot of rules! I had previously read both of the articles you linked to, but have resisted adopting the QU technique because of 1) the number of rules (did I mention that already?) and 2) the reliance on undocumented behavior (sorry to say it... again... I know you know), which leaves the chance that a CU, SP, or upgrade could "break" the code. I do like the new 1/0 addition. That said, I need to take another look at QU. I look forward to your upcoming article too. Thanks!
There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato
April 26, 2011 at 4:37 pm
opc.three (4/26/2011)
I think this constitutes a triangular JOIN:
ON A.RowNum = B.RowNum - 1
I didn't go through everything, but that's not a triangular join. This is:
A.RowNum < B.RowNum
A common running totals workaround.
EDIT: That's what I get for not reading everything. Here's the triangle you spotted:
ON e.Number > T.NumberStart
AND e.Number < T.NumberEnd
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. | Tally Tables
Twitter: @AnyWayDBA
April 26, 2011 at 4:48 pm
I don't think that's a triangle. It's basically covering the range of values between two categories.
I tried a similar approach in parallel but Lamprey13 did post his solution earlier. And I can confirm the performance metrics opc.three mentioned. Almost exactly, actually.
So the question comes down to this: does being 15 times faster than the ROW_NUMBER approach justify the rules that have to be followed to make the fast solution robust?
I'd say: it depends, with a trend towards "yes". It depends on the actual number of rows, how frequently the query is called, and the skill level available to maintain the code.
April 26, 2011 at 5:15 pm
If you really aren't comfortable with the QU (I don't have a problem with it myself, but have yet to need it...), then a simpler pre-select and index gets the job done (building a new table, that is) in a total duration of about 2 seconds using Jeff's sample data on my desktop.
SELECT Number, VALUE
INTO   #categories
FROM   #Elements
WHERE  VALUE LIKE 'Cat%';
CREATE UNIQUE CLUSTERED INDEX ix1 ON #categories(Number);
SELECT cat.Category
, SUBSTRING(e.value, 1, CHARINDEX(',', e.Value, 1) - 1) AS Value1 ,
SUBSTRING(e.Value, CHARINDEX(',', e.Value, 1) + 1, 100) AS Value2
INTO #Results
FROM #Elements AS e
CROSS APPLY (
SELECT TOP 1 VALUE
FROM
#categories AS categories
WHERE
categories.Number < e.Number
ORDER BY
categories.Number DESC) AS cat(Category)
WHERE e.VALUE LIKE '[0-9]%';
MM
April 26, 2011 at 9:30 pm
Awesome thread all 🙂
Craig Farrell (4/26/2011)
opc.three (4/26/2011)
I think this constitutes a triangular JOIN:
ON A.RowNum = B.RowNum - 1
I didn't go through everything, but that's not a triangular join. This is:
A.RowNum < B.RowNum
A common running totals workaround.
EDIT: That's what I get for not reading everything. Here's the triangle you spotted:
ON e.Number > T.NumberStart
AND e.Number < T.NumberEnd
Hey Craig, I could see that something bad was going on in the query... but looking at it more closely and comparing it to the examples in Jeff's article, I am not sure it is triangular. It's neat, though... I just don't know what to call it... a STRIPE JOIN? :hehe:
LutzM (4/26/2011)
I don't think that's a triangle. It's basically covering the range of values between two categories.
I agree that it may not be purely triangular, but it still reads a ton more data than the QU method; it's less than a full triangle, though, because the scan is limited to the rows between two categories, as you said.
LutzM (4/26/2011)
So the question comes down to this: does being 15 times faster than the ROW_NUMBER approach justify the rules that have to be followed to make the fast solution robust?
I put this question out there with the hunch that a QU would come back as the "best performing" option. Like you, MM, I have not been pressed to use the QU method... 'til now... but unlike you I have taken issue with it due to the two points I previously made. I will need to decide in my case whether it's justified based on all the factors. Lutz, your points are well taken, especially "the skill level available to maintain the code"... I venture to say that QU is not for the SQL novice. Big surprise though: to QU or not to QU depends on the situation.
There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato