Normalize data using Pivot WITHOUT using Aggregate function

Question

Normalize data using Pivot WITHOUT using Aggregate function

Forum Etiquette: How to post Reporting Services problems

[/url]

Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
[/url]

How to Post Performance Problems - by Gail Shaw
[/url]

Jeff Moden SSC Guru Points: 1004701 More actions · Answer 1

mister.magoo (12/30/2009)
For the sake of it, a solution which copes with new "ColumnName" values by using dynamic sql (go ahead - pork chop me or whatever you do :-D)

Heh... nah... no pork chops for you. I don't believe that dynamic SQL is all that bad a thing. Take a look at the following... I even published that it's a good thing when done correctly.

http://www.sqlservercentral.com/articles/Crosstab/65048/

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004701 More actions · Answer 2

lobbymuncher (12/30/2009)
Another overcomplicated effort from the peanut gallery.

I like it. Nicely done.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004701 More actions · Answer 3

Jeff Moden (12/30/2009)
lobbymuncher (12/30/2009)
Another overcomplicated effort from the peanut gallery.
I like it. Nicely done.

Ack... I may have to take that back. The code has a triangular join in it that causes 60 internal rows to be read from one of the 4 appearances of the table in the execution plan. It's not quite as bad as a full blown triangular join in that it reads each row once for each group where as a real triangular join at the row level would be devastating here.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004701 More actions · Answer 4

Justin,

Please post the table creation statement and be sure to include any indexes you may have on the table.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004701 More actions · Answer 5

Jeff Moden (12/30/2009)
Jeff Moden (12/30/2009)
lobbymuncher (12/30/2009)
Another overcomplicated effort from the peanut gallery.
I like it. Nicely done.
Ack... I may have to take that back. The code has a triangular join in it that causes 60 internal rows to be read from one of the 4 appearances of the table in the execution plan. It's not quite as bad as a full blown triangular join in that it reads each row once for each group where as a real triangular join at the row level would be devastating here.

Nah... it's good. The WHERE clause keeps it from being a full "Triangular Join" but only if you have an index on ID.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004701 More actions · Answer 6

Man, in the presence of a clustered index on the ID, that's some mighty fast code there, Chris.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

ChrisM@home SSC-Insane Points: 24260 More actions · Answer 7

Jeff Moden (12/30/2009)
Man, in the presence of a clustered index on the ID, that's some mighty fast code there, Chris.

:blush: thanks Jeff. I read your triangular join article!

[font="Arial"]^{Low-hanging fruit picker and defender of the moggies}[/font]

For better assistance in answering your questions, please read this[/url].

Understanding and using APPLY, (I)[/url] and (II)[/url] Paul White[/url]

Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]

jjbear11 Default port Points: 1497 More actions · Answer 8

mister.magoo (12/30/2009)
For the sake of it, a solution which copes with new "ColumnName" values by using dynamic sql (go ahead - pork chop me or whatever you do :-D)
If you don't like dynamic SQL , then you could just use this to generate the query as and when the data requires a change....just change the "exec sp_executesql @sql" line to "print @sql"....
--= Dump the data into a temp table as this uses dynamic sql and needs a "proper" table to read.
IF OBJECT_ID('tempdb..#school') IS NULL
SELECT ID, ColumnName, [Text]
INTO #school
FROM @school
DECLARE @sql NVARCHAR(4000)
SET @sql = ''
;WITH base(ID,ColName,Text,ColNum)
AS
(
--= Get the data into some kind of order with the ID from the '[DW].[DimClass]' row for each set of data as the key
--= And a column number to ascertain the correct sequence for the columns later
SELECT s1.ID, Details.ColName, Details.Text , ROW_NUMBER() OVER(PARTITION BY s1.ID ORDER BY Details.ID) AS ColNum
FROM #school s1
OUTER APPLY (
--= Get the associated rows for the current group/ID
SELECT ID,ColumnName,[Text]
FROM #school s2
WHERE s2.ID>s1.ID
AND s2.ID < ISNULL(
(
--= Get the ID of the next group of data
SELECT TOP 1 ID
FROM #school s3
WHERE s3.ID>s1.ID AND s3.ColumnName = 'DestinationTableName'
ORDER BY s3.ID
),
(
--= Get the last row of data when there are no more groups
SELECT MAX(ID) FROM #school
)
)
) AS Details(ID,ColName,TEXT)
WHERE s1.ColumnName = 'DestinationTableName'
), cols(Name,POSITION)
AS
(
--= Get the distinct ColumnName values and positions (assuming the data will always place them in order in the first place)
SELECT ColName ,MAX(ColNum)
FROM base
GROUP BY ColName
), colList(list)
AS
(
--= Use FOR XML PATH('') trick to get a list of columns
SELECT STUFF((
SELECT ', '+QUOTENAME(Name,'[')
FROM cols
ORDER BY POSITION
FOR XML PATH('')),1,1,'')
)
SELECT @sql=list
FROM colList
--= Now build a dynamic SQL to PIVOT the data
SET @sql = N'
;WITH base(BASEID,ColName,TEXT)
AS
(
SELECT s1.ID, Details.ColName, Details.Text
FROM #school s1
OUTER APPLY (
SELECT ID,ColumnName,[Text]
FROM #school s2
WHERE s2.ID>s1.ID
AND s2.ID < ISNULL(
(
SELECT TOP 1 ID
FROM #school s3
WHERE s3.ID>s1.ID AND s3.ColumnName = ''DestinationTableName''
ORDER BY s3.ID
),
(
SELECT MAX(ID) FROM #school
)
)
) AS Details(ID,ColName,TEXT)
WHERE s1.ColumnName = ''DestinationTableName''
)
SELECT BASEID,' + @sql + '
FROM base
PIVOT (MAX([Text]) FOR ColName IN ('+@sql +')) AS PVT
ORDER BY BASEID'
EXEC sp_executesql @sql
IF NOT (OBJECT_ID('tempdb..#school') IS NULL)
DROP TABLE #school
I make no claims for performance as I have not tried to optimise it, but it demonstrates a technique that could be used...
MM

This query did exactly what I needed! However, the execution time is HORRIFIC -- 1 day, 5 hours, and 5 minutes to execute. For the record, the table on which I'm operating this query has appx 900,000 rows which are being pivoted out into 46 possible columns. The indexes that I have on the @school table are below:

create NONClustered index idx1 on @school (columnname)

create Unique clustered index idx2 on @school (ID)

Are there any improvements that you can think of which I could incorporate into this query? I'm not that familiar with indexes, but have at least a basic understanding of them.

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 9

Hi Justin

This solution has a triangular join, notorious for scaling poorly - but over a day! Blimey. The solution I posted earlier doesn't, so may be worth a try. If I were in your shoes, I'd hard-code those 46 potential columns because it will take less time than dicking around with dynamic sql.

Cheers

ChrisM

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

LutzM SSC Guru Points: 107049 More actions · Answer 10

Justin James (1/4/2010)
mister.magoo (12/30/2009)
For the sake of it, a solution which copes with new "ColumnName" values by using dynamic sql (go ahead - pork chop me or whatever you do :-D)
If you don't like dynamic SQL , then you could just use this to generate the query as and when the data requires a change....just change the "exec sp_executesql @sql" line to "print @sql"....
...
I make no claims for performance as I have not tried to optimise it, but it demonstrates a technique that could be used...
MM
This query did exactly what I needed! However, the execution time is HORRIFIC -- 1 day, 5 hours, and 5 minutes to execute. For the record, the table on which I'm operating this query has appx 900,000 rows which are being pivoted out into 46 possible columns. The indexes that I have on the @school table are below:
create NONClustered index idx1 on @school (columnname)
create Unique clustered index idx2 on @school (ID)
Are there any improvements that you can think of which I could incorporate into this query? I'm not that familiar with indexes, but have at least a basic understanding of them.

Is there any specific reason why you didn't use lobbymunchers or my solution?

Both are supposed to perform pretty good. Even better with clustered index on the ID, as Jeff mentioned in an earlier post.

It makes me wondering why you go for a solution that had a performance disclaimer attached in the first place...

As a rough duration estimate (depending on the system and, of course, index applied as mentioned before) we should talk about seconds, not hours, or even DAYS!!!

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

mister.magoo SSC-Forever Points: 47174 More actions · Answer 11

I agree that is HORRENDOUS!

If you can get away with a fixed query then do so.(One of the other examples)

If you have to cater for dynamic columns then let us know and someone will come up with a quick scalable method.

(I am busy right now but have some ideas)

MM

select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

jjbear11 Default port Points: 1497 More actions · Answer 12

After working with MM's query, I was able to get results to come back in 36 seconds. Below is the final SQL that I was able to get working:

--= Dump the data into a temp table as this uses dynamic sql and needs a "proper" table to read.

IF OBJECT_ID('tempdb..#school') IS NULL

SELECT UniqueID as [ID], ColumnName, [Text]

INTO #school

FROM @school

order by uniqueid

create NONClustered index idx1 on #school (columnname)

create Unique clustered index idx2 on #school (ID)

update statistics #school

DECLARE @sql NVARCHAR(4000)

SET @sql = ''

;WITH base(ID,ColName,Text,ColNum)

AS

(

--= Get the data into some kind of order with the ID from the 'DestinationTableName' row for each set of data as the key

--= And a column number to ascertain the correct sequence for the columns later

SELECT s1.ID, Details.ColName, Details.Text , ROW_NUMBER() OVER(PARTITION BY s1.ID ORDER BY Details.ID) AS ColNum

FROM #school s1

OUTER APPLY (

--= Get the associated rows for the current group/ID

SELECT ID,ColumnName,[Text]

FROM #school s2

WHERE s2.ID>s1.ID

AND s2.ID < (

(

--= Get the ID of the next group of data

SELECT TOP 1 isnull(ID,(SELECT max(ID) from #school))

FROM #school s3

WHERE s3.ID>s1.ID AND s3.ColumnName = 'DestinationTableName'

ORDER BY s3.ID

)

) AS Details(ID,ColName,TEXT)

WHERE s1.ColumnName = 'DestinationTableName'

), cols(Name,POSITION)

AS

(

--= Get the distinct ColumnName values and positions (assuming the data will always place them in order in the first place)

SELECT ColName ,MAX(ColNum)

FROM base

GROUP BY ColName

), colList(list)

AS

(

--= Use FOR XML PATH('') trick to get a list of columns

SELECT STUFF((

SELECT ', '+QUOTENAME(Name,'[')

FROM cols

ORDER BY POSITION

FOR XML PATH('')),1,1,'')

)

SELECT @sql=list

FROM colList

--= Now build a dynamic SQL to PIVOT the data

SET @sql = N'

;WITH base(BASEID,ColName,TEXT)

AS

(

SELECT s1.ID, Details.ColName, Details.Text

FROM #school s1

OUTER APPLY (

SELECT ID,ColumnName,[Text]

FROM #school s2

WHERE s2.ID>s1.ID

AND s2.ID < (

(

SELECT TOP 1 isnull(ID,(select max(ID) from #school))

FROM #school s3

WHERE s3.ID>s1.ID AND s3.ColumnName = ''DestinationTableName''

ORDER BY s3.ID

)

) AS Details(ID,ColName,TEXT)

WHERE s1.ColumnName = ''DestinationTableName''

)

SELECT BASEID,' + @sql + '

FROM base

PIVOT (MAX([Text]) FOR ColName IN ('+@sql +')) AS PVT

ORDER BY BASEID'

--print @sql

EXEC sp_executesql @sql

IF NOT (OBJECT_ID('tempdb..#school') IS NULL)

DROP TABLE #school

The only changes that I made were:

1. Added indexes to the #school table to increase performance

create nonclustered index idx1 on #school (columnname)

create Unique clustered index idx2 on #school (ID)

2. Modified MM's original query, removing the ISNULL function, and incorporating it into the subquery.

MANY MANY thanks for all of the help that you all have provided!! These forums are an indispensable source of information! 😀

Jeff Moden SSC Guru Points: 1004701 More actions · Answer 13

The reason why it took so long is because, in the presence of a triangular join, it must spawn over 405,000,450,000 rows. To put it more succinctly, it spawned .4 TERA rows!!!!!!! I'm surprised it didn't take a week. 😉

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 14

Jeff Moden (1/4/2010)
The reason why it took so long is because, in the presence of a triangular join, it must spawn over 405,000,450,000 rows. To put it more succinctly, it spawned .4 TERA rows!!!!!!! I'm surprised it didn't take a week. 😉

Some people are so reluctant to learn. Here Jeff, take this large bag of frozen pork chops.

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

mister.magoo SSC-Forever Points: 47174 More actions · Answer 15

Justin,

I think you may still have some work to do as the change you made will be excluding the final grouping - I think...

moving the ISNULL the way you have will not solve the problem of finding the records belonging to the final group.

(

--= Get the ID of the next group of data

SELECT TOP 1 isnull(ID,(SELECT max(ID) from #school))

FROM #school s3

WHERE s3.ID>s1.ID AND s3.ColumnName = 'DestinationTableName'

ORDER BY s3.ID

)

DOES NOT return the same information as

ISNULL(

(

--= Get the ID of the next group of data

SELECT TOP 1 ID

FROM #school s3

WHERE s3.ID>s1.ID AND s3.ColumnName = 'DestinationTableName'

ORDER BY s3.ID

),

(

--= Get the last row of data when there are no more groups

SELECT MAX(ID) FROM #school

)

But, if you just declare a variable and select the MAX(ID) into that outside the main query, you should get the same saving...

-- outside the CTE:

DECLARE @MaxID BIGINT

SELECT @MaxID=MAX(ID) FROM #school

-- inside - where you made your change

ISNULL(

(

--= Get the ID of the next group of data

SELECT TOP 1 ID

FROM #school s3

WHERE s3.ID>s1.ID AND s3.ColumnName = 'DestinationTableName'

ORDER BY s3.ID

),

(

--= Get the last row of data when there are no more groups

@MaxID

)

There should be further savings possible later....

MM

select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);