Working with large de-normalized dataset

Question

Post reply

Working with large de-normalized dataset

Kevin Dahl

Old Hand

Points: 380
More actions
February 22, 2011 at 9:57 am

#234808

Hi,
I'm trying to find a way to programmatically pivot a very large data set, the structure of which is:
CREATE TABLE [dbo].[f_Data](
[RecordID] [bigint] NOT NULL,
[RecordTypeID] [int] NOT NULL,
[DataVersionID] [int] NOT NULL,
[VariableName] [nvarchar](128) NOT NULL,
[VariableValue] [int] NULL,
CONSTRAINT [PK_f_Data_1] PRIMARY KEY CLUSTERED
(
[RecordID] ASC,
[VariableName] ASC
)
For each RecordID there are upwards of 2000 different variables that we would like to have a as columns, with a row for each RecordID. Now, I know I could do this using a PIVOT statement - but the prospect of typing out 2000 column names (for each different variable) isn't all that appealing. There's also an added difficulty in that there is a second table with additional data on the variables, such as the order that they should be listed in - this makes building a programmatic-yet-dynamic statement more difficult as well. Though I am not beyond doing that, I'm thinking there has to be an easier way (mainly because I can do exactly what I want in Excel with a few clicks). Any ideas?

Viewing 7 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply

Orlando Colamatteo SSC Guru Points: 182293 More actions · Answer 1

Building a PIVOT query dynamically is a viable option. Using your DDL I created a simple PIVOT query:

SELECT pvt.RecordID,

pvt.Variable1,

pvt.Variable2

FROM (SELECT RecordID,

VariableName,

VariableValue

FROM dbo.f_Data

) a PIVOT

( MAX(a.VariableValue) FOR a.VariableName IN (Variable1, Variable2) ) AS pvt ;

When building the SELECT and PIVOT column list joining to the external table you mentioned that supplies the column-order for the variable-columns is an extra step in building the SQL string you'll eventually execute however it will not be prohibitively complex.

There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato

LutzM SSC Guru Points: 107049 More actions · Answer 2

As an alternative to the PIVOT syntax you could have a look at the CrossTab and DynamicCrossTab article referenced in my signature.

It's not as easy as using Excel, but Excel isn't a RDBMS (at least not intentionally...).

Edit: stuff like PIVOT usually is done at the presentation layer (e.g. ReportingServices) since it violates normalization.

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

Orlando Colamatteo SSC Guru Points: 182293 More actions · Answer 3

Lutz has a point...dynamically generating a crosstab query would require approximately the same level of effort from you and would almost certainly perform better than a PIVOT. My code example as a crosstab:

SELECT RecordID,

SUM(CASE WHEN VariableName = 'Variable1' THEN VariableValue ELSE NULL END) AS [Variable1],

SUM(CASE WHEN VariableName = 'Variable2' THEN VariableValue ELSE NULL END) AS [Variable2]

FROM (SELECT RecordID,

VariableName,

VariableValue

FROM dbo.f_Data

) d

GROUP BY RecordID

There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato

Kevin Dahl Old Hand Points: 380 More actions · Answer 4

opc.three (2/22/2011)
Building a PIVOT query dynamically is a viable option. Using your DDL I created a simple PIVOT query:
SELECT pvt.RecordID,
pvt.Variable1,
pvt.Variable2
FROM (SELECT RecordID,
VariableName,
VariableValue
FROM dbo.f_Data
) a PIVOT
( MAX(a.VariableValue) FOR a.VariableName IN (Variable1, Variable2) ) AS pvt ;
When building the SELECT and PIVOT column list joining to the external table you mentioned that supplies the column-order for the variable-columns is an extra step in building the SQL string you'll eventually execute however it will not be prohibitively complex.

I have actually tried doing the pivot and building the IN clause dynamically (using this approach http://www.simple-talk.com/community/blogs/andras/archive/2007/09/14/37265.aspx), but it turns out that the column names actually end up overflowing the max varchar length. It does work fine for a subset of variables, but not the whole dataset. On top of that, I actually need to try and get a description field in the column name as well so the length is a big issue. The columns that will be included will likely be subject to change too - otherwise I would just write the big nasty query once and call it a day.

Orlando Colamatteo SSC Guru Points: 182293 More actions · Answer 5

Kevin Dahl (2/22/2011)
I have actually tried doing the pivot and building the IN clause dynamically (using this approach http://www.simple-talk.com/community/blogs/andras/archive/2007/09/14/37265.aspx), but it turns out that the column names actually end up overflowing the max varchar length.

Use NVARCHAR(MAX) instead of NVARCHAR(4000)

There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato

Kevin Dahl Old Hand Points: 380 More actions · Answer 6

opc.three (2/22/2011)
Kevin Dahl (2/22/2011)
I have actually tried doing the pivot and building the IN clause dynamically (using this approach http://www.simple-talk.com/community/blogs/andras/archive/2007/09/14/37265.aspx), but it turns out that the column names actually end up overflowing the max varchar length.
Use NVARCHAR(MAX) instead of NVARCHAR(4000)

Great tips, thanks guys... The dynamic crosstab was much easier/simpler to build programmatically, and the NVARCHAR(MAX) worked fine as well.

Ordinarily I would leave the summary of the data to the presentation layer, but this particular dataset is going into an OLAP cube.