Splitting Data from One Column into 5 Columns

  • Jeff Moden (4/1/2008)


    Ok... here's my entry for the virtual beer...

    Looking forward to your entry 😉

    No fair working at night... my wife won't let me get geeky at night.

  • antonio.collins (4/2/2008)


    No fair working at night... my wife won't let me get geeky at night.

    Heh... understood... I keep telling mine "How can I miss you if you won't go away?" 😀 She thinks it's better than me playing pool or bowling because she knows where I am and I don't drink at home. 2 out of 3 ain't bad but I'm not giving up the "WTFs" for anyone, if you know what I mean 😛

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • antonio.collins (4/2/2008)


    Jeff Moden (4/1/2008)


    Ok... here's my entry for the virtual beer...

    Looking forward to your entry 😉

    No fair working at night... my wife won't let me get geeky at night.

    I'd respond to that...but none of the answers are PG!!!

    😀

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part... unless you're my manager... or a director and above... or a really loud-spoken end-user... All right - what was my emergency again?

  • I created test data as follows:

    create table ParserTest (
        ID int identity primary key,
        String varchar(max))
    go

    insert into dbo.parsertest (string)
    SELECT 'Subject=Drilling&Category=Drill Header&Status=NO&IPPDM_Count=94356'
         + '&Well_Count=84353&Zone_Status=NO&Prod_String_Count=95249&R_Master_W_Count=94353&'
         + 'BeginDate=2/3/2008&EndDate=2/5/2008'
    from common.dbo.bignumbers

    insert into dbo.parsertest (string)
    SELECT 'Subject=Zone&Category=Zone Header&Status=YES&IPPDM_Count=94356&Well_Count='
         + '94356&Zone_Status=NO&Prod_String_Count=95249&R_Master_W_Count=94356&AddlZoneInterest'
         + '=107674'
    from common.dbo.bignumbers

    Then I tested a Numbers table solution. I used the PIVOT operator instead of Jeff's approach, but it's otherwise quite similar:

    set statistics io on
    set statistics time on

    ;with
    Parsed as
        (select id,
                SUBSTRING(String + '&', number, CHARINDEX('&', String + '&', number) - number) as Parsed,
                row_number() over (order by number) as Row
         FROM common.dbo.numbers
         inner join dbo.parsertest
            on number <= LEN(String)
           and SUBSTRING('&' + String, number, 1) = '&'),
    EAV (ID, Col, Val) as
        (select id,
                left(parsed, charindex('=', parsed) - 1) Col,
                reverse(left(reverse(parsed), charindex('=', reverse(parsed)) - 1)) Val
         from parsed)
    select ID, Subject, Category, Status, IPPDM_Count,
           Well_Count, Zone_Status, Prod_String_Count, BeginDate, EndDate
    from (select id, col, val
          from eav) sub
    pivot (max(val)
           for col in (Subject, Category, Status, IPPDM_Count,
                       Well_Count, Zone_Status, Prod_String_Count, BeginDate, EndDate)) pvt

    It's 2M rows (instead of the 1M Jeff proposed), and I got these results:

    =============================
    (2000002 row(s) affected)

    Table 'Numbers'. Scan count 2000002, logical reads 6000006, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'ParserTest'. Scan count 1, logical reads 47907, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:
       CPU time = 4781359 ms, elapsed time = 4807388 ms.
    =============================

    Next is an XML parser test. (Separate post, these take WAY too long to run on this many rows on this machine.)

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • XML Function test:

    ALTER function [dbo].[StringParserXML]
        (@String_in varchar(max),
         @Delimiter_in char(1))
    returns @Parsed table (
        Parsed varchar(100))
    as
    -- This one is faster than StringParser, but it doesn't correctly handle
    -- XML-specific characters, such as "<" or "&". StringParser will handle those
    -- without difficulty. (One common workaround is sketched after this function.)
    begin
        declare @XML xml

        -- wrap each delimited element in <i></i> tags so that .nodes('//i') can shred them
        select @xml = '<i>' + replace(@String_in, @Delimiter_in, '</i><i>') + '</i>'

        insert into @parsed (parsed)
        select x.i.value('.', 'varchar(100)')
        from @xml.nodes('//i') x(i)

        return
    end
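
    One common way around the '<'/'&' limitation noted in that comment is to entitize the string (and the delimiter) with FOR XML PATH('') before the tags are wrapped around it. A minimal sketch, purely illustrative and separate from the function the tests below actually call:

    declare @String_in varchar(max)
    declare @Delimiter_in char(1)
    declare @Encoded varchar(max)
    declare @EncodedDelim varchar(10)
    declare @XML xml

    select @String_in = 'Name=A<B&Count=10', @Delimiter_in = '&'

    -- entitize the raw string and the delimiter so XML-special characters survive the
    -- cast to XML ('<' becomes '&lt;', '&' becomes '&amp;', and so on)
    select @Encoded      = (select @String_in    for xml path(''))
    select @EncodedDelim = (select @Delimiter_in for xml path(''))

    -- split on the entitized delimiter, then shred exactly as before
    select @XML = '<i>' + replace(@Encoded, @EncodedDelim, '</i><i>') + '</i>'

    select x.i.value('.', 'varchar(100)') as Parsed
    from @XML.nodes('//i') x(i)

    With '&' as the delimiter, the entitized delimiter becomes '&amp;', so data containing '<' or '>' no longer breaks the cast to XML.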

    The test:

    set statistics io on
    set statistics time on

    ;with
    Parsed as
        (select id, parsed
         FROM dbo.parsertest
         cross apply common.dbo.stringparserxml(string, '&')),
    EAV (ID, Col, Val) as
        (select id,
                left(parsed, charindex('=', parsed) - 1) Col,
                reverse(left(reverse(parsed), charindex('=', reverse(parsed)) - 1)) Val
         from parsed)
    select ID, Subject, Category, Status, IPPDM_Count,
           Well_Count, Zone_Status, Prod_String_Count, BeginDate, EndDate
    from (select id, col, val
          from eav) sub
    pivot (max(val)
           for col in (Subject, Category, Status, IPPDM_Count,
                       Well_Count, Zone_Status, Prod_String_Count, BeginDate, EndDate)) pvt

    The results:

    ==============================
    SQL Server parse and compile time:
       CPU time = 13 ms, elapsed time = 13 ms.

    (2000002 row(s) affected)

    Table '#5B4453CB'. Scan count 2, logical reads 3000003, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'ParserTest'. Scan count 1, logical reads 47907, physical reads 14999, read-ahead reads 288, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:
       CPU time = 220125 ms, elapsed time = 227719 ms.
    ==============================

    The Numbers table version I used took over an hour (4781359 ms), while the XML version took just under 4 minutes (220125 ms). The XML version also had a HUGE advantage in terms of scan counts, which matters on a system with a lot of I/O bottlenecks.

    Either I'm doing something horribly wrong in the Numbers table version, or the XML totally kicked the Numbers table version's ***.

    Jeff, please tell me what I got wrong on the Numbers table version. I'm too tired (long day) and possibly just not seeing it right.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Just in case the Cross Apply is somehow more efficient than a straight up join (which is ridiculous on the face of it), I also tested:

    ALTER function [dbo].[StringParser]
        (@String_in varchar(max),
         @Delimiter_in char(1))
    returns table
    as
    return(
        SELECT top 100 percent
               SUBSTRING(@String_in + @Delimiter_in, number,
                         CHARINDEX(@Delimiter_in, @String_in + @Delimiter_in, number) - number) as Parsed,
               row_number() over (order by number) as Row
        FROM numbers
        WHERE number <= LEN(@String_in)
          AND SUBSTRING(@Delimiter_in + @String_in, number, 1) = @Delimiter_in
        ORDER BY number
    )

    With this as the test:

    ;with
    Parsed as
        (select id, parsed
         FROM dbo.parsertest
         cross apply common.dbo.stringparser(string, '&')),
    EAV (ID, Col, Val) as
        (select id,
                left(parsed, charindex('=', parsed) - 1) Col,
                reverse(left(reverse(parsed), charindex('=', reverse(parsed)) - 1)) Val
         from parsed)
    select ID, Subject, Category, Status, IPPDM_Count,
           Well_Count, Zone_Status, Prod_String_Count, BeginDate, EndDate
    from (select id, col, val
          from eav) sub
    pivot (max(val)
           for col in (Subject, Category, Status, IPPDM_Count,
                       Well_Count, Zone_Status, Prod_String_Count, BeginDate, EndDate)) pvt

    I didn't get results, because I killed it after it had been running for a bit over 8 minutes, since the XML test took less than 4 minutes.

    (One interesting thing in all of these tests is that I got the first rows back almost instantly. I would have thought the PIVOT operator would be a "hidden RBAR" that would build a result set in tempdb and then run through it to pivot the results, but apparently it doesn't work that way.)

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Hi Jeff,

    Your code works fine as per my original request. There is an error on my part in the requirement; I apologize. Apparently there are rows that are NULL, rows with fewer than 5 parameter items (e.g. Subject, Category, Status), and rows with more than 5 parameter items, which I failed to mention. So extracting the first 5 items from the parameter text would not work in my case. I probably need to search by item name with CHARINDEX (Subject=, Category=, Status=, etc.) and then extract the values after the '=' (e.g. Drilling, Drill Header, Yes, 4567, 4567). A minimal sketch of that name-based lookup appears after the scripts below.

    Below are the table scripts and sample data for my SOURCE table, and also the expected format of the RESULT set I'm trying to produce. Let me know if you need additional info.

    Thanks again!

    -Tash

    --===== If the test source table already exists, drop it
    IF OBJECT_ID('tempdb..#SourceTable') IS NOT NULL
        DROP TABLE #SourceTable

    --===== Create the test table
    SET ANSI_NULLS ON
    GO
    SET QUOTED_IDENTIFIER ON
    GO
    CREATE TABLE #SourceTable(
        [ReportKey]  [int] IDENTITY(1,1) PRIMARY KEY CLUSTERED,
        [Parameters] [ntext] COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
        [TimeStart]  [datetime] NOT NULL
    )
    GO

    --===== Allow explicit inserts into the IDENTITY column
    SET IDENTITY_INSERT #SourceTable ON

    --===== Insert the test data into the test table
    INSERT INTO #SourceTable
           (ReportKey, Parameters, TimeStart)
    SELECT '2181','Subject=Zone&Category=Address&CountTable_B=NO&Zone_Count=95220&R_Master_Zone=95200','Feb 12 2008 5:18PM' UNION ALL
    SELECT '2923','Subject=Drilling&Category=Drill Header&Status=YES&Well_Count=94356&Zone_Status=NO&Prod_String_Count=95249&R_Master_W_Count=94356','Mar 10 2008 2:20PM' UNION ALL
    SELECT '1284','Subject=Zone&Category=Address&CountTable_B=NO&Zone_Count=95229&R_Master_Zone=95224','Jan 20 2008 6:20PM' UNION ALL
    SELECT '2513','Division=1&District=105&Date=03/28/2008 17:35:08','Mar 28 2008 5:35PM' UNION ALL
    SELECT '134',NULL,'Mar 28 2008 5:35PM' UNION ALL
    SELECT '2169','begindate=03/11/2008 00:00:00&enddate=03/11/2008 00:00:00','Mar 11 2008 3:42PM' UNION ALL
    SELECT '254','Subject=Zone&Category=Zone Header&Status=YES&IPPDM_Count=94356&Well_Count=94356&Zone_Status=NO&Prod_String_Count=95249&R_Master_W_Count=94356&AddlZoneInterest=107674','Mar 28 2008 5:35PM' UNION ALL
    SELECT '468','Category=Drill Header&Status=NO&IPPDM_Count=9435&Well_Count=8433&Zone_Status=NO&Prod_String_Count=95249&R_Master_W_Count=94353&BeginDate=2/3/2008&EndDate=2/5/2008','Mar 28 2008 5:35PM' UNION ALL
    SELECT '182','Division=1&Date=01/14/2008 09:38:27&District=94','Jan 15 2008 10:38PM' UNION ALL
    SELECT '112','Subject=Zone&Category=Address&IPPDM_Count=94356&Well_Count=94356','Jan 21 2008 7:20PM' UNION ALL
    SELECT '9854',NULL,'Mar 28 2008 5:35PM' UNION ALL
    SELECT '167','WellID=616311&reportdate=1/1/2008 1:30:00 PM','Jan 7 2008 10:05AM' UNION ALL
    SELECT '2523','Subject=Drilling&Status=YES&IPPDM_Count=94356&Well_Count=94356&Zone_Status=NO&Prod_String_Count=95249&R_Master_W_Count=94356','Mar 21 2008 3:20PM' UNION ALL
    SELECT '654',NULL,'Mar 28 2008 5:35PM' UNION ALL
    SELECT '2526','Division=2&District=99&Date=03/15/2008 06:35:01','Mar 15 2008 6:35AM' UNION ALL
    SELECT '1285','Date=01/13/2008 21:50:09','Jan 13 2008 9:50PM' UNION ALL
    SELECT '1326',NULL,'Mar 28 2008 5:35PM' UNION ALL
    SELECT '2563','WellID=617521&WellName:isnull=true','Mar 26 2008 8:26AM' UNION ALL
    SELECT '192','Subject=Zone&Category=Address&IPPDM_Count=94356&Zone_Status=NO','Dec 10 2007 12:20PM'

    --===== Set the identity insert back to normal
    SET IDENTITY_INSERT #SourceTable OFF

    ----=========== Output table ============
    --===== If the test output table already exists, drop it
    IF OBJECT_ID('tempdb..#ResultTable') IS NOT NULL
        DROP TABLE #ResultTable

    --===== Create the test output table
    SET ANSI_NULLS ON
    GO
    SET QUOTED_IDENTIFIER ON
    GO
    CREATE TABLE #ResultTable(
        [ReportKey]   [int] IDENTITY(1,1) PRIMARY KEY CLUSTERED,
        [Subject]     [nvarchar] (200) NULL,
        [Category]    [nvarchar] (200) NULL,
        [Status]      [nvarchar] (200) NULL,
        [Ippdm_Count] [nvarchar] (200) NULL,
        [Well_Count]  [nvarchar] (200) NULL,
        [TimeStart]   [datetime] NOT NULL
    )
    GO

    ---------------------- The result set should look like this; below is sample output data ----------------------
    --===== Allow explicit inserts into the IDENTITY column
    SET IDENTITY_INSERT #ResultTable ON

    --===== Insert the test OUTPUT data into the test table
    INSERT INTO #ResultTable
           (ReportKey, Subject, Category, Status, Ippdm_Count, Well_Count, TimeStart)
    SELECT '2181','Zone','Address','NO',NULL,'9510','Feb 12 2008 5:18PM' UNION ALL
    SELECT '1284','Zone',NULL,'NO','95229','95224','Jan 20 2008 6:20PM' UNION ALL
    SELECT '134',NULL,NULL,NULL,NULL,NULL,'Mar 28 2008 5:35PM' UNION ALL
    SELECT '2923',NULL,'Drill Header','YES','94356','94356','Mar 10 2008 2:20PM' UNION ALL
    SELECT '9854',NULL,NULL,NULL,NULL,NULL,'Mar 28 2008 5:35PM' UNION ALL
    SELECT '192','Zone','Address',NULL,'94356','94356','Dec 10 2007 12:20PM'

    --===== Set the identity insert back to normal
    SET IDENTITY_INSERT #ResultTable OFF
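
    A minimal sketch of the name-based lookup described above, using the #SourceTable just created. The dbo.GetParm helper and its signature are hypothetical (not from the thread), purely for illustration:

    --===== Hypothetical helper (illustrative only): returns the value that follows '<name>='
    --      in an '&'-delimited parameter string, or NULL when the name is not present.
    --      The leading '&' keeps e.g. 'Status' from matching inside 'Zone_Status'.
    CREATE FUNCTION dbo.GetParm (@Parms varchar(max), @Name varchar(128))
    RETURNS table
    AS
    RETURN
        SELECT CASE WHEN CHARINDEX('&' + @Name + '=', '&' + @Parms) > 0
                    THEN SUBSTRING(@Parms,
                                   CHARINDEX('&' + @Name + '=', '&' + @Parms) + LEN(@Name) + 1,
                                   CHARINDEX('&', @Parms + '&', CHARINDEX('&' + @Name + '=', '&' + @Parms))
                                       - CHARINDEX('&' + @Name + '=', '&' + @Parms) - LEN(@Name) - 1)
               END AS Val
    GO

    --===== Usage sketch: pull a couple of the named items per report (repeat the APPLY
    --      for Status, IPPDM_Count, Well_Count, etc. to fill #ResultTable)
    SELECT s.ReportKey,
           subj.Val AS [Subject],
           cat.Val  AS Category,
           s.TimeStart
    FROM (SELECT ReportKey, CAST(Parameters AS varchar(max)) AS Parms, TimeStart
          FROM #SourceTable) s
    CROSS APPLY dbo.GetParm(s.Parms, 'Subject')  subj
    CROSS APPLY dbo.GetParm(s.Parms, 'Category') cat

    NULL Parameters rows simply come back with NULL values, since the helper always returns one row.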

    declare @xml XML

    -- convert all row data to an XML element
    set @xml = (
        select cast('<parm ReportKey="' + cast(ReportKey as varchar(12))
                + '" TimeStart="' + cast(TimeStart as varchar(32)) + '" '
                + replace(replace(Parameters,'=','="'),'&','" ') + '"/>' as XML) as data
        from #SourceTable
        for xml auto, type)

    select D.parm.value('@ReportKey','int') as ReportKey,
           D.parm.value('@Subject','varchar(255)') as [Subject],
           D.parm.value('@Category','varchar(255)') as Category,
           D.parm.value('@Status','varchar(255)') as [Status],
           D.parm.value('@IPPDM_Count','varchar(255)') as IPPDM_Count,
           D.parm.value('@Well_Count','varchar(255)') as Well_Count,
           D.parm.value('@TimeStart','smalldatetime')
    from @xml.nodes('/SourceTable/data/parm') as D(parm)
    go

    see http://www.sqlservercentral.com/Forums/Topic478799-338-1.aspx for more info.

  • GSquared (4/2/2008)


    Jeff, please tell me what I got wrong on the Numbers table version

    At a high level, yes... You split on the "&" instead of the "=" which forces you into a second CTE. That and the REVERSE function (3 times) is very expensive.
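
    For illustration only (this may or may not be the approach Jeff has in mind): once each 'name=value' pair has been isolated, the value can be taken directly with CHARINDEX and SUBSTRING, with no REVERSE at all. A minimal sketch against the Parsed CTE from Gus's Numbers-table test:

    -- sketch: everything after the first '=' is the value, so the three REVERSE calls
    -- can be replaced by a single CHARINDEX/SUBSTRING (assumes the Parsed CTE above)
    select id,
           left(parsed, charindex('=', parsed) - 1)                    as Col,
           substring(parsed, charindex('=', parsed) + 1, len(parsed))  as Val
    from Parsed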

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Antonio,

    Your solution doesn't work with the test data I created. Please post your test data generator and the result times you got.

    Gus, you're splitting out more than requested and you haven't printed any of the result times.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (4/2/2008)


    Antonio,

    Your solution doesn't work with the test data I created. Please post your test data generator and the result times you got.

    Sorry. The Parameters column became ntext since the last time I looked. Also, I was using a physical SourceTable, not a temp table. Anyway, simple enough to correct. I also added some comments since the multiple replaces can get confusing. The block below works perfectly with the sample data script posted earlier.

    declare @xml XML

    -- convert all row data to an XML element
    set @xml = (
        select cast('<parm ReportKey="' + cast(ReportKey as varchar(12))
                + '" TimeStart="' + cast(TimeStart as varchar(32)) + '" '
                + isnull(
                      replace(
                          replace(
                              replace( cast(Parameters as varchar(max)),
                                  '=', '="'),          -- add leading quote to value
                              '&', '" ') + '"',        -- remove leading ampersand and add trailing quote to value
                          ':isnull=', '_isnull='),     -- handle bitfield; suffix acts like a namespace
                      '')                              -- handle null parameters
                + '/>' as XML) as data                 -- replace XML with varchar(max) to see the constructed set
        from #SourceTable
        for xml raw('SourceData'))                     -- must use 'raw(tag)' if using temp table; 'auto' uses table name as parent element

    select D.parm.value('@ReportKey','int') as ReportKey,
           D.parm.value('@Subject','varchar(255)') as [Subject],
           D.parm.value('@Category','varchar(255)') as Category,
           D.parm.value('@Status','varchar(255)') as [Status],
           D.parm.value('@IPPDM_Count','varchar(255)') as IPPDM_Count,
           D.parm.value('@Well_Count','varchar(255)') as Well_Count,
           D.parm.value('@TimeStart','smalldatetime') as TimeStart
    from @xml.nodes('/SourceData/data/parm') as D(parm)    -- .nodes() parameter is an XPath selector

  • Jeff Moden (4/2/2008)


    GSquared (4/2/2008)


    Jeff, please tell me what I got wrong on the Numbers table version

    At a high level, yes... You split on the "&" instead of the "=" which forces you into a second CTE. That and the REVERSE function (3 times) is very expensive.

    Yeah, but the XML version is doing exactly the same thing in those regards, and finished in 1/20th the time of the Numbers table version. If those were what was causing the delay, both would be slowed down.

    I'm trying to figure out how the Numbers table split ended up so slow compared to the XML split. I think it's all the scans, but I'm not sure how to eliminate those. That's what I'm looking for help on.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Jeff Moden (4/2/2008)


    Antonio,

    Your solution doesn't work with the test data I created. Please post your test data generator and the result times you got.

    Gus, you're splitting out more than requested and you haven't printed any of the result times.

    I included result times in both tests. Look on page 2 of this thread.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

    Many, many thanks, Antonio. Your XML solution gets me the expected result. I'd also like to thank GSquared and Jeff for their contributions.

    -Lalon
