Splitting Data from One Column into 5 Columns

  • Jeff Moden (4/1/2008)


    Ok... here's my entry for the virtual beer...

    Looking forward to your entry 😉

    No fair working at night... my wife won't let me get geeky at night.

  • antonio.collins (4/2/2008)


    No fair working at night... my wife won't let me get geeky at night.

    Heh... understood... I keep telling mine "How can I miss you if you won't go away?" 😀 She thinks it's better than me playing pool or bowling because she knows where I am and I don't drink at home. 2 out of 3 ain't bad but I'm not giving up the "WTFs" for anyone, if you know what I mean 😛

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • antonio.collins (4/2/2008)


    Jeff Moden (4/1/2008)


    Ok... here's my entry for the virtual beer...

    Looking forward to your entry 😉

    No fair working at night... my wife won't let me get geeky at night.

    I'd respond to that...but none of the answers are PG!!!

    😀

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part... unless you're my manager... or a director and above... or a really loud-spoken end-user... All right - what was my emergency again?

  • I created test data as follows:

    create table ParserTest (
        ID int identity primary key,
        String varchar(max))
    go

    insert into dbo.parsertest (string)
    SELECT 'Subject=Drilling&Category=Drill Header&Status=NO&IPPDM_Count=94356'
         + '&Well_Count=84353&Zone_Status=NO&Prod_String_Count=95249&R_Master_W_Count=94353&'
         + 'BeginDate=2/3/2008&EndDate=2/5/2008'
    from common.dbo.bignumbers

    insert into dbo.parsertest (string)
    SELECT 'Subject=Zone&Category=Zone Header&Status=YES&IPPDM_Count=94356&Well_Count='
         + '94356&Zone_Status=NO&Prod_String_Count=95249&R_Master_W_Count=94356&AddlZoneInterest'
         + '=107674'
    from common.dbo.bignumbers

    Then I tested a Numbers table solution. I used the PIVOT operator instead of Jeff's approach, but it's otherwise quite similar:

    set statistics io on
    set statistics time on

    ;with
    Parsed as
        (select id,
                SUBSTRING(String + '&', number, CHARINDEX('&', String + '&', number) - number) as Parsed,
                row_number() over (order by number) as Row
         FROM common.dbo.numbers
         inner join dbo.parsertest
            on number <= LEN(String)
           and SUBSTRING('&' + String, number, 1) = '&'),
    EAV (ID, Col, Val) as
        (select id,
                left(parsed, charindex('=', parsed) - 1) Col,
                reverse(left(reverse(parsed), charindex('=', reverse(parsed)) - 1)) Val
         from parsed)
    select ID, Subject, Category, Status, IPPDM_Count,
           Well_Count, Zone_Status, Prod_String_Count, BeginDate, EndDate
    from (select id, col, val
          from eav) sub
    pivot (max(val)
           for col in (Subject, Category, Status, IPPDM_Count,
                       Well_Count, Zone_Status, Prod_String_Count, BeginDate, EndDate)) pvt

    It's 2M rows (instead of the 1M Jeff proposed), and I got these results:

    =============================
    (2000002 row(s) affected)

    Table 'Numbers'. Scan count 2000002, logical reads 6000006, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'ParserTest'. Scan count 1, logical reads 47907, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:
       CPU time = 4781359 ms, elapsed time = 4807388 ms.
    =============================

    Next is an XML parser test. (Separate post, these take WAY too long to run on this many rows on this machine.)

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • XML Function test:

    ALTER function [dbo].[StringParserXML]
        (@String_in varchar(max),
         @Delimiter_in char(1))
    returns @Parsed table (
        Parsed varchar(100))
    as
    -- This one is faster than StringParser, but it doesn't correctly handle
    -- XML-specific characters, such as "<" or "&". StringParser will handle those
    -- without difficulty. (One common workaround is sketched after this function.)
    begin
        declare @XML xml

        -- wrap each delimited element in <i></i> tags so that .nodes('//i') can shred them
        select @xml = '<i>' + replace(@String_in, @Delimiter_in, '</i><i>') + '</i>'

        insert into @parsed (parsed)
        select x.i.value('.', 'varchar(100)')
        from @xml.nodes('//i') x(i)

        return
    end
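
    One common way around the '<'/'&' limitation noted in that comment is to entitize the string (and the delimiter) with FOR XML PATH('') before the tags are wrapped around it. A minimal sketch, purely illustrative and separate from the function the tests below actually call:

    declare @String_in varchar(max)
    declare @Delimiter_in char(1)
    declare @Encoded varchar(max)
    declare @EncodedDelim varchar(10)
    declare @XML xml

    select @String_in = 'Name=A<B&Count=10', @Delimiter_in = '&'

    -- entitize the raw string and the delimiter so XML-special characters survive the
    -- cast to XML ('<' becomes '&lt;', '&' becomes '&amp;', and so on)
    select @Encoded      = (select @String_in    for xml path(''))
    select @EncodedDelim = (select @Delimiter_in for xml path(''))

    -- split on the entitized delimiter, then shred exactly as before
    select @XML = '<i>' + replace(@Encoded, @EncodedDelim, '</i><i>') + '</i>'

    select x.i.value('.', 'varchar(100)') as Parsed
    from @XML.nodes('//i') x(i)

    With '&' as the delimiter, the entitized delimiter becomes '&amp;', so data containing '<' or '>' no longer breaks the cast to XML.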

    The test:

    set statistics io on
    set statistics time on

    ;with
    Parsed as
        (select id, parsed
         FROM dbo.parsertest
         cross apply common.dbo.stringparserxml(string, '&')),
    EAV (ID, Col, Val) as
        (select id,
                left(parsed, charindex('=', parsed) - 1) Col,
                reverse(left(reverse(parsed), charindex('=', reverse(parsed)) - 1)) Val
         from parsed)
    select ID, Subject, Category, Status, IPPDM_Count,
           Well_Count, Zone_Status, Prod_String_Count, BeginDate, EndDate
    from (select id, col, val
          from eav) sub
    pivot (max(val)
           for col in (Subject, Category, Status, IPPDM_Count,
                       Well_Count, Zone_Status, Prod_String_Count, BeginDate, EndDate)) pvt

    The results:

    ==============================
    SQL Server parse and compile time:
       CPU time = 13 ms, elapsed time = 13 ms.

    (2000002 row(s) affected)

    Table '#5B4453CB'. Scan count 2, logical reads 3000003, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'ParserTest'. Scan count 1, logical reads 47907, physical reads 14999, read-ahead reads 288, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:
       CPU time = 220125 ms, elapsed time = 227719 ms.
    ==============================

    The Numbers table version I used took over an hour (4781359 ms), while the XML version took just under 4 minutes (220125 ms). The XML version also had a HUGE advantage in terms of scan counts, which matters on a system with a lot of I/O bottlenecks.

    Either I'm doing something horribly wrong in the Numbers table version, or the XML totally kicked the Numbers table version's ***.

    Jeff, please tell me what I got wrong on the Numbers table version. I'm too tired (long day) and possibly just not seeing it right.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Just in case the Cross Apply is somehow more efficient than a straight up join (which is ridiculous on the face of it), I also tested:

    ALTER function [dbo].[StringParser]
        (@String_in varchar(max),
         @Delimiter_in char(1))
    returns table
    as
    return(
        SELECT top 100 percent
               SUBSTRING(@String_in + @Delimiter_in, number,
                         CHARINDEX(@Delimiter_in, @String_in + @Delimiter_in, number) - number) as Parsed,
               row_number() over (order by number) as Row
        FROM numbers
        WHERE number <= LEN(@String_in)
          AND SUBSTRING(@Delimiter_in + @String_in, number, 1) = @Delimiter_in
        ORDER BY number
    )

    With this as the test:

    ;with
    Parsed as
        (select id, parsed
         FROM dbo.parsertest
         cross apply common.dbo.stringparser(string, '&')),
    EAV (ID, Col, Val) as
        (select id,
                left(parsed, charindex('=', parsed) - 1) Col,
                reverse(left(reverse(parsed), charindex('=', reverse(parsed)) - 1)) Val
         from parsed)
    select ID, Subject, Category, Status, IPPDM_Count,
           Well_Count, Zone_Status, Prod_String_Count, BeginDate, EndDate
    from (select id, col, val
          from eav) sub
    pivot (max(val)
           for col in (Subject, Category, Status, IPPDM_Count,
                       Well_Count, Zone_Status, Prod_String_Count, BeginDate, EndDate)) pvt

    I didn't get results, because I killed it after it had been running for a bit over 8 minutes, since the XML test took less than 4 minutes.

    (One interesting thing in all of these tests is that I got the first rows back almost instantly. I would have thought the PIVOT operator would be a "hidden RBAR" that would build a result set in tempdb and then run through it to pivot the results, but apparently it doesn't work that way.)

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Hi Jeff,

    Your code works fine as per my original request. There is an error on my part in the requirement; I apologize. Apparently there are rows that are NULL, rows with fewer than 5 parameter items (e.g. Subject, Category, Status), and rows with more than 5 parameter items, which I failed to mention. So extracting the first 5 items from the parameter text would not work in my case. I probably need to search by item name with CHARINDEX (Subject=, Category=, Status=, etc.) and then extract the values after the '=' (e.g. Drilling, Drill Header, Yes, 4567, 4567). A minimal sketch of that name-based lookup appears after the scripts below.

    Below are the table scripts and sample data for my SOURCE table, and also the expected format of the RESULT set I'm trying to produce. Let me know if you need additional info.

    Thanks again!

    -Tash

    --===== If the test source table already exists, drop it
    IF OBJECT_ID('tempdb..#SourceTable') IS NOT NULL
        DROP TABLE #SourceTable

    --===== Create the test table
    SET ANSI_NULLS ON
    GO
    SET QUOTED_IDENTIFIER ON
    GO
    CREATE TABLE #SourceTable(
        [ReportKey]  [int] IDENTITY(1,1) PRIMARY KEY CLUSTERED,
        [Parameters] [ntext] COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
        [TimeStart]  [datetime] NOT NULL
    )
    GO

    --===== Allow explicit inserts into the IDENTITY column
    SET IDENTITY_INSERT #SourceTable ON

    --===== Insert the test data into the test table
    INSERT INTO #SourceTable
           (ReportKey, Parameters, TimeStart)
    SELECT '2181','Subject=Zone&Category=Address&CountTable_B=NO&Zone_Count=95220&R_Master_Zone=95200','Feb 12 2008 5:18PM' UNION ALL
    SELECT '2923','Subject=Drilling&Category=Drill Header&Status=YES&Well_Count=94356&Zone_Status=NO&Prod_String_Count=95249&R_Master_W_Count=94356','Mar 10 2008 2:20PM' UNION ALL
    SELECT '1284','Subject=Zone&Category=Address&CountTable_B=NO&Zone_Count=95229&R_Master_Zone=95224','Jan 20 2008 6:20PM' UNION ALL
    SELECT '2513','Division=1&District=105&Date=03/28/2008 17:35:08','Mar 28 2008 5:35PM' UNION ALL
    SELECT '134',NULL,'Mar 28 2008 5:35PM' UNION ALL
    SELECT '2169','begindate=03/11/2008 00:00:00&enddate=03/11/2008 00:00:00','Mar 11 2008 3:42PM' UNION ALL
    SELECT '254','Subject=Zone&Category=Zone Header&Status=YES&IPPDM_Count=94356&Well_Count=94356&Zone_Status=NO&Prod_String_Count=95249&R_Master_W_Count=94356&AddlZoneInterest=107674','Mar 28 2008 5:35PM' UNION ALL
    SELECT '468','Category=Drill Header&Status=NO&IPPDM_Count=9435&Well_Count=8433&Zone_Status=NO&Prod_String_Count=95249&R_Master_W_Count=94353&BeginDate=2/3/2008&EndDate=2/5/2008','Mar 28 2008 5:35PM' UNION ALL
    SELECT '182','Division=1&Date=01/14/2008 09:38:27&District=94','Jan 15 2008 10:38PM' UNION ALL
    SELECT '112','Subject=Zone&Category=Address&IPPDM_Count=94356&Well_Count=94356','Jan 21 2008 7:20PM' UNION ALL
    SELECT '9854',NULL,'Mar 28 2008 5:35PM' UNION ALL
    SELECT '167','WellID=616311&reportdate=1/1/2008 1:30:00 PM','Jan 7 2008 10:05AM' UNION ALL
    SELECT '2523','Subject=Drilling&Status=YES&IPPDM_Count=94356&Well_Count=94356&Zone_Status=NO&Prod_String_Count=95249&R_Master_W_Count=94356','Mar 21 2008 3:20PM' UNION ALL
    SELECT '654',NULL,'Mar 28 2008 5:35PM' UNION ALL
    SELECT '2526','Division=2&District=99&Date=03/15/2008 06:35:01','Mar 15 2008 6:35AM' UNION ALL
    SELECT '1285','Date=01/13/2008 21:50:09','Jan 13 2008 9:50PM' UNION ALL
    SELECT '1326',NULL,'Mar 28 2008 5:35PM' UNION ALL
    SELECT '2563','WellID=617521&WellName:isnull=true','Mar 26 2008 8:26AM' UNION ALL
    SELECT '192','Subject=Zone&Category=Address&IPPDM_Count=94356&Zone_Status=NO','Dec 10 2007 12:20PM'

    --===== Set the identity insert back to normal
    SET IDENTITY_INSERT #SourceTable OFF

    ----=========== Output table ============
    --===== If the test output table already exists, drop it
    IF OBJECT_ID('tempdb..#ResultTable') IS NOT NULL
        DROP TABLE #ResultTable

    --===== Create the test output table
    SET ANSI_NULLS ON
    GO
    SET QUOTED_IDENTIFIER ON
    GO
    CREATE TABLE #ResultTable(
        [ReportKey]   [int] IDENTITY(1,1) PRIMARY KEY CLUSTERED,
        [Subject]     [nvarchar] (200) NULL,
        [Category]    [nvarchar] (200) NULL,
        [Status]      [nvarchar] (200) NULL,
        [Ippdm_Count] [nvarchar] (200) NULL,
        [Well_Count]  [nvarchar] (200) NULL,
        [TimeStart]   [datetime] NOT NULL
    )
    GO

    ---------------------- The result set should look like this; below is sample output data ----------------------
    --===== Allow explicit inserts into the IDENTITY column
    SET IDENTITY_INSERT #ResultTable ON

    --===== Insert the test OUTPUT data into the test table
    INSERT INTO #ResultTable
           (ReportKey, Subject, Category, Status, Ippdm_Count, Well_Count, TimeStart)
    SELECT '2181','Zone','Address','NO',NULL,'9510','Feb 12 2008 5:18PM' UNION ALL
    SELECT '1284','Zone',NULL,'NO','95229','95224','Jan 20 2008 6:20PM' UNION ALL
    SELECT '134',NULL,NULL,NULL,NULL,NULL,'Mar 28 2008 5:35PM' UNION ALL
    SELECT '2923',NULL,'Drill Header','YES','94356','94356','Mar 10 2008 2:20PM' UNION ALL
    SELECT '9854',NULL,NULL,NULL,NULL,NULL,'Mar 28 2008 5:35PM' UNION ALL
    SELECT '192','Zone','Address',NULL,'94356','94356','Dec 10 2007 12:20PM'

    --===== Set the identity insert back to normal
    SET IDENTITY_INSERT #ResultTable OFF
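
    A minimal sketch of the name-based lookup described above, using the #SourceTable just created. The dbo.GetParm helper and its signature are hypothetical (not from the thread), purely for illustration:

    --===== Hypothetical helper (illustrative only): returns the value that follows '<name>='
    --      in an '&'-delimited parameter string, or NULL when the name is not present.
    --      The leading '&' keeps e.g. 'Status' from matching inside 'Zone_Status'.
    CREATE FUNCTION dbo.GetParm (@Parms varchar(max), @Name varchar(128))
    RETURNS table
    AS
    RETURN
        SELECT CASE WHEN CHARINDEX('&' + @Name + '=', '&' + @Parms) > 0
                    THEN SUBSTRING(@Parms,
                                   CHARINDEX('&' + @Name + '=', '&' + @Parms) + LEN(@Name) + 1,
                                   CHARINDEX('&', @Parms + '&', CHARINDEX('&' + @Name + '=', '&' + @Parms))
                                       - CHARINDEX('&' + @Name + '=', '&' + @Parms) - LEN(@Name) - 1)
               END AS Val
    GO

    --===== Usage sketch: pull a couple of the named items per report (repeat the APPLY
    --      for Status, IPPDM_Count, Well_Count, etc. to fill #ResultTable)
    SELECT s.ReportKey,
           subj.Val AS [Subject],
           cat.Val  AS Category,
           s.TimeStart
    FROM (SELECT ReportKey, CAST(Parameters AS varchar(max)) AS Parms, TimeStart
          FROM #SourceTable) s
    CROSS APPLY dbo.GetParm(s.Parms, 'Subject')  subj
    CROSS APPLY dbo.GetParm(s.Parms, 'Category') cat

    NULL Parameters rows simply come back with NULL values, since the helper always returns one row.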

    declare @xml XML

    -- convert all row data to an XML element
    set @xml = (
        select cast('<parm ReportKey="' + cast(ReportKey as varchar(12))
                + '" TimeStart="' + cast(TimeStart as varchar(32)) + '" '
                + replace(replace(Parameters,'=','="'),'&','" ') + '"/>' as XML) as data
        from #SourceTable
        for xml auto, type)

    select D.parm.value('@ReportKey','int') as ReportKey,
           D.parm.value('@Subject','varchar(255)') as [Subject],
           D.parm.value('@Category','varchar(255)') as Category,
           D.parm.value('@Status','varchar(255)') as [Status],
           D.parm.value('@IPPDM_Count','varchar(255)') as IPPDM_Count,
           D.parm.value('@Well_Count','varchar(255)') as Well_Count,
           D.parm.value('@TimeStart','smalldatetime')
    from @xml.nodes('/SourceTable/data/parm') as D(parm)
    go

    see http://www.sqlservercentral.com/Forums/Topic478799-338-1.aspx for more info.

  • GSquared (4/2/2008)


    Jeff, please tell me what I got wrong on the Numbers table version

    At a high level, yes... You split on the "&" instead of the "=" which forces you into a second CTE. That and the REVERSE function (3 times) is very expensive.
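
    For illustration only (this may or may not be the approach Jeff has in mind): once each 'name=value' pair has been isolated, the value can be taken directly with CHARINDEX and SUBSTRING, with no REVERSE at all. A minimal sketch against the Parsed CTE from Gus's Numbers-table test:

    -- sketch: everything after the first '=' is the value, so the three REVERSE calls
    -- can be replaced by a single CHARINDEX/SUBSTRING (assumes the Parsed CTE above)
    select id,
           left(parsed, charindex('=', parsed) - 1)                    as Col,
           substring(parsed, charindex('=', parsed) + 1, len(parsed))  as Val
    from Parsed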

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Antonio,

    Your solution doesn't work with the test data I created. Please post your test data generator and the result times you got.

    Gus, you're splitting out more than requested and you haven't printed any of the result times.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (4/2/2008)


    Antonio,

    Your solution doesn't work with the test data I created. Please post your test data generator and the result times you got.

    Sorry. The Parameters column became ntext since the last time I looked. Also, I was using a physical SourceTable, not a temp table. Anyway, simple enough to correct. I also added some comments since the multiple replaces can get confusing. The block below works perfectly with the sample data script posted earlier.

    declare @xml XML

    -- convert all row data to an XML element
    set @xml = (
        select cast('<parm ReportKey="' + cast(ReportKey as varchar(12))
                + '" TimeStart="' + cast(TimeStart as varchar(32)) + '" '
                + isnull(
                      replace(
                          replace(
                              replace( cast(Parameters as varchar(max)),
                                  '=', '="'),          -- add leading quote to value
                              '&', '" ') + '"',        -- remove leading ampersand and add trailing quote to value
                          ':isnull=', '_isnull='),     -- handle bitfield; suffix acts like a namespace
                      '')                              -- handle null parameters
                + '/>' as XML) as data                 -- replace XML with varchar(max) to see the constructed set
        from #SourceTable
        for xml raw('SourceData'))                     -- must use 'raw(tag)' if using temp table; 'auto' uses table name as parent element

    select D.parm.value('@ReportKey','int') as ReportKey,
           D.parm.value('@Subject','varchar(255)') as [Subject],
           D.parm.value('@Category','varchar(255)') as Category,
           D.parm.value('@Status','varchar(255)') as [Status],
           D.parm.value('@IPPDM_Count','varchar(255)') as IPPDM_Count,
           D.parm.value('@Well_Count','varchar(255)') as Well_Count,
           D.parm.value('@TimeStart','smalldatetime') as TimeStart
    from @xml.nodes('/SourceData/data/parm') as D(parm)    -- .nodes() parameter is an XPath selector

  • Jeff Moden (4/2/2008)


    GSquared (4/2/2008)


    Jeff, please tell me what I got wrong on the Numbers table version

    At a high level, yes... You split on the "&" instead of the "=" which forces you into a second CTE. That and the REVERSE function (3 times) is very expensive.

    Yeah, but the XML version is doing exactly the same thing in those regards, and finished in 1/20th the time of the Numbers table version. If those were what was causing the delay, both would be slowed down.

    I'm trying to figure out how the Numbers table split ended up so slow compared to the XML split. I think it's all the scans, but I'm not sure how to eliminate those. That's what I'm looking for help on.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Jeff Moden (4/2/2008)


    Antonio,

    Your solution doesn't work with the test data I created. Please post your test data generator and the result times you got.

    Gus, you're splitting out more than requested and you haven't printed any of the result times.

    I included result times in both tests. Look on page 2 of this thread.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

    Many, many thanks, Antonio. Your XML solution gets me the expected result. I'd also like to thank GSquared and Jeff for their contributions.

    -Lalon
