XML vs Split for parsing of URL type data

  • I didn't want to muddy up someone else's topic, so here are the results of parsing URL data into discrete columns.

    The tally table (from http://www.sqlservercentral.com/scripts/Advanced+SQL/62486/):

    --===== Create and populate the Tally table on the fly
    SELECT TOP 11000 --equates to more than 30 years of dates
           IDENTITY(INT,1,1) AS N
      INTO dbo.Tally
      FROM Master.dbo.SysColumns sc1,
           Master.dbo.SysColumns sc2

    --===== Add a Primary Key to maximize performance
    ALTER TABLE dbo.Tally
      ADD CONSTRAINT PK_Tally_N
          PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100

    --===== Allow the general public to use it
    GRANT SELECT ON dbo.Tally TO PUBLIC

    The data table (provided by Jeff Moden):

    drop table #SourceTable

    SELECT TOP 1000000
           RowNum = IDENTITY(INT,1,1),
           u.Parameters
      INTO #SourceTable
      FROM (--===== Same two lines of data repeated
            SELECT CAST('Subject=Drilling&Category=Drill Header&Status=NO'+
                        '&IPPDM_Count=94356&Well_Count=84353&Zone_Status=NO'+
                        '&Prod_String_Count=95249&R_Master_W_Count=94353'+
                        '&BeginDate=2/3/2008&EndDate=2/5/2008' AS VARCHAR(8000)) AS Parameters
            UNION ALL
            SELECT 'Subject=Zone&Category=Zone Header&Status=YES'+
                   '&IPPDM_Count=94356&Well_Count=94356&Zone_Status=NO'+
                   '&Prod_String_Count=95249&R_Master_W_Count=94356&AddlZoneInterest=107674') u,
           Master.sys.All_Columns ac1,
           Master.sys.All_Columns ac2

    ALTER TABLE #SourceTable
      ADD PRIMARY KEY CLUSTERED (RowNum)

    Both the tally table and the data table were created in a database named staging.

    Split technique (from Jeff Moden):

    use staging
    go
    SET ANSI_WARNINGS OFF
    SET STATISTICS TIME ON
    go
    --===== Split the data into an EAV (Entity and Value) cte and reassemble as rows containing the desired columns
    ;WITH
    cteFirstSplit AS
    (--==== Splits data on the "&" character
     SELECT h.RowNum,
            ColNum = (ROW_NUMBER() OVER (ORDER BY Parameters)-1)%10,
            ColVal = SUBSTRING(h.Parameters, t.N+1, CHARINDEX('&', h.Parameters + '&', t.N+1)-t.N-1)
       FROM dbo.Tally t WITH (NOLOCK)
      RIGHT OUTER JOIN --Necessary in case Parameters is NULL
            #SourceTable h
         ON SUBSTRING(h.Parameters, t.N, 1) = '='
        --AND t.N < CHARINDEX('Zone_',h.Parameters)
        AND t.N < LEN(h.Parameters)
      --WHERE RowNum > 900000
      --WHERE RowNum <= 10000
    )
    SELECT RowNum,
           MAX(CASE WHEN ColNum = 0 THEN ColVal ELSE NULL END) AS Subject,
           MAX(CASE WHEN ColNum = 1 THEN ColVal ELSE NULL END) AS Category,
           MAX(CASE WHEN ColNum = 2 THEN ColVal ELSE NULL END) AS Status,
           MAX(CASE WHEN ColNum = 3 THEN ColVal ELSE NULL END) AS IPPDM_Count,
           MAX(CASE WHEN ColNum = 4 THEN ColVal ELSE NULL END) AS Well_Count/*,
           MAX(CASE WHEN ColNum = 5 THEN ColVal ELSE NULL END) AS Zone_Status,
           MAX(CASE WHEN ColNum = 6 THEN ColVal ELSE NULL END) AS Prod_String_Count,
           MAX(CASE WHEN ColNum = 7 THEN ColVal ELSE NULL END) AS R_Master_W_Count,
           MAX(CASE WHEN ColNum = 8 THEN ColVal ELSE NULL END) AS BeginDate,
           MAX(CASE WHEN ColNum = 9 THEN ColVal ELSE NULL END) AS EndDate */
      INTO #Results
      FROM cteFirstSplit
     GROUP BY RowNum

    print @@ROWCOUNT
    SET STATISTICS TIME OFF
    go

    I modified the script above to append an ampersand to h.Parameters, since the SUBSTRING would otherwise fail when BeginDate is omitted.
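
    To illustrate why the appended ampersand matters, here is a minimal, self-contained sketch (the @p and @n names are invented for the illustration and are not part of the test scripts above):

    -- Without a trailing '&', CHARINDEX finds no delimiter after the last value, returns 0,
    -- and the computed length goes negative, so SUBSTRING raises an error.
    -- Appending '&' to the searched string guarantees a delimiter is always found.
    DECLARE @p VARCHAR(8000), @n INT
    SET @p = 'Subject=Zone&Category=Zone Header' -- no trailing '&' in the data
    SET @n = 22                                  -- position of the '=' after "Category"

    SELECT SUBSTRING(@p, @n+1, CHARINDEX('&', @p + '&', @n+1) - @n - 1) AS ColVal -- 'Zone Header'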

    XML technique:

    use staging
    go
    SET ANSI_WARNINGS ON
    SET STATISTICS TIME ON
    SET STATISTICS IO ON
    SET QUOTED_IDENTIFIER ON
    go
    declare @xml XML

    set @xml = (
        select cast('<parm RowNum="'+ cast(RowNum as varchar(12))+ '" '+
                    replace(replace(Parameters,'=','="'),'&','" ')+ '"/>' as XML) as data
          from #SourceTable SourceTable --alias keeps the element name used by .nodes() below
        --where RowNum > 900000
        --where RowNum <= 10000
           for xml auto, type)

    select D.parm.value('@RowNum','int') as RowNum,
           D.parm.value('@Subject','varchar(255)') as [Subject],
           D.parm.value('@Category','varchar(255)') as Category,
           D.parm.value('@Status','varchar(255)') as [Status],
           D.parm.value('@IPPDM_Count','varchar(255)') as IPPDM_Count,
           D.parm.value('@Well_Count','varchar(255)') as Well_Count /*,
           D.parm.value('@Zone_Status','varchar(255)') as Zone_Status,
           D.parm.value('@Prod_String_Count','varchar(255)') as Prod_String_Count,
           D.parm.value('@R_Master_W_Count','varchar(255)') as R_Master_W_Count,
           D.parm.value('@BeginDate','varchar(255)') as BeginDate,
           D.parm.value('@EndDate','varchar(255)') as EndDate */
      into #results
      from @xml.nodes('/SourceTable/data/parm') as D(parm)
    go
    SET STATISTICS TIME OFF
    go

    I ran both scripts twice against datasets of 10K and 100K rows, extracting 5 fields and then 10 fields (fields 6-10 are commented out above, as are the WHERE clauses that set the various batch sizes).

    Results:

    [font="Courier New"]

    CPU Elapsed

    Rows Fields Run XML Split Diff XML Split Diff

    10K 5 1 3,625 1,359 62.5% 3,618 1,369 62.2%

    5 2 3,641 1,422 60.9% 3,631 1,409 61.2%

    100K 5 1 35,719 41,985 -17.5% 35,729 42,960 -20.2%

    5 2 35,719 41,686 -16.7% 35,733 42,655 -19.4%

    10K 10 1 4,891 1,750 64.2% 4,882 1,748 64.2%

    10 2 4,907 1,750 64.3% 4,910 1,741 64.5%

    100K 10 1 48,578 45,030 7.3% 48,595 43,770 9.9%

    10 2 48,609 45,563 6.3% 48,654 44,497 8.5%

    1M 10 1 486,813 466,952 4.1% 486,900 482,407 0.9%

    [/font]

    The Split technique wins 3 out of 4 tests, but its margin of victory was never more than 4 seconds. So I ran one more test with all 1M rows, and it's practically a dead heat.

    This was my second cut at a query using XML, and I'm pretty sure it could be sped up by eliminating the @xml intermediate variable or by extracting all of the attributes at once. Neither of the attempts below works, though; a sketch of one approach that does avoid the intermediate variable follows them.

    -- this doesn't work
    select D.parm ... into #results
      from (select cast(...) as XML) as data
              from SourceTable
               for xml auto, type).nodes('/SourceTable/data/parm') as D(parm)

    -- this doesn't work either, probably because it's an attribute collection rather than an element collection
    select *
      from @xml.nodes('/SourceTable/data/parm/@*') as D(RowNum, Subject, Category, ...)
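
    Here is that sketch (untested, and it assumes the #SourceTable layout above): cast the XML per row and read the attributes through CROSS APPLY .nodes(), so no @xml variable is needed.

    -- Sketch only: per-row XML cast plus CROSS APPLY .nodes()
    ;WITH cteXml AS
    (
        SELECT RowNum,
               CAST('<parm ' + REPLACE(REPLACE(Parameters,'=','="'),'&','" ') + '"/>' AS XML) AS ParmXml
          FROM #SourceTable
    )
    SELECT c.RowNum,
           p.parm.value('@Subject',    'varchar(255)') AS [Subject],
           p.parm.value('@Category',   'varchar(255)') AS Category,
           p.parm.value('@Status',     'varchar(255)') AS [Status],
           p.parm.value('@IPPDM_Count','varchar(255)') AS IPPDM_Count,
           p.parm.value('@Well_Count', 'varchar(255)') AS Well_Count
      FROM cteXml c
     CROSS APPLY c.ParmXml.nodes('/parm') AS p(parm)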

    One huge difference between the two techniques is that the XML technique is not position-sensitive -- it extracts the fields by name, regardless of their order in the string. The XML query is also easier to grasp and should be easier to maintain.

    So Jeff, I will buy you a beer, but only a Bud Light (it's practically water anyway). 😉

  • Antonio:

    My tests of XML vs. a Numbers/Tally table have shown similar CPU-time results, but have had interesting IO differences. Did you include IO stats in your runs? (If not, is that something you could run again to test?)

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • (I have to confess I wasn't as courteous as Antonio, and posted my tests on Numbers/Tally vs XML on the original thread. That's here: http://www.sqlservercentral.com/Forums/Topic478171-338-2.aspx)

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Sorry folks... dunno how I missed this one.

    Antonio... I sure do appreciate the attention to detail you put into the tests you did and the reports that followed. Absolutely awesome job!

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Another XML-style approach...

    drop table #Results

    ;with a as (
        select RowNum,
               cast('<parm '+ replace(replace(Parameters,'=','="'),'&','" ')+ '"/>' as xml) as Xml
          from #SourceTable
    )
    select RowNum,
           Xml.value('(parm[1]/@Subject)[1]', 'varchar(255)') as Subject,
           Xml.value('(parm[1]/@Category)[1]', 'varchar(255)') as Category,
           Xml.value('(parm[1]/@Status)[1]', 'varchar(255)') as Status,
           Xml.value('(parm[1]/@IPPDM_Count)[1]', 'varchar(255)') as IPPDM_Count,
           Xml.value('(parm[1]/@Well_Count)[1]', 'varchar(255)') as Well_Count
      into #Results
      from a

    This seems to run much faster for me, but perhaps I'm misunderstanding the problem?

    Ryan Randall

    Solutions are easy. Understanding the problem, now, that's the hard part.

  • Jeff Moden (4/17/2008)


    Sorry folks... dunno how I missed this one.

    Antonio... I sure do appreciate the attention to detail you put into the tests you did and the reports that followed. Absolutely awesome job!

    Actually, I like your method for figuring out the element names in the first place:

    SELECT distinct
           ColName = substring(h.Parameters,
                               t.N - (charindex('&', reverse('&' + left(h.Parameters, t.N - 1))) - 1),
                               charindex('&', reverse('&' + left(h.Parameters, t.N - 1))) - 1)
      FROM dbo.Tally t WITH (NOLOCK)
     RIGHT OUTER JOIN --Necessary in case Parameters is NULL
           #SourceTable h
        ON SUBSTRING(h.Parameters, t.N, 1) = '='
       AND t.N < len(h.Parameters)

    This runs pretty quickly for me.

    (Yes, it's very inelegant. If anyone knows a more direct way to find the last occurrence of a character *before* a given position in a string, I'm very interested.)

    I tried making the XQuery expression passed to .value() dynamic, but it insists on a string literal. So, if you want to be completist about it, I think you have to go with dynamic SQL.
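
    A rough sketch of what that dynamic SQL could look like (the #ColNames table and the variable names here are invented for the illustration; it assumes the @xml variable and node path from the XML script earlier in the thread):

    -- Sketch only: build one .value() call per discovered attribute name, then execute
    DECLARE @sql NVARCHAR(MAX)

    SELECT @sql = ISNULL(@sql + ',' + CHAR(10), '')
                + 'D.parm.value(''@' + ColName + ''', ''varchar(255)'') AS [' + ColName + ']'
      FROM #ColNames -- assumed: the DISTINCT ColName results above, saved into a temp table

    SET @sql = 'SELECT D.parm.value(''@RowNum'',''int'') AS RowNum, ' + CHAR(10)
             + @sql + CHAR(10)
             + 'FROM @xml.nodes(''/SourceTable/data/parm'') AS D(parm)'

    EXEC sp_executesql @sql, N'@xml xml', @xml = @xml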

  • RyanRandall (5/22/2008)


    Another XML-style approach...

    drop table #Results

    ;with a as (
        select RowNum,
               cast('<parm '+ replace(replace(Parameters,'=','="'),'&','" ')+ '"/>' as xml) as Xml
          from #SourceTable
    )
    select RowNum,
           Xml.value('(parm[1]/@Subject)[1]', 'varchar(255)') as Subject,
           Xml.value('(parm[1]/@Category)[1]', 'varchar(255)') as Category,
           Xml.value('(parm[1]/@Status)[1]', 'varchar(255)') as Status,
           Xml.value('(parm[1]/@IPPDM_Count)[1]', 'varchar(255)') as IPPDM_Count,
           Xml.value('(parm[1]/@Well_Count)[1]', 'varchar(255)') as Well_Count
      into #Results
      from a

    This seems to run much faster for me, but perhaps I'm misunderstanding the problem?

    Nice... I'll throw that into a million row test and check. Thanks, Ryan.

    --Jeff Moden


