OPENXML: 11MB XML file: is 25 seconds to parse and load normal?

  • I have a stored procedure which is taking about 25 seconds to parse an 11MB XML document and load the data into table variables. I'm trying to figure out if this is normal.

    The specs:

    XML is all attribute-centric

    XML has 8 main nodes which are loaded into 8 table variables

    the largest node has 33,000 records with 50 columns to check for (I didn't design that)

    Total number of records is 44,000

    2 table variables have 50 columns

I basically do this eight times:

INSERT @table_var
SELECT
    col1,
    ...coln
FROM OPENXML (@idoc, '/namespace:mainnode/namespace:nodeitem')
WITH (
    col1 datatype,
    ...coln datatype
)

Does 25-28 seconds sound normal? This process was created to improve performance, but the C# code it's supposed to replace can iterate through the XML (it's transformed into an object) and make a series of 44,000 stored procedure calls (single-row inserts) faster than SQL Server can load the XML.

I haven't posted the code because I would have to modify everything (NDA). I can do that if necessary. My main question concerns the time required and whether that sounds normal for the amount of data and the XML size.
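For reference, here is the skeleton of the pattern with placeholder names (the namespace URI and columns are illustrative, not the actual code). sp_xml_preparedocument takes the namespace declaration as its third argument, which is what the prefixed paths rely on:

DECLARE @idoc int, @xml_text nvarchar(max)

SET @xml_text = N'<ns:mainnode xmlns:ns="urn:example">
    <ns:nodeitem col1="A" col2="1" />
    <ns:nodeitem col1="B" col2="2" />
</ns:mainnode>'

-- Bind the "ns" prefix so the row pattern below can use it
EXEC sp_xml_preparedocument @idoc OUTPUT, @xml_text, '<root xmlns:ns="urn:example"/>'

SELECT col1, col2
FROM OPENXML (@idoc, '/ns:mainnode/ns:nodeitem', 1) -- flags = 1: attribute-centric
WITH (
    col1 varchar(10),
    col2 int
)

-- Release the handle; the parsed document stays in memory until this runs
EXEC sp_xml_removedocument @idoc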

  • Try to use the "nodes" method of the xml datatype (see BOL). You can also use an "xml schema collection" for the xml data.

Thanks. I tried the nodes method as well, but the performance was even worse. I would like to try the nodes method with a schema collection (which I've read can improve things drastically for nodes), but our schemas come from C# and are not entirely SQL Server compliant. But I'm still wondering if 25 seconds sounds reasonable for the scenario described...

8kb (9/21/2009) wrote:
"But I'm wondering if 25 seconds sounds reasonable for the scenario described..."

    As usual: it depends.

A few questions (aside from the standard ones about server load, available CPU, and so on):

How many levels does the XML structure have?

What's your general process to split the data? (Are you 'climbing up the node levels' or do you use CROSS APPLY?)

What I would do is load the file into a temp table with an xml column, add an XML index, and use XQuery with CROSS APPLY to split the data.

That shouldn't take 25 seconds.



    Lutz
    A pessimist is an optimist with experience.


  • Laptop specs are as follows:

    SQL Server Express SP 3

    Dell Latitude E6400

    3.5 GB RAM

Core Duo 2.53 GHz

The XML structure follows the Root/Items/ListN/ListN_item pattern used in the paths below.

The method of getting the data is to go through each main node and get all of its sub-nodes (one level below):

    Example:

INSERT @List1
SELECT
    col1,
    ...
    coln
FROM OPENXML (@doc_id, 'Root/Items/List1/List1_item', 1)
WITH (
    col1 datatype,
    ...
    coln datatype
)

INSERT @List2
SELECT
    col1,
    ...
    coln
FROM OPENXML (@doc_id, 'Root/Items/List2/List2_Item', 1)
WITH (
    col1 datatype,
    ...
    coln datatype
)

and so on...

    The big performance hit is when you get to List4. There are fifty columns and it takes 10-13 seconds just to extract the data.

    There is no load on the machine while the tests are running except for basic background processes in XP.

CROSS APPLY sounds interesting. Actually, we're going to roll back to the original method of running straight inserts from C# because it's so much faster. But I'm wondering if OPENXML slows down the process every time you do an extraction, especially with bigger files.
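One way to rule that out (a sketch with made-up names and a toy document; substitute the real lists and WITH clauses): the document is parsed once by sp_xml_preparedocument, so timing the parse and each extraction separately shows whether the cost sits in the one-time parse or in the repeated column retrieval.

DECLARE @doc_id int, @start datetime, @xml_text nvarchar(max)
SET @xml_text = N'<Root><Items><List1><List1_item col1="A" col2="1"/></List1></Items></Root>'

SET @start = getdate()
EXEC sp_xml_preparedocument @doc_id OUTPUT, @xml_text
SELECT 'parse' AS step, datediff(ms, @start, getdate()) AS elapsed_ms   -- one-time parse cost

SET @start = getdate()
SELECT col1, col2
FROM OPENXML (@doc_id, 'Root/Items/List1/List1_item', 1)
WITH (col1 varchar(10), col2 int)
SELECT 'List1' AS step, datediff(ms, @start, getdate()) AS elapsed_ms   -- per-extraction cost

EXEC sp_xml_removedocument @doc_id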

  • I'm still not sure what your query looks like.

The way you posted your sample data, it appears to be a query against just one level.

If you're staying within one node, CROSS APPLY wouldn't be necessary from my point of view.

    The following code should work and shouldn't take too much time.

It would be a totally different story if you had to climb up and down the nodes, or across parallel nodes. But there's just not enough information...

SELECT
    c.value('.','VARCHAR(10)') AS List1
FROM @xml.nodes('ROOT/ITEMS/LIST1/LIST1_ITEM') AS T(c)
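For the multi-level case, chaining nodes() with CROSS APPLY keeps each child row tied to its parent. A minimal sketch (element and attribute names are made up):

DECLARE @xml xml
SET @xml = N'<Root>
    <List name="L1"><Item val="a"/><Item val="b"/></List>
    <List name="L2"><Item val="c"/></List>
</Root>'

SELECT
    p.value('@name', 'varchar(10)') AS list_name,  -- attribute from the parent level
    c.value('@val', 'varchar(10)') AS item_val     -- attribute from the child level
FROM @xml.nodes('Root/List') AS L(p)
CROSS APPLY p.nodes('Item') AS I(c)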



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • Just a question - do you get any better performance loading into a temp table?

    Atlantis Interactive - SQL Server Tools

  • The use of a temp table vs. table variable seems to make no difference.

    So here is the code. At least this code will build an XML object that is very close to what we get in our stored procedure.

-- Create temp version of table that exists in database. It has 50+ columns.
DECLARE @t TABLE (
    col_01 varchar(6), col_02 varchar(7), col_03 varchar(3), col_04 varchar(1), col_05 varchar(4),
    col_06 varchar(11), col_07 varchar(1), col_08 varchar(14), col_09 varchar(12), col_10 varchar(2),
    col_11 varchar(1), col_12 varchar(2), col_13 varchar(5), col_14 numeric(38, 0), col_15 numeric(38, 0),
    col_16 numeric(38, 0), col_17 numeric(38, 0), col_18 numeric(38, 0), col_19 varchar(5), col_20 varchar(3),
    col_21 varchar(5), col_22 varchar(10), col_23 varchar(10), col_24 varchar(12), col_25 datetime,
    col_26 datetime, col_27 varchar(4), col_28 varchar(4), col_29 varchar(4), col_30 varchar(4), col_31 varchar(4),
    col_32 varchar(21), col_33 varchar(11), col_34 varchar(6), col_35 varchar(2), col_36 varchar(6), col_37 varchar(5),
    col_38 varchar(1), col_39 varchar(6), col_40 varchar(7), col_41 varchar(11), col_42 varchar(2), col_43 varchar(3),
    col_44 varchar(1), col_45 varchar(12), col_46 varchar(3), col_47 varchar(3), col_48 varchar(5), col_49 varchar(17),
    col_50 varchar(25), col_51 numeric(38, 0), col_52 varchar(10), col_53 varchar(10), col_54 varchar(13)
)

-- Insert sample row of data. Most columns are not populated during the imports.
INSERT @t (
    col_01, col_02, col_04, col_05, col_06,
    col_08, col_09, col_12, col_13, col_14,
    col_15, col_16, col_17, col_18, col_26,
    col_33, col_39, col_40, col_41, col_51 )
SELECT
    'ABCDEF', '123456', 'A', 2590, '11089121',
    'ABCDEFGHIKLMNO', 'PAD', 'RC', 7102, 0,
    0, 0, 0, 0, '2009-04-02 20:38:50.000',
    '11089121', 'ABCDEF', '702396', '12064078', 0

-- Create XML object that has 33,000 rows of the sample data. This is comparable in size to the document I import.
DECLARE @x xml
SET @x = (
    SELECT *
    FROM (
        SELECT NULL AS x
    ) ItemList
    CROSS JOIN (
        SELECT t.*
        FROM @t t
        CROSS JOIN (SELECT TOP 33000 v1.* FROM master..spt_values v1 CROSS JOIN master..spt_values v2) v
    ) Item
    FOR XML AUTO
)

-- Run the code I was attempting to use in order to extract the data. It takes around 10-15 seconds to get everything back.
DECLARE
    @doc_id int,
    @time datetime

SET @time = getdate()

EXEC sp_xml_preparedocument @doc_id output, @x

SELECT *
FROM OPENXML(@doc_id, '//ItemList/Item')
WITH (
    col_01 varchar(6), col_02 varchar(7), col_03 varchar(3), col_04 varchar(1), col_05 varchar(4),
    col_06 varchar(11), col_07 varchar(1), col_08 varchar(14), col_09 varchar(12), col_10 varchar(2),
    col_11 varchar(1), col_12 varchar(2), col_13 varchar(5), col_14 numeric(38, 0), col_15 numeric(38, 0),
    col_16 numeric(38, 0), col_17 numeric(38, 0), col_18 numeric(38, 0), col_19 varchar(5), col_20 varchar(3),
    col_21 varchar(5), col_22 varchar(10), col_23 varchar(10), col_24 varchar(12), col_25 datetime,
    col_26 datetime, col_27 varchar(4), col_28 varchar(4), col_29 varchar(4), col_30 varchar(4), col_31 varchar(4),
    col_32 varchar(21), col_33 varchar(11), col_34 varchar(6), col_35 varchar(2), col_36 varchar(6), col_37 varchar(5),
    col_38 varchar(1), col_39 varchar(6), col_40 varchar(7), col_41 varchar(11), col_42 varchar(2), col_43 varchar(3),
    col_44 varchar(1), col_45 varchar(12), col_46 varchar(3), col_47 varchar(3), col_48 varchar(5), col_49 varchar(17),
    col_50 varchar(25), col_51 numeric(38, 0), col_52 varchar(10), col_53 varchar(10), col_54 varchar(13)
)

EXEC sp_xml_removedocument @doc_id

SELECT datediff(ss, @time, getdate()) AS TimeElapsed

    And of course my original question...

    1) Is this a normal span of time for this amount of data?

    2) And...is there something that can be done to improve performance?

One thing to start with: thank you for the effort creating the test data!!

It made it really easy to work on and to have something to compare to! 🙂

I've tried three versions to get the identical result:

1) temp table and index on xml column together with XQuery

2) XQuery and the xml variable

3) the test code you provided

What I found is that using a temp table with an XML index takes two-thirds of the original time, where populating the temp table and building the XML index take almost as long as querying the data.

Since you're running multiple requests against the xml data, I'd try both versions (temp table and XQuery) and see which one performs best. Please note that you should define the XML index after the INSERT statement; otherwise performance will slow down by another 3 seconds (at least on my system).

    The index itself needs to be created just once. After that you can run your multiple queries.

    If you don't mind sharing your test results and the way you finally decided to go I'd really appreciate the feedback.

-- solution 1: using indexed temp table
SET @time = getdate()

CREATE TABLE #xml (
    id INT identity(1,1),
    data xml,
    CONSTRAINT [PK_#xml_id] PRIMARY KEY CLUSTERED (id ASC)
)

INSERT INTO #xml(data)
SELECT @x

CREATE PRIMARY XML INDEX Idx_#xml_data ON #xml (data);

SELECT 'populate temp Table', datediff(ss, @time, getdate()) AS TimeElapsedSec, datediff(ms, @time, getdate()) AS TimeElapsedMs

SET @time = getdate()

SELECT
    c.value('@col_01','varchar(30)') AS col_01,
    c.value('@col_02','varchar(30)') AS col_02,
    c.value('@col_04','varchar(30)') AS col_04,
    c.value('@col_05','varchar(30)') AS col_05,
    c.value('@col_06','varchar(30)') AS col_06,
    c.value('@col_08','varchar(30)') AS col_08,
    c.value('@col_09','varchar(30)') AS col_09,
    c.value('@col_12','varchar(30)') AS col_12,
    c.value('@col_13','varchar(30)') AS col_13,
    c.value('@col_14','varchar(30)') AS col_14,
    c.value('@col_15','varchar(30)') AS col_15,
    c.value('@col_16','varchar(30)') AS col_16,
    c.value('@col_17','varchar(30)') AS col_17,
    c.value('@col_18','varchar(30)') AS col_18,
    c.value('@col_26','varchar(30)') AS col_26,
    c.value('@col_33','varchar(30)') AS col_33,
    c.value('@col_39','varchar(30)') AS col_39,
    c.value('@col_40','varchar(30)') AS col_40,
    c.value('@col_41','varchar(30)') AS col_41,
    c.value('@col_51','varchar(30)') AS col_51
INTO #t
FROM #xml
CROSS APPLY data.nodes('ItemList/Item') T(c)

SELECT 'temp Table', datediff(ss, @time, getdate()) AS TimeElapsedSec, datediff(ms, @time, getdate()) AS TimeElapsedMs

SET @time = getdate()

-- solution 2: using XQuery against the xml variable
SELECT
    c.value('@col_01','varchar(30)') AS col_01,
    c.value('@col_02','varchar(30)') AS col_02,
    c.value('@col_04','varchar(30)') AS col_04,
    c.value('@col_05','varchar(30)') AS col_05,
    c.value('@col_06','varchar(30)') AS col_06,
    c.value('@col_08','varchar(30)') AS col_08,
    c.value('@col_09','varchar(30)') AS col_09,
    c.value('@col_12','varchar(30)') AS col_12,
    c.value('@col_13','varchar(30)') AS col_13,
    c.value('@col_14','varchar(30)') AS col_14,
    c.value('@col_15','varchar(30)') AS col_15,
    c.value('@col_16','varchar(30)') AS col_16,
    c.value('@col_17','varchar(30)') AS col_17,
    c.value('@col_18','varchar(30)') AS col_18,
    c.value('@col_26','varchar(30)') AS col_26,
    c.value('@col_33','varchar(30)') AS col_33,
    c.value('@col_39','varchar(30)') AS col_39,
    c.value('@col_40','varchar(30)') AS col_40,
    c.value('@col_41','varchar(30)') AS col_41,
    c.value('@col_51','varchar(30)') AS col_51
INTO #t2
FROM @x.nodes('ItemList/Item') T(c)

SELECT 'xml variable XQuery', datediff(ss, @time, getdate()) AS TimeElapsedSec, datediff(ms, @time, getdate()) AS TimeElapsedMs

SET @time = getdate()

-- solution 3: using xml variable with OPENXML (original approach)

/* results
Step                     Time[sec]  Time[msec]
fill temp Table          4          4076
XQuery & temp Table      6          5280
XQuery & xml variable    10         10233
OPENXML & xml variable   16         15653
*/



    Lutz
    A pessimist is an optimist with experience.


  • Once I saw your code, I finally understood what you meant by temp tables and an index on the XML column. Thanks!

Here is the problem: once you add all the possible columns for the table, performance degrades fast.

    I modified your solution #1 to include every possible attribute within the XML file:

-- solution 1: using indexed temp table
SET @time = getdate()

CREATE TABLE #xml (
    id INT identity(1,1),
    data xml,
    CONSTRAINT [PK_#xml_id] PRIMARY KEY CLUSTERED (id ASC)
)

INSERT INTO #xml(data)
SELECT @x

CREATE PRIMARY XML INDEX Idx_#xml_data ON #xml (data);

SELECT 'populate temp Table', datediff(ss, @time, getdate()) AS TimeElapsedSec, datediff(ms, @time, getdate()) AS TimeElapsedMs

SET @time = getdate()

SELECT
    c.value('@col_01', 'varchar(6)') AS col_01,
    c.value('@col_02', 'varchar(7)') AS col_02,
    c.value('@col_03', 'varchar(3)') AS col_03,
    c.value('@col_04', 'varchar(1)') AS col_04,
    c.value('@col_05', 'varchar(4)') AS col_05,
    c.value('@col_06', 'varchar(11)') AS col_06,
    c.value('@col_07', 'varchar(1)') AS col_07,
    c.value('@col_08', 'varchar(14)') AS col_08,
    c.value('@col_09', 'varchar(12)') AS col_09,
    c.value('@col_10', 'varchar(2)') AS col_10,
    c.value('@col_11', 'varchar(1)') AS col_11,
    c.value('@col_12', 'varchar(2)') AS col_12,
    c.value('@col_13', 'varchar(5)') AS col_13,
    c.value('@col_14', 'numeric(38, 0)') AS col_14,
    c.value('@col_15', 'numeric(38, 0)') AS col_15,
    c.value('@col_16', 'numeric(38, 0)') AS col_16,
    c.value('@col_17', 'numeric(38, 0)') AS col_17,
    c.value('@col_18', 'numeric(38, 0)') AS col_18,
    c.value('@col_19', 'varchar(5)') AS col_19,
    c.value('@col_20', 'varchar(3)') AS col_20,
    c.value('@col_21', 'varchar(5)') AS col_21,
    c.value('@col_22', 'varchar(10)') AS col_22,
    c.value('@col_23', 'varchar(10)') AS col_23,
    c.value('@col_24', 'varchar(12)') AS col_24,
    c.value('@col_25', 'datetime') AS col_25,
    c.value('@col_26', 'datetime') AS col_26,
    c.value('@col_27', 'varchar(4)') AS col_27,
    c.value('@col_28', 'varchar(4)') AS col_28,
    c.value('@col_29', 'varchar(4)') AS col_29,
    c.value('@col_30', 'varchar(4)') AS col_30,
    c.value('@col_31', 'varchar(4)') AS col_31,
    c.value('@col_32', 'varchar(21)') AS col_32,
    c.value('@col_33', 'varchar(11)') AS col_33,
    c.value('@col_34', 'varchar(6)') AS col_34,
    c.value('@col_35', 'varchar(2)') AS col_35,
    c.value('@col_36', 'varchar(6)') AS col_36,
    c.value('@col_37', 'varchar(5)') AS col_37,
    c.value('@col_38', 'varchar(1)') AS col_38,
    c.value('@col_39', 'varchar(6)') AS col_39,
    c.value('@col_40', 'varchar(7)') AS col_40,
    c.value('@col_41', 'varchar(11)') AS col_41,
    c.value('@col_42', 'varchar(2)') AS col_42,
    c.value('@col_43', 'varchar(3)') AS col_43,
    c.value('@col_44', 'varchar(1)') AS col_44,
    c.value('@col_45', 'varchar(12)') AS col_45,
    c.value('@col_46', 'varchar(3)') AS col_46,
    c.value('@col_47', 'varchar(3)') AS col_47,
    c.value('@col_48', 'varchar(5)') AS col_48,
    c.value('@col_49', 'varchar(17)') AS col_49,
    c.value('@col_50', 'varchar(25)') AS col_50,
    c.value('@col_51', 'numeric(38, 0)') AS col_51,
    c.value('@col_52', 'varchar(10)') AS col_52,
    c.value('@col_53', 'varchar(10)') AS col_53,
    c.value('@col_54', 'varchar(13)') AS col_54
INTO #t
FROM #xml
CROSS APPLY data.nodes('ItemList/Item') T(c)

SELECT 'temp Table', datediff(ss, @time, getdate()) AS TimeElapsedSec, datediff(ms, @time, getdate()) AS TimeElapsedMs

SET @time = getdate()

    On my machine I got the following results:

                     TimeElapsedSec  TimeElapsedMs
-------------------  --------------  -------------
populate temp Table  4               4016
temp Table           67              66813

Everything I've read suggests I need to create an XML SCHEMA COLLECTION object within the database in order to overcome this (at least when using the nodes function).
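For reference, the general shape of that would be something like the sketch below (the schema content here is made up and far smaller than the real one; a schema generated from the C# types would need its non-compliant parts trimmed first):

-- Register a schema collection describing the attribute-centric rows
CREATE XML SCHEMA COLLECTION ItemSchema AS N'
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="ItemList">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Item" maxOccurs="unbounded">
          <xs:complexType>
            <xs:attribute name="col_01" type="xs:string"/>
            <xs:attribute name="col_14" type="xs:integer"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>'
GO

-- A typed variable (or column) validates on assignment, and value()/nodes()
-- can then skip the runtime type checks that untyped xml pays for
DECLARE @typed xml(ItemSchema)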

  • It looks like the number of attributes causes the long import time.

Would there be any benefit in changing those attributes to elements for testing purposes? In other words, would you have a chance to change the XML structure if it helped performance?
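If so, the element-centric query side would look something like this (a sketch with a two-column toy document; the text() accessor is the usual way to keep element retrieval cheap):

DECLARE @xml xml
SET @xml = N'<ItemList>
    <Item><col_01>ABCDEF</col_01><col_02>1234567</col_02></Item>
</ItemList>'

SELECT
    c.value('(col_01/text())[1]', 'varchar(6)') AS col_01,  -- text() avoids scanning deeper nodes
    c.value('(col_02/text())[1]', 'varchar(7)') AS col_02
FROM @xml.nodes('ItemList/Item') AS T(c)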



    Lutz
    A pessimist is an optimist with experience.


Unfortunately our team doesn't own the structures, and the idea was to make as few changes as possible on the non-database side of the software.
