Speed improvement help needed for large xml queries

Question

Speed improvement help needed for large xml queries

J Vanderhorst

SSC Rookie

Points: 42
More actions
April 23, 2010 at 7:38 am

#99137

I have a table in my database that is very large. It is basically a table of XML documents. The table contains a field with an XML data type that can average around 1mb of data for each record. My table currently has over a million records, and it is growing quickly. The problem I'm having is that I need to run queries that return values from specific nodes within the xml documents. When ever I do this the query can take a very long time. I cannot index the xml because there are many different types of documents so there are 100+ different schemas. Is there any way to speed this up? I've tried to create XML indexes on views that filter out specific schemas, but it doesn't look like that is possible. Does anyone have any suggestions on how to speed these queries up? Here is an example of a simplified query that would be slow. The queries I am actually using are much more complex.
WITH XMLNAMESPACES ('http://schemas.somedomain.com/infopath/2003/appname' AS my)
SELECT
DocumentXML.value('(/my:myFields/my:PECode)[1]','varchar(20)') AS PECode
FROM
dbo.Documents
WHERE
DocumentType = 236 AND
DocumentXML.value('(/my:myFields/my:SelectedProductID)[1]','int') = 100;

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply

LutzM SSC Guru Points: 107049 More actions · Answer 1

Would you mind providing a short example of one of your xml file so we have something to test our solution against? Would help (at least me...)

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

J Vanderhorst SSC Rookie Points: 42 More actions · Answer 2

Here's a stripped down version:

DECLARE @tbl AS TABLE (DocumentType INT, DocumentXML XML);

DECLARE @XML XML;

SELECT @XML = '

<my:myFields xmlns:my="http://schemas.somedomain.com/infopath/2003/appname" xml:lang="en-us">

<my:SelectedProductID>100</my:SelectedProductID>

<my:PECode>4.2.A</my:PECode>

</my:myFields>'

INSERT INTO @tbl (DocumentType, DocumentXML)

VALUES (236, @XML);

WITH XMLNAMESPACES ('http://schemas.somedomain.com/infopath/2003/appname' AS my)

SELECT

DocumentXML.value('(/my:myFields/my:PECode)[1]','varchar(20)') AS PECode

FROM

@tbl

WHERE

DocumentType = 236 AND

DocumentXML.value('(/my:myFields/my:SelectedProductID)[1]','int') = 100;

LutzM SSC Guru Points: 107049 More actions · Answer 3

I'm not sure if this will help performance... but it's worth a try 😉

DECLARE @var INT

SET @var = 100;

WITH XMLNAMESPACES ('http://schemas.somedomain.com/infopath/2003/appname' AS my)

SELECTc.value('../my:PECode[1]','varchar(20)') AS PECode

FROM@tbl tbl

CROSS APPLY

tbl.DocumentXML.nodes('/my:myFields/my:SelectedProductID[text()=sql:variable("@var")]') T(c)

WHERE

DocumentType = 236

Edit: Ooops!! Forgot to put the WHERE clause back in... Sorry!!

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

LutzM SSC Guru Points: 107049 More actions · Answer 4

Is this thread related to your other recent one by any chance? 😉

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

J Vanderhorst SSC Rookie Points: 42 More actions · Answer 5

I just tested out your method with a larger set of data and there is no significant speed improvement. Any other ideas?

..

Why yes, they are related. These queries do tend to bog the server down. I'm hoping I can get this one resolved so that the other isn't an issue anymore... but if not - then I'll need to do something about the performance hits.

LutzM SSC Guru Points: 107049 More actions · Answer 6

Would it be possible for you to shred the xml files into relational table(s) (including proper indexing)? Maybe just with holding the values queried frequently?

A question on top: Do you use any kind of loops/c.u.r.s.o.r. to query the data? (just wondering why you have a defined value for "SelectedProductID")...

Edit: reason for asking to shred the xml data: Since you cannot add an XML index you won't be able to benefit from the SQL Server internal optimization (which is, afaik, pretty much the same concept like the shredded data set).

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

Jeff Moden SSC Guru Points: 1004434 More actions · Answer 7

lmu92 (4/23/2010)
Would it be possible for you to shred the xml files into relational table(s) (including proper indexing)? Maybe just with holding the values queried frequently?
A question on top: Do you use any kind of loops/c.u.r.s.o.r. to query the data? (just wondering why you have a defined value for "SelectedProductID")...
Edit: reason for asking to shred the xml data: Since you cannot add an XML index you won't be able to benefit from the SQL Server internal optimization (which is, afaik, pretty much the same concept like the shredded data set).

I agree. IMHO, storing raw XML is as bad or worse than storing CSV's. Both are denormalized... both require special handling... neither can be optimized.

I believe that you'll also find that, once properly normalized, the space requirements will be about 1/16th of what you currently have.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

LutzM SSC Guru Points: 107049 More actions · Answer 8

Jeff Moden (4/23/2010)
...
I agree. IMHO, storing raw XML is as bad or worse than storing CSV's. Both are denormalized... both require special handling... neither can be optimized.
I believe that you'll also find that, once properly normalized, the space requirements will be about 1/16th of what you currently have.

I'd like to (partially) disagree throwing CSV's and XML in the same pot. Especially in terms of optimization: if you can index an XML column, the index itself is stored in a shredded structure (IIRC, but I can't find the source right now). So it is possible to optimize XML. Not to the degree like it would be possible with relational data but better than plain CSV's anyway.

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

Jeff Moden SSC Guru Points: 1004434 More actions · Answer 9

lmu92 (4/24/2010)
Jeff Moden (4/23/2010)
...
I agree. IMHO, storing raw XML is as bad or worse than storing CSV's. Both are denormalized... both require special handling... neither can be optimized.
I believe that you'll also find that, once properly normalized, the space requirements will be about 1/16th of what you currently have.
I'd like to (partially) disagree throwing CSV's and XML in the same pot. Especially in terms of optimization: if you can index an XML column, the index itself is stored in a shredded structure (IIRC, but I can't find the source right now). So it is possible to optimize XML. Not to the degree like it would be possible with relational data but better than plain CSV's anyway.

Heh... I'll take that correction... thanks, Lutz. I should have said that both XML and CSV's are a form of denormalization that should be avoided in favor of proper data structure within the database even if XML can be "indexed". Considering the label bloat in XML, I don't consider XML to be an improvement over CSV's for the transmission of data, either.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

LutzM SSC Guru Points: 107049 More actions · Answer 10

Jeff Moden (4/24/2010)
...
Heh... I'll take that correction... thanks, Lutz. I should have said that both XML and CSV's are a form of denormalization that should be avoided in favor of proper data structure within the database even if XML can be "indexed". Considering the label bloat in XML, I don't consider XML to be an improvement over CSV's for the transmission of data, either.

Now that we are able to use table variables as a parameter for stored procedures, I agree.

Here's a scenario where I preferred XML over a staging table to be filled by a 3rd party app: we're "forced" to use a middleware to communicate with a DB2 database.

Those "man-in-the middle" would charge two project days if we would like to add a column or even just change the datatype of a column (happens quite frequently...).

So we (the DB2 folks and our group) agreed to transfer the data in xml format into a single column. The middleware folks were kinda upset about it. But transferring XML data was part of the contract 😀

(side note: transfer rate once or twice a day, data volume 2MB per xml file).

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

Jeff Moden SSC Guru Points: 1004434 More actions · Answer 11

BWAA-HAA!!!!... you're talking to the wrong guy about XML. There's not much in my life that I take the time to hate but I hate XML. It's hierarchical in nature (even when there's no hierarchy to be had) and SQL Server is not. It has two labels for every element which typically bloats integer data by a factor of 16. Yeah... that's what I want to do... let me FTP 16 GB of data, store it, back it up, shred it and finally use it instead of FTPing just 1 GB, storing just 1GB, backing up just 1GB, and doing a very high speed import of flat data into properly normalized tables.

XML is tough on the "pipe", tough on storage, tough on backups, and tough on the server. I'd much rather work with normalized data even if it took 10 files to just 1 XML file with hierarchical data in it.

It also strikes me as very odd that some people who love XML hate EDI. One is as bad as the other because they're both hierarchical in nature... and relational databases are not. 😉

Guess I'm going to have to do what I said when XML first came out and finally write "JML". 😛

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

LutzM SSC Guru Points: 107049 More actions · Answer 12

@jeff: it's not that I'm actually advertising XML, I'm sure not. 😉

I just had to get used to it in order to get those "man in the middle" out of our way. So, it's more like " my enemy's enemy is my friend" 😀

When it comes down to choose between XML and a signifcant project delay including ridiculous costs, what do you?...

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

Jeff Moden SSC Guru Points: 1004434 More actions · Answer 13

lmu92 (4/24/2010)
@Jeff: it's not that I'm actually advertising XML, I'm sure not. 😉
I just had to get used to it in order to get those "man in the middle" out of our way. So, it's more like " my enemy's enemy is my friend" 😀
When it comes down to choose between XML and a signifcant project delay including ridiculous costs, what do you?...

Sorry Lutz... I was just venting. I appreciate the position you were in and may have done the same. Of course, the first thing I would have done is make a personal visit to the data provider with a bucket of pork chops to see if I could get them to provide the data in a more reasonable format first (they normally cheerfully make the change because they usually don't like XML either and sometimes I have to show them how to do it) but, if unsuccessful, I'd have done what you did.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)