April 23, 2010 at 7:38 am
I have a very large table in my database; it is basically a table of XML documents. The table contains a column with an XML data type that averages around 1 MB of data per record. The table currently has over a million records, and it is growing quickly.
The problem I'm having is that I need to run queries that return values from specific nodes within the XML documents, and whenever I do, the query can take a very long time. I cannot index the XML column because there are many different types of documents, so there are 100+ different schemas. I've tried to create XML indexes on views that filter out specific schemas, but it doesn't look like that is possible.
Does anyone have any suggestions on how to speed these queries up? Here is a simplified example of a query that would be slow; the queries I am actually using are much more complex.
WITH XMLNAMESPACES ('http://schemas.somedomain.com/infopath/2003/appname' AS my)
SELECT
DocumentXML.value('(/my:myFields/my:PECode)[1]','varchar(20)') AS PECode
FROM
dbo.Documents
WHERE
DocumentType = 236 AND
DocumentXML.value('(/my:myFields/my:SelectedProductID)[1]','int') = 100;
April 23, 2010 at 11:47 am
Would you mind providing a short example of one of your XML files so we have something to test a solution against? It would help (at least me...)
April 23, 2010 at 12:28 pm
Here's a stripped down version:
DECLARE @tbl AS TABLE (DocumentType INT, DocumentXML XML);
DECLARE @XML XML;
SELECT @XML = '
<my:myFields xmlns:my="http://schemas.somedomain.com/infopath/2003/appname" xml:lang="en-us">
<my:SelectedProductID>100</my:SelectedProductID>
<my:PECode>4.2.A</my:PECode>
</my:myFields>'
INSERT INTO @tbl (DocumentType, DocumentXML)
VALUES (236, @XML);
WITH XMLNAMESPACES ('http://schemas.somedomain.com/infopath/2003/appname' AS my)
SELECT
DocumentXML.value('(/my:myFields/my:PECode)[1]','varchar(20)') AS PECode
FROM
@tbl
WHERE
DocumentType = 236 AND
DocumentXML.value('(/my:myFields/my:SelectedProductID)[1]','int') = 100;
April 23, 2010 at 1:11 pm
I'm not sure if this will help performance... but it's worth a try 😉
DECLARE @var INT
SET @var = 100;
WITH XMLNAMESPACES ('http://schemas.somedomain.com/infopath/2003/appname' AS my)
SELECT
c.value('../my:PECode[1]','varchar(20)') AS PECode
FROM
@tbl tbl
CROSS APPLY
tbl.DocumentXML.nodes('/my:myFields/my:SelectedProductID[text()=sql:variable("@var")]') T(c)
WHERE
DocumentType = 236
Edit: Ooops!! Forgot to put the WHERE clause back in... Sorry!!
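Another variation that is sometimes worth timing is moving the predicate into an exist() test in the WHERE clause instead of .value(); this is only a sketch against the dbo.Documents table from the original post, and @var simply holds the literal 100:
```sql
DECLARE @var INT;
SET @var = 100;

WITH XMLNAMESPACES ('http://schemas.somedomain.com/infopath/2003/appname' AS my)
SELECT
    DocumentXML.value('(/my:myFields/my:PECode)[1]', 'varchar(20)') AS PECode
FROM
    dbo.Documents
WHERE
    DocumentType = 236 AND
    -- exist() returns 1 when any SelectedProductID node matches @var
    DocumentXML.exist('/my:myFields/my:SelectedProductID[. = sql:variable("@var")]') = 1;
```
Whether this helps depends on the plan the optimizer picks; without an XML index, both forms still have to parse every document that survives the DocumentType filter.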
April 23, 2010 at 1:41 pm
Is this thread related to your other recent one by any chance? 😉
April 23, 2010 at 2:02 pm
I just tested out your method with a larger set of data and there is no significant speed improvement. Any other ideas?
..
Why yes, they are related. These queries do tend to bog the server down. I'm hoping I can get this one resolved so that the other isn't an issue anymore... but if not - then I'll need to do something about the performance hits.
April 23, 2010 at 2:51 pm
Would it be possible for you to shred the XML files into relational table(s) (including proper indexing)? Maybe just holding the values that are queried frequently?
A question on top: do you use any kind of loops/cursors to query the data? (Just wondering why you have a fixed value for "SelectedProductID"...)
Edit: the reason for asking you to shred the XML data: since you cannot add an XML index, you won't be able to benefit from SQL Server's internal optimization (which is, afaik, pretty much the same concept as a shredded data set).
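A minimal sketch of the shredding idea, assuming a hypothetical DocumentValues side table and a DocumentID key on dbo.Documents (neither appears in the original post, so the names are illustrative):
```sql
-- Hypothetical side table holding just the frequently queried values,
-- shredded out of DocumentXML once at insert/update time.
CREATE TABLE dbo.DocumentValues (
    DocumentID        INT         NOT NULL,  -- assumed FK to dbo.Documents
    DocumentType      INT         NOT NULL,
    SelectedProductID INT         NULL,
    PECode            VARCHAR(20) NULL
);

CREATE INDEX IX_DocumentValues_Type_Product
    ON dbo.DocumentValues (DocumentType, SelectedProductID)
    INCLUDE (PECode);

-- One-time (or trigger-driven) population from the XML column:
WITH XMLNAMESPACES ('http://schemas.somedomain.com/infopath/2003/appname' AS my)
INSERT INTO dbo.DocumentValues (DocumentID, DocumentType, SelectedProductID, PECode)
SELECT  d.DocumentID,
        d.DocumentType,
        d.DocumentXML.value('(/my:myFields/my:SelectedProductID)[1]', 'int'),
        d.DocumentXML.value('(/my:myFields/my:PECode)[1]', 'varchar(20)')
FROM    dbo.Documents d
WHERE   d.DocumentType = 236;
```
After that, the hot query becomes a plain indexed seek on DocumentValues, with no XQuery at read time.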
April 23, 2010 at 7:58 pm
lmu92 (4/23/2010)
Would it be possible for you to shred the XML files into relational table(s) (including proper indexing)? Maybe just holding the values that are queried frequently?
A question on top: do you use any kind of loops/cursors to query the data? (Just wondering why you have a fixed value for "SelectedProductID"...)
Edit: the reason for asking you to shred the XML data: since you cannot add an XML index, you won't be able to benefit from SQL Server's internal optimization (which is, afaik, pretty much the same concept as a shredded data set).
I agree. IMHO, storing raw XML is as bad or worse than storing CSV's. Both are denormalized... both require special handling... neither can be optimized.
I believe that you'll also find that, once properly normalized, the space requirements will be about 1/16th of what you currently have.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 24, 2010 at 2:29 am
Jeff Moden (4/23/2010)
...
I agree. IMHO, storing raw XML is as bad or worse than storing CSV's. Both are denormalized... both require special handling... neither can be optimized.
I believe that you'll also find that, once properly normalized, the space requirements will be about 1/16th of what you currently have.
I'd like to (partially) disagree with throwing CSV's and XML into the same pot, especially in terms of optimization: if you can index an XML column, the index itself is stored in a shredded structure (IIRC, but I can't find the source right now). So it is possible to optimize XML: not to the degree possible with relational data, but better than plain CSV's anyway.
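For anyone who *can* index (i.e., a single schema), the syntax is roughly as below; this assumes dbo.Documents has a DocumentID column suitable for a clustered primary key, which the original post doesn't state:
```sql
-- A primary XML index requires a clustered primary key on the base table.
ALTER TABLE dbo.Documents
    ADD CONSTRAINT PK_Documents PRIMARY KEY CLUSTERED (DocumentID);

-- The primary XML index shreds the whole column into an internal node table.
CREATE PRIMARY XML INDEX PXML_Documents_DocumentXML
    ON dbo.Documents (DocumentXML);

-- Optional secondary index tuned for path-based lookups like .value()/.exist():
CREATE XML INDEX SXML_Documents_DocumentXML_Path
    ON dbo.Documents (DocumentXML)
    USING XML INDEX PXML_Documents_DocumentXML
    FOR PATH;
```
That internal node table is the "shredded structure" mentioned above, which is why hand-shredding into your own table buys you much the same benefit when the index itself isn't an option.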
April 24, 2010 at 9:38 am
lmu92 (4/24/2010)
Jeff Moden (4/23/2010)
...
I agree. IMHO, storing raw XML is as bad or worse than storing CSV's. Both are denormalized... both require special handling... neither can be optimized.
I believe that you'll also find that, once properly normalized, the space requirements will be about 1/16th of what you currently have.
I'd like to (partially) disagree with throwing CSV's and XML into the same pot, especially in terms of optimization: if you can index an XML column, the index itself is stored in a shredded structure (IIRC, but I can't find the source right now). So it is possible to optimize XML: not to the degree possible with relational data, but better than plain CSV's anyway.
Heh... I'll take that correction... thanks, Lutz. I should have said that both XML and CSV's are a form of denormalization that should be avoided in favor of proper data structure within the database even if XML can be "indexed". Considering the label bloat in XML, I don't consider XML to be an improvement over CSV's for the transmission of data, either.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 24, 2010 at 10:20 am
Jeff Moden (4/24/2010)
...
Heh... I'll take that correction... thanks, Lutz. I should have said that both XML and CSV's are a form of denormalization that should be avoided in favor of proper data structure within the database even if XML can be "indexed". Considering the label bloat in XML, I don't consider XML to be an improvement over CSV's for the transmission of data, either.
Now that we are able to use table variables as a parameter for stored procedures, I agree.
Here's a scenario where I preferred XML over a staging table filled by a 3rd-party app: we're "forced" to use middleware to communicate with a DB2 database.
That "man in the middle" would charge two project days if we wanted to add a column or even just change the data type of a column (which happens quite frequently...).
So we (the DB2 folks and our group) agreed to transfer the data in XML format in a single column. The middleware folks were kinda upset about it, but transferring XML data was part of the contract 😀
(Side note: transfer rate once or twice a day, data volume 2 MB per XML file.)
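The table-valued-parameter alternative mentioned at the top of this post looks roughly like the sketch below (SQL Server 2008+; the type and procedure names are made up for illustration):
```sql
-- Hypothetical table type describing the rows being passed in.
CREATE TYPE dbo.ProductValueList AS TABLE (
    SelectedProductID INT         NOT NULL,
    PECode            VARCHAR(20) NULL
);
GO
CREATE PROCEDURE dbo.ImportProductValues
    @Rows dbo.ProductValueList READONLY  -- TVPs must be declared READONLY
AS
BEGIN
    -- Process the incoming set here (insert into a real table, join, etc.)
    SELECT SelectedProductID, PECode FROM @Rows;
END;
GO
-- Caller side:
DECLARE @r dbo.ProductValueList;
INSERT INTO @r VALUES (100, '4.2.A');
EXEC dbo.ImportProductValues @Rows = @r;
```
With that in place, a client can ship a whole rowset in one round trip without serializing it to XML first, which is why the option changes the trade-off discussed above.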
April 24, 2010 at 1:43 pm
BWAA-HAA!!!!... you're talking to the wrong guy about XML. There's not much in my life that I take the time to hate, but I hate XML. It's hierarchical in nature (even when there's no hierarchy to be had) and SQL Server is not. It has two labels for every element, which typically bloats integer data by a factor of 16. Yeah... that's what I want to do... let me FTP 16 GB of data, store it, back it up, shred it, and finally use it, instead of FTPing just 1 GB, storing just 1 GB, backing up just 1 GB, and doing a very high-speed import of flat data into properly normalized tables.
XML is tough on the "pipe", tough on storage, tough on backups, and tough on the server. I'd much rather work with normalized data even if it took 10 flat files to replace just 1 XML file with hierarchical data in it.
It also strikes me as very odd that some people who love XML hate EDI. One is as bad as the other because they're both hierarchical in nature... and relational databases are not. 😉
Guess I'm going to have to do what I said when XML first came out and finally write "JML". 😛
--Jeff Moden
Change is inevitable... Change for the better is not.
April 24, 2010 at 3:46 pm
@jeff: it's not that I'm actually advertising XML, I'm sure not. 😉
I just had to get used to it in order to get those "man in the middle" out of our way. So, it's more like "my enemy's enemy is my friend" 😀
When it comes down to choosing between XML and a significant project delay including ridiculous costs, what do you do?...
April 24, 2010 at 5:45 pm
lmu92 (4/24/2010)
@Jeff: it's not that I'm actually advertising XML, I'm sure not. 😉 I just had to get used to it in order to get those "man in the middle" out of our way. So, it's more like "my enemy's enemy is my friend" 😀
When it comes down to choosing between XML and a significant project delay including ridiculous costs, what do you do?...
Sorry Lutz... I was just venting. I appreciate the position you were in and may have done the same. Of course, the first thing I would have done is make a personal visit to the data provider with a bucket of pork chops to see if I could get them to provide the data in a more reasonable format first. (They normally make the change cheerfully, because they usually don't like XML either, and sometimes I have to show them how to do it.) But, if unsuccessful, I'd have done what you did.
--Jeff Moden
Change is inevitable... Change for the better is not.