June 20, 2006 at 3:06 pm
I have a simple table with 500K+ rows and two columns: id (int, indexed) and clob (ntext). I need to retrieve the top 1000 rows matching a list of row IDs that is stored somewhere else, typically ranging from 1,000 to 10,000 IDs per query.
Problem:
SELECT clob FROM table1 WHERE id IN (...)
takes forever to execute.
Alternatively, inserting thousands of IDs into a global temporary table and then doing a join works much faster, but still takes anywhere between 8 seconds and 1 minute, depending on the total ID count:
create table ##TMP (id int);
while ... begin
insert into ##TMP values (...);
end
select top 1000 table1.clob
from table1
inner join ##TMP on ##TMP.id = table1.id
Strangely enough, increasing the list size from 5,000 to 15,000 or even 20,000 IDs seems to improve the overall performance in some cases.
Questions:
1. Does anybody have a suggestion on how the lookup/retrieval time can be improved in this particular scenario (down to 1-2 seconds)?
2. If not, can anybody suggest a better way to store millions of XML documents (updated daily) than to keep them in a SQL Server table?
3. Has anybody observed SQL Server 2005 performing better under a seemingly greater load (as described above)?
Setup: 2x 3.0 GHz CPUs with 3 GB RAM, Windows Server 2003, SQL Server 2005
Thanks in advance.
-x
June 20, 2006 at 10:34 pm
1. Is there a pattern to the ID list, or is it just a random blast of IDs? If there's some method to the ID madness, then there should be a way to cook an extra column that you can cluster on instead of the ID, so the rows can be grabbed from sequential data pages. Or there could be a useful partitioning scheme you could derive (Enterprise Edition only).
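If such a pattern existed, a persisted computed column could bucket nearby IDs onto adjacent pages. Purely as a hypothetical sketch (the id/1000 bucketing rule is invented, and this assumes the table has no clustered index yet):
alter table table1 add id_bucket as (id / 1000) persisted;
-- cluster on the bucket so rows with nearby IDs land on the same data pages
create clustered index cix_table1_id_bucket on table1 (id_bucket, id);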
Also, I would advise moving the SELECT TOP 1000 to the query that builds the join table instead of the query against the big table. TOP X is applied after the rest of the query is processed, meaning the large table must be completely joined against the full list of IDs before the top 1000 rows can be determined. If you chop the small join table down before you even begin the query against the large table, less of the large table will be searched for matches to the smaller list of IDs.
If you're using TOP 1000 to control the amount of data returned (querying a list of, say, 10,000 IDs in 1,000-row batches), then create a second join table to join against the large table. For each batch of 1,000 rows, fill the second table with the 1,000 IDs you're looking for, and use that table to pull data from the big one.
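A minimal sketch of that batching approach (##BATCH is an invented name; table1 and clob are from your original post):
create table ##BATCH (id int);
-- load ##BATCH with the next 1,000 IDs from the full list, then:
select t.clob
from table1 t
inner join ##BATCH b on b.id = t.id;
-- clear the batch table before loading the next 1,000 IDs
truncate table ##BATCH;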
2. Why are you storing those XML documents as ntext instead of XML? If you're committed to not using the XML data type, why ntext instead of text?
3. For the performance improvement with larger join sets, have you looked at the query plans? I'd guess you crossed a sweet spot somewhere in there, past which the optimizer determined a table scan was the better way to go.
Running 5,000 seeks through the index tree of a table with half a million rows piles on a lot of logical reads. If your index tree is 5 levels deep, that's 25,000 logical reads.
However, for a table scan... if you didn't specify Text In Row when you built the table, your rows are 20 bytes wide (4-byte integer + 16-byte pointer to the clob)... depending on your fillfactor, let's just call it about 250 rows per data page. Based on those (awfully loose) assumptions, a complete table scan would take about 2,000 logical reads. If that ID number is an identity, you could safely use a 100 percent fillfactor, making those table scans really move.
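One way to check those numbers on your end: SET STATISTICS IO prints logical reads per table, and the actual execution plan shows whether the optimizer chose a seek or a scan. Using the query from your first post:
set statistics io on;
select top 1000 table1.clob
from table1
inner join ##TMP on ##TMP.id = table1.id;
set statistics io off;
-- the Messages tab lists 'logical reads' per table; compare runs with
-- 5,000 vs. 20,000 IDs in ##TMP to see where the plan flips to a scan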
-Eddie
Eddie Wuerch
MCM: SQL
June 22, 2006 at 2:12 am
Eddie,
thank you for the prompt and detailed answer.
> 1. Is there a pattern to the ID list, or is it just a random blast of IDs?
Random.
> Also, I would advise moving the SELECT TOP 1000 to the query building the join table instead of on the query against the big table.
After making certain changes to the ID lookup routine, the total number of IDs has decreased dramatically and is likely to stay that way. So at this point we are only concentrating on the scenario with just 1,000 rows in the join table.
>2. Why are you storing those XML documents as ntext instead of XML?
This is a major issue which is currently being addressed. One of the solutions under consideration is to store UTF-8-encoded XML data in an image-type column.
> If you're committed to not using the XML data type, why ntext instead of text?
Are there any obvious advantages or downsides to using the xml datatype over the image or text types?
> However, for a table scan... if you didn't specify Text In Row when you built the table
Text In Row is another option being considered. Are you saying that enabling this option may produce the opposite effect, or did I misinterpret your comment?
Thanks in advance.
-x
June 22, 2006 at 1:54 pm
>> 1. Is there a pattern to the ID list, or is it just a random blast of IDs?
> Random.
Shoot. Well, there goes that idea...
> Are there any obvious advantages or downsides to using xml datatype over the image or text types?
You can index, query, search, update, transform, etc., xml data using XQuery. If the schema of these documents is similar (or at least the documents matched by particular queries are similar), and you were only interested in specific data points out of a much larger document, you could pluck out those values and return them as typed columns (int, varchar, etc.) instead of sending back the full xml document. Or you could pull out just a specific fragment of a document, and so on. Basically, it turns the xml document from 'just these *&%^&^ things I store in the database' into living, breathing data. XQuery is pretty darn useful, IMHO.
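For instance, assuming the column were converted to the xml type and named doc (the order/customer structure below is invented purely for illustration), you could return one typed value per document instead of shipping whole documents:
-- pull a single typed value out of each matching document
select t.doc.value('(/order/customer/@id)[1]', 'int') as customer_id
from table1 t
where t.doc.exist('/order[@status = "open"]') = 1;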
I don't have the book with me that covered the topic, but I think I read that xml data stored in an xml column is partially or fully shredded and stored more efficiently than just a big string. If that sounds appealing, please do some digging to validate it before you take my word on it. I could be thinking of something else entirely.
> Are you saying that enabling this option may produce an opposite effect, or did I misinterpret your comment?
What the Text In Row option will do is attempt to store the text data in the data page with the rest of the row (in this case, just the ID column), instead of the 16-byte pointer. For smaller documents (up to roughly 3,500 characters for ntext or 7,000 for text, given the option's 7,000-byte cap), that saves the trouble of fetching the document through the pointer. If the docs are generally bigger than that, they won't fit in the data row and will be stored as text data outside the table, with a 16-byte pointer in the row that points to that location. The Text In Row setting, in that case, will change nothing.
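For reference, the option itself is just a table setting, and the in-row limit can be set anywhere from 24 to 7000 bytes (the table name here matches your earlier posts):
-- allow text/ntext values up to 7000 bytes to be stored in the data row;
-- anything larger keeps the usual 16-byte pointer
exec sp_tableoption 'table1', 'text in row', '7000';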
As for the effects:
For smaller resultsets that can be efficiently retrieved with an index, having that text on the data page instead of somewhere else may save some reads when fetching the data. If the rows being retrieved generally sit next to each other thanks to a clustered index that applies to most searches, it becomes an even more effective strategy. A fair number of caveats, there.
However, adding loads more data to each row will require more space to store each row. That will slash the number of records that can fit on one data page, and balloon the number of data pages for the table. If your documents are between 2,000 and 3,500 characters (still assuming the ntext datatype you have now), and they get stored in the row, that leaves you with one or two rows per data page, down from about 250 in my earlier example. A simple table scan then takes on the order of 500,000 logical reads (most of them from disk; that much data won't stay cached for subsequent reads), which will *really* drag compared to your prior performance. In your case, where table scans can be the best choice to retrieve your data, I'd gladly eat the cost of retrieving text data through pointers vs. adding that many pages to the scan.
So, IMHO, enabling Text in Row (in this case) would at best change nothing and at worst slow things down painfully.
-Eddie
Eddie Wuerch
MCM: SQL