April 22, 2010 at 10:25 am
Hello everyone,
I've recently started work in a new position and find myself dealing with a data warehouse that has grown organically over the last few years, with a number of developers having a go at various parts of the system. In general, all the code is stable and performs well, but we still get some failures that annoy us somewhat.
Recently one of our procedures stalled and eventually we had to kill the process and reboot the server to get it working correctly again. I took a look at the procedure that had been running and found that the code below was where the process seemed to hang. There was no error, just a complete lack of response. Once the server had been restarted, the process was run again and completed within a reasonable time (it's part of another process, so exact timings are not available).
I'm not sure where the problem actually lies, but the tables below are large. The "large_table" has about 11 million records and the "huge_table" has about 20 million. The "small_table" is around 1000 rows.
What I really want to know is whether it might be more efficient/less prone to failure if I change this code to use a persistent table rather than a temporary one. I'm no expert, but creating a clustered index on a temporary table of around 11 million rows isn't a good idea, or is it?
Please tell me if you think I'm barking up the wrong flag pole and, of course, if I've posted this in the wrong forum.
Thanks in advance,
SELECT ht.a_id
INTO #massive_temporary_table
FROM large_table AS lt
INNER JOIN huge_table AS ht
ON lt.a_id = ht.a_id
INNER JOIN dbo.small_table AS st
ON lt.code = st.code
WHERE ((LTRIM(RTRIM(lt.name)) = LTRIM(RTRIM(st.word))
AND st.match = 'Y')
OR (LTRIM(RTRIM(lt.name)) LIKE LTRIM(st.word) + ' %'
AND st.s_match = 'Y')
OR (LTRIM(RTRIM(lt.name)) LIKE '% ' + RTRIM(st.word)
AND st.e_match = 'Y')
OR (LTRIM(RTRIM(lt.name)) LIKE '% ' + LTRIM(RTRIM(st.word)) + ' %'
AND st.c_match = 'Y'))
GROUP BY ht.a_id
CREATE CLUSTERED INDEX idx_Tmp ON #massive_temporary_table (a_id)
April 22, 2010 at 1:07 pm
I think the first thing you should check, if you have not already, is where your tempdb is located and what size it is. If tempdb fills up, you can certainly run into a situation where SQL will stop responding. Note that the default is to place tempdb on the C: drive, so it may need to be moved to a larger drive to accommodate the larger size that may be required.
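A quick way to check, assuming you have enough rights to query sys.master_files, is something like this:
-- Where do the tempdb files live, and how big are they right now?
SELECT  name,
        physical_name,
        size * 8 / 1024 AS size_mb,
        CASE max_size WHEN -1 THEN 'unlimited'
             ELSE CAST(max_size * 8 / 1024 AS VARCHAR(20)) END AS max_size_mb,
        growth,
        is_percent_growth
FROM    sys.master_files
WHERE   database_id = DB_ID('tempdb')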
With that being said, I think it is a bad idea to ever place that large a dataset into a temp table, although I suspect you will get some varying opinions on this. I prefer to place the data into a staging table defined in a permanent database, where I can have indexes defined.
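Just to sketch what I mean (the table name is made up and I'm assuming a_id is an INT): a narrow staging table in a permanent work database, with the index already defined, that gets truncated and reloaded each run.
-- Hypothetical staging table in a permanent database instead of tempdb
CREATE TABLE dbo.stg_matched_a_id
(
    a_id INT NOT NULL
)

CREATE CLUSTERED INDEX IX_stg_matched_a_id ON dbo.stg_matched_a_id (a_id)

-- Each run: clear it out, then load it with the same SELECT posted above
TRUNCATE TABLE dbo.stg_matched_a_id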
Dan
If only I could snap my fingers and have all the correct indexes appear and the buffer clean and.... Start daydream here.
April 22, 2010 at 1:16 pm
Temp table or not, this code:
((LTRIM(RTRIM(lt.name))
is going to kill your performance. It's going to prevent any indexes on the tables from being used, and you'll see nothing but table scans. That's the very first thing I'd look at addressing.
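If the data itself can't be cleaned up at load time, one option is to do the trimming once and index the result, for example with a persisted computed column. This is only a sketch using the column names from the posted query; the index name is invented:
-- Trim once, persist it, index it, then compare against the computed column
ALTER TABLE dbo.large_table
    ADD name_trimmed AS LTRIM(RTRIM(name)) PERSISTED

CREATE INDEX IX_large_table_name_trimmed ON dbo.large_table (name_trimmed)
That only helps the equality branch, though; the LIKE comparisons with a leading '%' still won't be able to seek.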
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
April 22, 2010 at 1:51 pm
The LIKEs will also kill performance because they have a wildcard on each side. It's not the temp table that's your problem.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 22, 2010 at 1:57 pm
In fact, the LIKEs are a "hack" for doing a lookup on space-delimited data. If you want this to fly, that space-delimited data should probably be normalized. Since that data is a piddly 10k rows, it could be normalized very easily.
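Something along these lines is what I mean. It's only a sketch: the large_table_words name is invented, and the Tally CTE is just a quick stand-in for a proper Tally table.
-- Split each trimmed name into individual words, once, into a narrow "words" table
;WITH Tally (N) AS
(
    SELECT TOP (8000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM sys.all_columns ac1 CROSS JOIN sys.all_columns ac2
)
SELECT  lt.a_id,
        SUBSTRING(v.name, t.N, CHARINDEX(' ', v.name + ' ', t.N) - t.N) AS word
INTO    dbo.large_table_words
FROM    dbo.large_table AS lt
CROSS APPLY (SELECT LTRIM(RTRIM(lt.name)) AS name) AS v
CROSS JOIN Tally AS t
WHERE   t.N <= LEN(v.name)
  AND   SUBSTRING(' ' + v.name, t.N, 1) = ' '   -- a word starts here...
  AND   SUBSTRING(v.name, t.N, 1) <> ' '        -- ...and it isn't just another space

CREATE CLUSTERED INDEX IX_large_table_words ON dbo.large_table_words (word, a_id)
Once that exists, the "word appears anywhere" test becomes a straight equality join from large_table_words.word to small_table.word. The starts-with / ends-with variants would need the word position stored as well, but it's the same idea.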
--Jeff Moden
Change is inevitable... Change for the better is not.
April 22, 2010 at 2:02 pm
With all respect,
How would writing 20 million rows to a temp table not be part of the problem, even with the aforementioned query problems? I would be very surprised if this is not taking an extraordinary amount of time to write. If tempdb is sitting on drive C:, and unless drive C: has a tremendous amount of space, then to me that makes writing that many rows to a temp table a pretty bad idea.
Dan
If only I could snap my fingers and have all the correct indexes appear and the buffer clean and.... Start daydream here.
April 22, 2010 at 2:41 pm
Dan.Humphries (4/22/2010)
With all respect, how would writing 20 million rows to a temp table not be part of the problem, even with the aforementioned query problems? I would be very surprised if this is not taking an extraordinary amount of time to write. If tempdb is sitting on drive C:, and unless drive C: has a tremendous amount of space, then to me that makes writing that many rows to a temp table a pretty bad idea.
It wouldn't be any less of a problem writing to a permanent table in a permanent database. In fact, if the recovery model of that DB isn't SIMPLE, it could end up being a much larger problem.
The problem with the run time is that the WHERE clause predicates are not sargable.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 22, 2010 at 3:12 pm
Here's an example of why I say it's not the 20 million rows nor TempDB that's the problem. The following code generates 20 million rows of IDs, just like the OP's code, and it only takes 00:02:31 on a small 4-CPU server.
SELECT TOP 20000000
t1.Object_ID
INTO #JBMTest
FROM Master.sys.All_Columns t1,
Master.sys.All_Columns t2,
Master.sys.All_Columns t3
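If you want to see the index build on top of that, it's easy to tack onto the same test (timings will obviously vary with hardware):
SET STATISTICS TIME ON

-- Same kind of clustered index the OP creates, just on the test data
CREATE CLUSTERED INDEX IX_JBMTest ON #JBMTest (Object_ID)

SET STATISTICS TIME OFF

DROP TABLE #JBMTest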
--Jeff Moden
Change is inevitable... Change for the better is not.
April 22, 2010 at 3:25 pm
I see what you are saying, but surely a select of 20 million rows from a properly indexed table is not the same as writing 20 million rows to a temp table.
For example, many of our databases, when I first started, had the default configuration of holding tempdb on the C: drive. We use a SAN here, so our basic configuration does not allocate a massive portion to the C: drive. Now if you insert 20 million rows into tempdb, I would expect the database to be quite large. If the configuration is like ours was, where the C: drive was being used, then a 20 GB tempdb could be enough to limit the processing speed of the entire server: page files are reduced, virtual memory is reduced, and the entire server would ultimately slow down. Does the risk of filling up the C: drive with tempdb alone not create enough of a red flag? Is paging not also handled a little differently with tempdb?
Dan
If only I could snap my fingers and have all the correct indexes appear and the buffer clean and.... Start daydream here.
April 22, 2010 at 3:32 pm
Another thought: since the original post states that they had to reboot, would they not see a huge change in the amount of time it takes to run, since tempdb will have to autoexpand frequently during the write process?
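If that is what's happening, pre-sizing tempdb so it comes back up at a sensible size after a restart would at least take the autogrow out of the picture. The logical file names below are the defaults and the sizes are just placeholders for whatever actually fits the server:
-- Set tempdb's initial size and growth so a restart doesn't drop it back
-- to a small file that has to autogrow all the way through the load
ALTER DATABASE tempdb
MODIFY FILE (NAME = tempdev, SIZE = 20480MB, FILEGROWTH = 1024MB)

ALTER DATABASE tempdb
MODIFY FILE (NAME = templog, SIZE = 4096MB, FILEGROWTH = 512MB)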
Dan
If only I could snap my fingers and have all the correct indexes appear and the buffer clean and.... Start daydream here.
April 22, 2010 at 4:18 pm
Dan.Humphries (4/22/2010)
With all respect, how would writing 20 million rows to a temp table not be part of the problem, even with the aforementioned query problems? I would be very surprised if this is not taking an extraordinary amount of time to write. If tempdb is sitting on drive C:, and unless drive C: has a tremendous amount of space, then to me that makes writing that many rows to a temp table a pretty bad idea.
Of course that's an issue. But re-reading the question again, in case I missed something (again), the OP mentions that they have a 20-million-row table that's joining to others, and if you look at the WHERE clause, I'm assuming, probably accurately, that we're not writing all 20 million rows to the temp table, but some subset, possibly even a reasonable subset that, if properly indexed and queried, will perform just fine with the temp table. I sure could be wrong with that assumption, but without further evidence it's more likely than all 20 million rows being transferred, because I'll tell you this: all 20 million rows are being scanned, guaranteed.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
April 23, 2010 at 4:10 am
Wow!
Thanks for all the comments; much of what everyone has said has confirmed my suspicions. Basically I'm going to try a few tests with some large tables and see if I can improve performance and prevent the "hanging" situation by improving the select statement and possibly using a persistent table for the staging.
In answer to some of the queries that have arisen:
Yes, the WHERE clause in the select statement is pretty nasty and is something that I'm going to investigate. The clause looks for words that appear somewhere in an address, so it's always going to be a bit awkward to code (I was thinking of splitting each address up into its components and then running the comparison on each word thus found). The 10k-row table is already normalised; it's the 20-million-row one that's not :-).
Yes, there can be up to 20 million rows selected into the temporary table and almost always there are over 10 million.
No, I don't know where our tempdb is, but I'm going to ask our DBA at our next design meeting and see if we have a space problem there.
Yes, we are running out of space on some of our servers and we are working on reconfiguring our SAN to free some up. Thus we may be running low on tempdb space too.
Thanks again for all the input, I may well re-post once I've done some tests so we can all benefit from what I find.
Thanks,
Andrew
April 23, 2010 at 6:18 am
ajrendall (4/23/2010)
Wow! Thanks for all the comments; much of what everyone has said has confirmed my suspicions. Basically I'm going to try a few tests with some large tables and see if I can improve performance and prevent the "hanging" situation by improving the select statement and possibly using a persistent table for the staging.
In answer to some of the queries that have arisen:
Yes, the WHERE clause in the select statement is pretty nasty and is something that I'm going to investigate. The clause looks for words that appear somewhere in an address, so it's always going to be a bit awkward to code (I was thinking of splitting each address up into its components and then running the comparison on each word thus found). The 10k-row table is already normalised; it's the 20-million-row one that's not :-).
Yes, there can be up to 20 million rows selected into the temporary table and almost always there are over 10 million.
No, I don't know where our tempdb is, but I'm going to ask our DBA at our next design meeting and see if we have a space problem there.
Yes, we are running out of space on some of our servers and we are working on reconfiguring our SAN to free some up. Thus we may be running low on tempdb space too.
Thanks again for all the input, I may well re-post once I've done some tests so we can all benefit from what I find.
Thanks,
Andrew
Ah well, so much for my assumption. You were right, Dan.
Why on earth would you be moving 10 million rows around? It sounds like the process needs to be reexamined more than the query.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
April 23, 2010 at 9:14 am
Dan.Humphries (4/22/2010)
I see what you are saying, but surely a select of 20 million rows from a properly indexed table is not the same as writing 20 million rows to a temp table. For example, many of our databases, when I first started, had the default configuration of holding tempdb on the C: drive. We use a SAN here, so our basic configuration does not allocate a massive portion to the C: drive. Now if you insert 20 million rows into tempdb, I would expect the database to be quite large. If the configuration is like ours was, where the C: drive was being used, then a 20 GB tempdb could be enough to limit the processing speed of the entire server: page files are reduced, virtual memory is reduced, and the entire server would ultimately slow down. Does the risk of filling up the C: drive with tempdb alone not create enough of a red flag? Is paging not also handled a little differently with tempdb?
Actually, I don't have a problem with any of that; those are all good points that folks need to consider. What I am concerned about is also what you said in one of your posts on this thread, and it begs to be answered... WHY??? Why does this need to be done? What ARE the business requirements that require the movement of 20 million IDs from a single table to tempdb? Like you, I'm thinking this should be a one-time run where the original table has a flag column that gets marked.
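In other words, something along these lines instead of copying the IDs out. The is_matched column is strictly an invention here to show the idea, and only the first word-match branch is shown; the rest come straight from the original WHERE clause.
-- Mark the rows in place instead of copying millions of IDs to tempdb
ALTER TABLE dbo.huge_table
    ADD is_matched BIT NOT NULL CONSTRAINT DF_huge_table_is_matched DEFAULT (0)

UPDATE  ht
SET     ht.is_matched = 1
FROM    dbo.huge_table AS ht
WHERE   EXISTS (SELECT 1
                FROM  dbo.large_table AS lt
                JOIN  dbo.small_table AS st
                  ON  lt.code = st.code
                WHERE lt.a_id = ht.a_id
                  AND LTRIM(RTRIM(lt.name)) = LTRIM(RTRIM(st.word))
                  AND st.match = 'Y')  -- ...plus the other three word-match branches from the original
For 10 to 20 million rows that UPDATE would want to be done in batches, but at least nothing has to leave the table.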
--Jeff Moden
Change is inevitable... Change for the better is not.
April 23, 2010 at 9:16 am
Why on earth would you be moving 10 million rows around? It sounds like the process needs to be reexamined more than the query.
Don't even ask ...
I've just started in this position and one of my jobs will be ... well, I think you get the idea ... 😉