May 8, 2007 at 7:29 am
Other than moving this to faster IO, partitioning it, or moving indexes to a different spindle, does anyone see a way to get a performance increase out of the following T-SQL? There are clustered primary keys on the WebLogID column and on urlStemID, as well as a non-clustered index on the logWritten column. N.B. performance is almost exactly the same, to the second, when using an inner join rather than an IN. Currently the DB size is 3.5GB, with about 13 million records in the Weblog table, on SQL Server Express (yes, I know I'm going to have to partition soon or upgrade since I will soon hit the 4GB limit, but this is non-production at the moment). According to the execution plan the bulk of the cost is in the clustered index scan of the Weblog table, and of that 85% is the IO. I guess I know the answer, but I was just wondering if anyone has other thoughts on getting this below the 22s it is taking to run for long date ranges, i.e. >6 months:
SELECT
    Year(logWritten) AS [Year],
    MONTH(logWritten) AS [Month],
    DAY(logWritten) AS [Day of Month],
    COUNT(WebLogID) AS 'Visits'
FROM Weblog
WHERE urlStemID IN
      (SELECT urlStemID FROM URLStem
       WHERE StemText LIKE @strSearch)
  AND logWritten BETWEEN @statDate AND @endDate
GROUP BY Year(logWritten), MONTH(logWritten), DAY(logWritten)
ORDER BY Year(logWritten), MONTH(logWritten), DAY(logWritten)
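For anyone reading along, the index layout described works out to something like this (the DDL below is an assumption for illustration, not the real schema; column types are guesses):

-- Assumed layout, not the actual DDL: clustered PK on URLStem.urlStemID,
-- clustered PK on Weblog.WebLogID, non-clustered index on logWritten.
CREATE TABLE dbo.URLStem (
    urlStemID INT NOT NULL PRIMARY KEY CLUSTERED,
    StemText  VARCHAR(900) NOT NULL
)
CREATE TABLE dbo.Weblog (
    WebLogID   INT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    urlStemID  INT NOT NULL REFERENCES dbo.URLStem (urlStemID),
    logWritten DATETIME NOT NULL
)
CREATE NONCLUSTERED INDEX IX_Weblog_logWritten ON dbo.Weblog (logWritten)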
May 8, 2007 at 7:52 am
Clustered index must be on logWritten column.
And use join anyway.
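Something like this, assuming the existing clustered index is the primary key on WebLogID (object names here are made up):

-- Assumes PK_Weblog is the current clustered primary key; names are illustrative.
ALTER TABLE dbo.Weblog DROP CONSTRAINT PK_Weblog
CREATE CLUSTERED INDEX CIX_Weblog_logWritten ON dbo.Weblog (logWritten)
ALTER TABLE dbo.Weblog ADD CONSTRAINT PK_Weblog PRIMARY KEY NONCLUSTERED (WebLogID)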
_____________
Code for TallyGenerator
May 8, 2007 at 7:52 am
SELECT
    Year(logWritten) AS [Year],
    MONTH(logWritten) AS [Month],
    DAY(logWritten) AS [Day of Month],
    COUNT(WebLogID) AS 'Visits'
FROM Weblog wl
JOIN URLStem us
  ON wl.urlStemID = us.urlStemID
WHERE StemText LIKE @strSearch
  AND logWritten BETWEEN @statDate AND @endDate
GROUP BY Year(logWritten), MONTH(logWritten), DAY(logWritten)
ORDER BY Year(logWritten), MONTH(logWritten), DAY(logWritten)
That should help a bit, but here are other things you could do.
Does logWritten have a time portion?
Is there a wildcard at the start of @strSearch, and is it needed?
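(A leading wildcard forces a scan, because the optimizer cannot seek into the middle of a string; a pattern anchored at the start is seekable. For example, with hypothetical patterns:)

-- Leading wildcard: the whole index on StemText must be scanned.
SELECT urlStemID FROM URLStem WHERE StemText LIKE '%.mp3'
-- Anchored at the start: an index seek on StemText is possible.
SELECT urlStemID FROM URLStem WHERE StemText LIKE '/podcast/%'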
May 8, 2007 at 8:02 am
Nope, that is exactly the same performance (as stated in my original post, I had already tried this with joins), although I did run your version just in case! Yes, logWritten is a timestamp of type datetime. Unfortunately @strSearch is a variable string which changes depending upon what the user is looking for stats on. The clustered scan on the URLStem table looking for the string has a 0% cost though; it is the clustered scan of the Weblog table (cost: 71%) and the hash match (right semi join from my code, or inner join from yours; cost: 26%) where the delays are. Thanks for the reply.
May 8, 2007 at 8:07 am
Indexes being rebuilt, with clustered on the time - will have performance info in a bit
May 8, 2007 at 4:31 pm
Still being rebuilt?
_____________
Code for TallyGenerator
May 8, 2007 at 5:24 pm
You'll never do better than an INDEX SCAN with all the queries written so far... not sure there's a better one, either. Might be able to tweak a couple of things, but I need to know what kind of values can be in @strSearch and StemText...
Also, please tell me that everything you have labeled as a date is a datetime datatype and everything you have labeled as an ID is an INT or BIGINT.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 8, 2007 at 6:02 pm
INDEX SCAN will be used every time you query for a range of values from a column not having clustered index on it.
BETWEEN means CLUSTERED INDEX.
End of story.
_____________
Code for TallyGenerator
May 9, 2007 at 12:41 am
Nah... you can get INDEX SEEKS out of non-clustered indexes... they just won't be CLUSTERED INDEX SEEKS. And (as you know, ol' friend), I don't think you'll get anywhere near an INDEX SEEK with the code as written so far, even with the correct index.
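For instance, a covering non-clustered index could give a seek on the date range without touching the cluster. A sketch only (the index name is made up):

-- logWritten is the seek key; urlStemID is carried at the leaf so no lookups
-- back to the cluster are needed (WebLogID rides along as the clustering key).
CREATE NONCLUSTERED INDEX IX_Weblog_logWritten_cover
    ON dbo.Weblog (logWritten)
    INCLUDE (urlStemID)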
--Jeff Moden
Change is inevitable... Change for the better is not.
May 9, 2007 at 8:05 am
If you just need to show on a day-by-day basis and do not need to break down into Year, Month, Day, then this might shave a bit off, but I'm not sure how much with 13 million rows; I only had 10,000 available to test with. Also you'll note I went back to an inner join, but I moved the predicate for the URLStem StemText into the join, out of the WHERE. Sometimes this can impact query performance on multi-join queries, so it might have some impact here.
SELECT
DATEADD(d,DATEDIFF(d,0,WB.logWritten),0) as [Date],
COUNT(WebLogID) as [Visits]
FROM
dbo.Weblog WB
INNER JOIN
dbo.URLStem URL
ON
WB.urlStemID = URL.urlStemID AND
URL.StemText LIKE @strSearch
WHERE
WB.logWritten BETWEEN @statDate AND @endDate
GROUP BY
DATEADD(d,DATEDIFF(d,0,WB.logWritten),0)
ORDER BY
DATEADD(d,DATEDIFF(d,0,WB.logWritten),0)
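(For anyone unfamiliar with the idiom: DATEDIFF(d, 0, x) counts whole days from the 0 date, 1900-01-01, and adding that count back to 0 lands on midnight of the same day, stripping the time portion. Quick check with an arbitrary value:)

-- Returns 2007-05-09 00:00:00.000
SELECT DATEADD(d, DATEDIFF(d, 0, '2007-05-09 17:42:11'), 0)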
May 9, 2007 at 8:43 am
1) IN and JOIN run in the same time because the optimizer is most likely rewriting the IN as a JOIN under the covers (show the estimated query plan for both to verify this).
2) You are actually doing THREE things that make a clustered index on logWritten better for this particular query: BETWEEN, GROUP BY and ORDER BY. If this is the main action performed against this table then even paying the overhead of having a uniquifier added onto the 8 bytes of the date column will be worth it (thus having a 12 byte clustering key size, which is carried over to all non-clustered indexes as well). Test and see, since it is a non-production environment. BTW, be prepared for a HUGE transaction log when you shift the clustered index. I would manually expand the tlog to 1.5 times the size of the table, then shrink it back down when indexing is complete (see the sketch after this list).
3) Did you remember to set the initial size of the database and tlog files? Did you set the growth rates appropriately? I would check for OS-level file fragmentation if you didn't. Also make sure your antivirus is set to ignore .mdf and .ldf files.
4) You must accept that at some point you are just asking for more data than your spindle/cpu can chew through in a few seconds. 🙂
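A sketch of the tlog pre-grow/shrink from point 2 (the database name, logical log file name, and sizes are all placeholders; check yours with sp_helpfile):

-- Placeholders throughout; substitute your own DB and logical file names.
ALTER DATABASE WeblogDB MODIFY FILE (NAME = WeblogDB_log, SIZE = 5GB)
-- ... rebuild the clustered index here ...
DBCC SHRINKFILE (WeblogDB_log, 500) -- target size in MB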
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
May 9, 2007 at 9:19 am
Will get around to checking a few of these ideas later in the week, mainly because the request for a clustered index on the date field put the DB over the 4GB limit for SQL Express and hence it failed; I'll have to move it to a server with a proper license! Yes, the date ranges have to be fully adjustable and broken down (when searching over a couple of days the performance is OK). File growth etc. was all sorted from the outset, I looked at the OS-level defrag (hardly anything), and the virus checker is ignoring it. The string search is in the form '%/podcast/%.mp3' or many other combinations depending upon the report; this side of things has next to 0 cost according to the execution plan (the URL strings are fully normalised, so nothing like the 13m records of the stems). I think the clustered index on the date range is the answer, but it will have to wait a few days. Thanks all for the advice.
May 9, 2007 at 12:18 pm
I always use 'with (nolock)' on my queries. It seems to speed things up a bit...
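Applied to the query in this thread it would look something like this (the hint goes on each table reference):

SELECT COUNT(WebLogID) FROM dbo.Weblog WITH (NOLOCK)
WHERE logWritten BETWEEN @statDate AND @endDate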
cheers,
Ben
Ben Sullins
bensullins.com
Beer is my primary key...
May 9, 2007 at 1:42 pm
But you run the risk of getting invalid data due to dirty reads.