July 5, 2009 at 1:46 pm
Here's my exact source(attached), and I used SSMA to copy the data up to SQL Server 2005 Developer Edition 64-bit with SP2. SSMA is SQL Server Migration Assistant. I set that product to specifically convert the decimal field to decimal(8,3). By the way, the final index took just over an hour to create, and the final db size is ~70 GIG.
I created the database on my C: drive (the only one I have at the moment - I use the 2nd half of the drive as my backup destination for opsys backups for now. The two partitions are 500 GB each. I used 16,384 MB as the creation size for the db file, and set the log file to max out at 2048 MB.
The code I've used to perform the various tasks, albeit a batch or portion thereof at a time, is here:
CREATE TABLE dbo.DISTANCES (
ZIP1 char(5),
ZIP2 char(5),
DISTANCE decimal(7,2)
)
DECLARE @START DateTime, @END DateTime, @DURATION decimal(12,3), @LOOPER int, @LOOPCHAR varchar(5)
SET @LOOPER = 0
SET @START = GETDATE()
PRINT 'STARTING DATE/TIME: ' + CAST(@START AS varchar(30))
WHILE (@LOOPER < 10000)
BEGIN
SET @LOOPCHAR = RIGHT('000' + CAST(@LOOPER AS varchar(4)),4) + '%'
PRINT 'LOOP START FOR #' + @LOOPCHAR
INSERT INTO dbo.DISTANCES (ZIP1, ZIP2, DISTANCE)
SELECT Z.[ZIP Code] AS ZIP1, Y.[ZIP Code] AS ZIP2,
CASE
WHEN (Z.Latitude = Y.Latitude) AND (Z.Longitude = Y.Longitude) THEN
CAST(0 AS decimal(7,2))
ELSE ROUND(69.1 * 180 / PI() * ACOS(
SIN(RADIANS(Z.Latitude)) * SIN(RADIANS(Y.Latitude)) +
COS(RADIANS(Z.Latitude)) * COS(RADIANS(Y.Latitude)) *
COS(RADIANS(Y.Longitude) - RADIANS(Z.Longitude))),2)
END AS DISTANCE
FROM dbo.[ZIP Codes] AS Z, dbo.[ZIP Codes] AS Y
WHERE Z.[ZIP Code] < Y.[ZIP Code] AND
Z.[ZIP Code] Like @LOOPCHAR
SET @LOOPER = @LOOPER + 1
PRINT 'LOOP END #' + @LOOPCHAR
CHECKPOINT
BACKUP LOG ZIP_CODES WITH TRUNCATE_ONLY
DBCC SHRINKFILE(ZIP_CODES_log)
PRINT 'LOG MAINTENANCE COMPLETED'
END
SET @END = GETDATE()
SET @DURATION = CAST(DATEDIFF(ms, @START, @END) AS decimal(12,3)) / 1000.
PRINT 'ENDING DATE/TIME: ' + CAST(@END AS varchar(30))
PRINT 'TOTAL DURATION WAS: ' + CAST(@DURATION AS varchar(30)) + ' SECONDS'
GO
ALTER TABLE dbo.DISTANCES
ADD DIST_CATEGORY smallint
CREATE NONCLUSTERED INDEX IX_ZIP1
ON dbo.DISTANCES(ZIP1 ASC)
WITH FILLFACTOR = 100
GO
CREATE NONCLUSTERED INDEX IX_ZIP2
ON dbo.DISTANCES(ZIP2 ASC)
WITH FILLFACTOR = 100
GO
DECLARE @START DateTime, @END DateTime, @DURATION decimal(12,3), @LOOPER int, @LOOPCHAR varchar(5)
SET @LOOPER = 0
SET @START = GETDATE()
PRINT 'STARTING DATE/TIME: ' + CAST(@START AS varchar(30))
WHILE (@LOOPER < 1000)
BEGIN
SET @LOOPCHAR = RIGHT('00' + CAST(@LOOPER AS varchar(3)),3) + '%'
PRINT 'LOOP START FOR #' + @LOOPCHAR
UPDATE dbo.DISTANCES
SET DIST_CATEGORY =
CASE WHEN DISTANCE <= 150 THEN 5 * (((CAST(ROUND(DISTANCE + .49,0) AS smallint) - 1) / 5) + 1)
WHEN DISTANCE <= 250 THEN 250
WHEN DISTANCE <= 500 THEN 500
WHEN DISTANCE <= 1000 THEN 1000
WHEN DISTANCE <= 1500 THEN 1500
WHEN DISTANCE <= 2000 THEN 2000
WHEN DISTANCE <= 2500 THEN 2500
WHEN DISTANCE <= 5000 THEN 5000
WHEN DISTANCE <= 7500 THEN 7500
WHEN DISTANCE <= 10000 THEN 10000
ELSE 10001
END
WHERE ZIP1 Like @LOOPCHAR
SET @LOOPER = @LOOPER + 1
PRINT 'LOOP END #' + @LOOPCHAR
CHECKPOINT
BACKUP LOG ZIP_CODES WITH TRUNCATE_ONLY
DBCC SHRINKFILE(ZIP_CODES_log)
PRINT 'LOG MAINTENANCE COMPLETED'
END
SET @END = GETDATE()
SET @DURATION = CAST(DATEDIFF(ms, @START, @END) AS decimal(12,3)) / 1000.
PRINT 'ENDING DATE/TIME: ' + CAST(@END AS varchar(30))
PRINT 'TOTAL DURATION WAS: ' + CAST(@DURATION AS varchar(30)) + ' SECONDS'
GO
CREATE NONCLUSTERED INDEX IX_DIST_CATEGORY
ON dbo.DISTANCES(DIST_CATEGORY ASC)
WITH FILLFACTOR = 100
GO
I'll look forward to whatever you come up with. What I'm most eager to learn is just what gets logged, and under what conditions.
Steve
(aka smunson)
:-):-):-)
Jeff Moden (7/5/2009)
Never mind, Steve... I found one that's a CSV file at
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)
July 5, 2009 at 2:15 pm
Just an FYI, the following query runs in as little as 0.04 seconds to as much as 0.16 seconds - always fast, and always sub-second, and that's a good thing.
DBCC FREESYSTEMCACHE ('ALL')
DECLARE @FIND_ZIP AS char(5), @START AS DateTime, @END AS DateTime, @DURATION AS decimal(8,3)
SET @START = GETDATE()
SET @FIND_ZIP = '60609'
SELECT ZIP1, ZIP2, DISTANCE
FROM dbo.DISTANCES
WHERE ZIP1 = @FIND_ZIP AND
DIST_CATEGORY <= 25
ORDER BY DISTANCE, ZIP2
SET @END = GETDATE()
SET @DURATION = DATEDIFF(ms, @START, @END) / 1000.
SELECT 'QUERY COMPLETED IN ' + CAST(@DURATION AS varchar(8)) + ' SECONDS' AS DURATION
Steve
(aka smunson)
:-):-):-)
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)
July 5, 2009 at 2:34 pm
UPDATE:
Well, actually, my last query wasn't quite right. I just realized that only checking ZIP1 provides only those zip codes larger than than the one being checked, as I had decided to cut the database size in half by only having unique records where the combination of the two zip codes is concerned. Here's the revised query:
DBCC FREESYSTEMCACHE ('ALL')
DECLARE @FIND_ZIP AS char(5), @START AS DateTime, @END AS DateTime, @DURATION AS decimal(8,3)
SET @START = GETDATE()
SET @FIND_ZIP = '68105'
SELECT @FIND_ZIP AS ZIP1, ZIP2, DISTANCE
FROM dbo.DISTANCES
WHERE ZIP1 = @FIND_ZIP AND
DIST_CATEGORY <= 25
UNION
SELECT @FIND_ZIP AS ZIP1, ZIP1 AS ZIP2, DISTANCE
FROM dbo.DISTANCES
WHERE ZIP2 = @FIND_ZIP AND
DIST_CATEGORY <= 25
ORDER BY DISTANCE, ZIP1, ZIP2
SET @END = GETDATE()
SET @DURATION = DATEDIFF(ms, @START, @END) / 1000.
SELECT @START AS STARTING, @END AS ENDING,
'QUERY COMPLETED IN ' + CAST(@DURATION AS varchar(8)) + ' SECONDS' AS DURATION
The performance is still sub-second, but can go up to .4 something seconds after the very first query, which ran in 1.7 seconds.
Steve
(aka smunson)
:-):-):-)
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)
July 5, 2009 at 4:46 pm
Just check Zip 2 because there will always be all zip codes in that column.
--Jeff Moden
Change is inevitable... Change for the better is not.
July 5, 2009 at 4:47 pm
I meant to ask... what are you doing this for? What's the business reason, I mean?
--Jeff Moden
Change is inevitable... Change for the better is not.
July 5, 2009 at 5:14 pm
Would you believe that it's entirely a learning process? I'm trying out things, finding out what kind of stuff happens when you have a huge database, and finding out just how well it can perform. Since I'm not working right now (much as I'd PREFER to be working... unemployment ROTS!), I have the time to learn, so the more I know about things I might encounter in the real world, the better. Thus, the only BUSINESS is that of education. I've already learned a great deal from this forum, and I plan to have that continue for a long time.
There are always so many ways to do something, and thus I need to be able to know what kinds of things might be worth trying.
Jeff Moden (7/5/2009)
I meant to ask... what are you doing this for? What's the business reason, I mean?
As to just checking ZIP2, that can't work because for any given pair of zip codes, there is exactly one record in the DISTANCES table. Thus if X is the ZIP I'm searching for, if I only check against ZIP1, I get all the pairs where ZIP2 is > X, and if I only check against ZIP2, I get the opposite result, where I get all the pairs where ZIP1 < X. I need both sets together, and putting an OR in a single query isn't going to perform worth beans - I let that fly once, and cancelled the query a minute and a half later, knowing I was on the wrong track.
Jeff Moden (7/5/2009)
Just check Zip 2 because there will always be all zip codes in that column.
Steve
(aka smunson)
:-):-):-)
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)
July 5, 2009 at 5:46 pm
Understood on the "business" of education... I do it for at least an hour a day. Kind of like practicing piano. Heh... I thought you had some extreme need for speed and thus the denormalization of the table.
On the Zip2 thing... I haven't done the test in a while, but I'm pretty sure that a simple OR will be resolved in a fashion similar to a UNION ALL. You could try that.
--Jeff Moden
Change is inevitable... Change for the better is not.
July 5, 2009 at 7:31 pm
On the "need for speed", that was more of a wish thing... I have an expensive system, and I have expectations for it - realistic in some cases, perhaps not in others - and in this case, it was more of a test drive to see what it can do. This kind of database could feed a mapping application that could grab a map from Yahoo, MapQuest, or Google, and plot the locations of every certain kind of business within a given radius of a typed in zip code. Of course, one would also need data on the business location zip codes and the associated addresses/phone numbers, but what the heck...
I'm attaching an Excel 2007 spreadsheet with the execution plans for the two ways to run the query, and I'm just not curious enough to find out how long the "OR" version would take to run, given the execution plan it generates.
Steve
(aka smunson)
:-):-):-)
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)
July 5, 2009 at 7:53 pm
smunson (7/5/2009)
I'm just not curious enough to find out how long the "OR" version would take to run, given the execution plan it generates.:-):-):-)
What? Unemployed... in search of knowledge... nothing for the computer to do while your sleeping... 😉
And remember, execution plans can lie like a rug.
--Jeff Moden
Change is inevitable... Change for the better is not.
July 5, 2009 at 8:35 pm
Unemployment is a tad too common these days. Anyway, the idea of an execution plan doing the "lie like a rug" thing has already been put to bed. Given my subsecond response to the UNION query, and the at least 1.5 minutes and probably an AWFUL LOT more for the OR query, that's my logic, but since I'm the paduan learner here, I'll be thorough and let it run tonight. I'll update in the morning. I expect it to be bad - bad to the core, as it were... (:-D:-D:-D)
Steve
(aka smunson)
:-D:-D:-D
Jeff Moden (7/5/2009)
smunson (7/5/2009)
I'm just not curious enough to find out how long the "OR" version would take to run, given the execution plan it generates.:-):-):-)
What? Unemployed... in search of knowledge... nothing for the computer to do while your sleeping... 😉
And remember, execution plans can lie like a rug.
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)
July 6, 2009 at 8:03 am
UPDATE:
The "OR" version of the query took 3 mins, 48.383 seconds. Way too long to be able to support any kind of website that might use this kind of database. The difference is a whopping 571 times longer than the worst case query of 0.4 seconds for the UNION version, and over 2,000 times longer than the more typical 0.1 seconds. I do have to admit I expected far worse, but I think this is bad enough on it's own merit.
What you said about the execution plan being able to "lie like a rug" is no doubt accurate, and when I look at the EstimatedExecutions columns for both plans, the UNION version sums to 415 while the OR version sums to over 21,000, for a difference of about 51 times. I would expect the difference in of another 40x to appear in the respecttive average actual cost of each execution, but I don't have the knowledge right now to be able to figure out if that's indeed the case.
This has been a most useful exercise in learning about large databases. I just wish I could learn more about exactly what goes into the log file, and when. Apparently, it sure can be one heck of a lot.
Steve
(aka smunson)
:-):-):-)
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)
July 8, 2009 at 7:32 pm
It would be interesting to see the complete code examples in the test harness you used.
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 12 posts - 16 through 26 (of 26 total)
You must be logged in to reply to this topic. Login to reply