Quick question on zip code searches: U.S. and Canada

Question

Post reply

Quick question on zip code searches: U.S. and Canada

matt6749

Ten Centuries

Points: 1204
More actions
June 23, 2011 at 11:57 pm

#242590

I just added 750,000 Canadian zip code records to my existing U.S. zip code table.
Query execution time for a U.S. zip code radius search: ~150 milliseconds
Query execution time for new table radius search with Canadian zip codes: ~1000 milliseconds
So adding all the Canadian zip codes makes the query take 6x longer. Is this what you would expect? Just checking. Thanks.

Viewing 15 posts - 1 through 15 (of 23 total)

You must be logged in to reply to this topic. Login to reply

Sriram.RM Hall of Fame Points: 3368 More actions · Answer 1

Did you re index the table after inserting the records? whats the difference in the execution plan?

[font="Times New Roman"]For better assistance in answering your questions
Click Here[/url][/font]

muthukkumaran Kaliyamoorthy SSCoach Points: 19026 More actions · Answer 2

matt6749 (6/23/2011)
I just added 750,000 Canadian zip code records to my existing U.S. zip code table.
Query execution time for a U.S. zip code search: ~150 milliseconds
Query execution time for new table with Canadian zip codes: ~1000 milliseconds
So adding all the Canadian zip codes makes the query take 6x longer. Is this what you would expect? Just checking. Thanks.

Read the gail's article added in my signature.

Muthukkumaran Kaliyamoorthy
https://www.sqlserverblogforum.com/

matt6749 Ten Centuries Points: 1204 More actions · Answer 3

RE: Gail's article, very good, but I don't think it's necessary for this general question.

matt6749 Ten Centuries Points: 1204 More actions · Answer 4

Sriram.RM (6/24/2011)
Did you re index the table after inserting the records? whats the difference in the execution plan?

Yes, rebuilt indexes.. My guess is that the increased query time makes sense because I just added another 750,000 records. ?

Sriram.RM Hall of Fame Points: 3368 More actions · Answer 5

Ideally Addition of 750000 records shouldn cause much problems.... Did you check the execution plan?

If you want any of us to analyze further into the issue i suggest you to follow the process mentioned in Gail's article as stated above

[font="Times New Roman"]For better assistance in answering your questions
Click Here[/url][/font]

matt6749 Ten Centuries Points: 1204 More actions · Answer 6

Sriram.RM (6/24/2011)
Ideally Addition of 750000 records shouldn cause much problems.... Did you check the execution plan?
If you want any of us to analyze further into the issue i suggest you to follow the process mentioned in Gail's article as stated above

Ok - I've attached the execution plan. Looks like the clustered index scan is most of the cost.. which is good I think..?

Jayanth_Kurup SSC-Insane Points: 22967 More actions · Answer 7

looking at the execution plan , the query 2 accounts for 76 % of your cost and a significant part of it is the index scan.

You might want to see how to make this plan more efficient by getting the optimizer to do an index seek , esp since your only filtering for 5000 approx.

For reasons why see below :-

http://blog.sqlauthority.com/2007/03/30/sql-server-index-seek-vs-index-scan-table-scan/

Jayanth Kurup[/url]

matt6749 Ten Centuries Points: 1204 More actions · Answer 8

Yes, I will look at the query plan more. However, my general question remains - a radius search with 44k zip records is going to be slower that a radius search with 750k zip records. There is obviously going to be some decrease in speed. My question is whether the query time increase from ~150ms to ~850ms is reasonable.

Jayanth_Kurup SSC-Insane Points: 22967 More actions · Answer 9

Form past experience NO , the data volume will result in slower query performance , but in no way are the proportional , i.e double the data <> 1/2 the speed.

Jayanth Kurup[/url]

matt6749 Ten Centuries Points: 1204 More actions · Answer 10

Jayanth_Kurup (6/24/2011)
Form past experience NO , the data volume will result in slower query performance , but in no way are the proportional , i.e double the data <> 1/2 the speed.

I see.. With this zip code radius search, though, there are a bunch of INNER JOINS with the zip code table, so (for me) it's pretty involved.

My data shows: zip table size increased 15x, query time increased 5.5x.

DECLARE @start_lat float

DECLARE @MaxNorthLat float

DECLARE @MaxSouthLat float

SELECT @start_lat = (SELECT Z.Latitude FROM dbo.ZipCodesTest Z WHERE Z.ZipCode = @zip)

SET @MaxNorthLat = @start_lat + (@radius*360/24859.82)

SET @MaxSouthLat = @start_lat - (@radius*360/24859.82)

SELECT * FROM

(

SELECT

TotalRows = COUNT(*) OVER(),

RowNum = ROW_NUMBER()OVER (ORDER BY dbo.fnDistance(z1.Latitude, z1.Longitude, z2.Latitude, z2.Longitude)),

P.ProviderID, P.OccupationID, P.LastName, P.FirstName, P.Designations, P.NumRecommendations, P.City, P.Zip, P.Phone, z1.ZipCode, z1.State, DistanceInMiles = dbo.fnDistance(z1.Latitude, z1.Longitude, z2.Latitude, z2.Longitude)

FROM dbo.Providers P WITH (NOLOCK)

INNER JOIN dbo.ZipCodesTest z1 ON z1.ZipCode = P.Zip

INNER JOIN dbo.ZipCodesTest z2 ON dbo.fnDistance(z1.Latitude, z1.Longitude, z2.Latitude, z2.Longitude) < @radius

AND z1.Latitude BETWEEN @MaxSouthLat AND @MaxNorthLat

AND (@zip IS NULL OR z2.ZipCode = @zip)

)

AS XYZ -- you need this AS XYZ

WHERE RowNum BETWEEN @startRow AND (@startRow + @rowsPerPage) - 1

ORDER BY RowNum ASC

Jeff Moden SSC Guru Points: 1004339 More actions · Answer 11

@matt6749,

Please post the code for fnDistance and then I'll tell you why there's such a time difference. I might also be able to tell you how to fix it, as well.

To summarize, if an INDEX SEEK is being used, then this increase in record volumn should have virtually no impact on performance. Since you have guaranteed that nothing less than an INDEX SCAN will be involved, you have also guaranteed the decrease in performance by increasing the volume.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

matt6749 Ten Centuries Points: 1204 More actions · Answer 12

Jeff Moden (6/24/2011)
@matt6749,
Please post the code for fnDistance and then I'll tell you why there's such a time difference. I might also be able to tell you how to fix it, as well.
To summarize, if an INDEX SEEK is being used, then this increase in record volumn should have virtually no impact on performance. Since you have guaranteed that nothing less than an INDEX SCAN will be involved, you have also guaranteed the decrease in performance by increasing the volume.

Thanks Jeff. I can't seem to be able to figure out how to make it use an Index Seek rather than Scan.

Here is the function:

FUNCTION [dbo].[fnDistance]

(

-- returns distance when passed two points, latitude and longitude

@Lat1 float,

@Long1 float,

@Lat2 float,

@Long2 float

)

RETURNS float

AS

BEGIN

DECLARE @Temp FLOAT

SET @Temp = SIN(@Lat1/57.2957795130823) * SIN(@Lat2/57.2957795130823) + COS(@Lat1/57.2957795130823)

* COS(@Lat2/57.2957795130823) * COS(@Long2/57.2957795130823 - @Long1/57.2957795130823)

IF @Temp > 1

SET @Temp = 1

ELSE IF @Temp < -1

SET @Temp = -1

RETURN (3958.75586574 * ACOS(@Temp) )

END

Jeff Moden SSC Guru Points: 1004339 More actions · Answer 13

Just so you know, I'm working on this and I've solved the problem of the Clustered Index Scan. I've just got some more to do on it.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004339 More actions · Answer 14

Ok... here's the way I'd do this job so that I can get the best speed possible out of it.

First, you have to dump that scalar function for Distance and replace it with a nice, high performance iTVF (inline Table Valued Function)...

CREATE FUNCTION dbo.fnDistance

--===== Returns distance when passed two points, latitude and longitude

-- Converted to a high performance iTVF by Jeff Moden - 24 Jun 2011

(

@Lat1 FLOAT,

@Lon1 FLOAT,

@Lat2 FLOAT,

@Lon2 FLOAT

)

RETURNS TABLE

AS

RETURN WITH

cteFirstPass AS

( --=== Calculate the GREAT CIRCLE distance in Radians using the Haversine Formula

SELECT fp1Temp = (SIN(RADIANS(@Lat1)) * SIN(RADIANS(@Lat2)))

+ (COS(RADIANS(@Lat1)) * COS(RADIANS(@Lat2)))

* COS(RADIANS(@Lon2) - RADIANS(@Lon1))

),

cteSecondPass AS

( --=== This makes a correction not accounted for in the Haversine Formula above

SELECT fp2Temp = CASE WHEN fp1Temp > 1 THEN 1 WHEN fp1Temp < -1 THEN -1 ELSE fp1Temp END

FROM cteFirstPass

) --=== This does the final calculation for distance

-- where 3958.75586574 is the Radius of the Earth in miles (as some see it)

SELECT Distance = 3958.75586574 * ACOS(fp2Temp)

FROM cteSecondPass

GO

The next thing is indexing. I don't know what the indexes are on your ZipCode table because I'm using SQL Server 2005 and couldn't open your 2008 execution plan. With that in mind, I strongly recommend the following 2 indexes. If your clustered index isn't the one I recommend, then drop it for mine. Of course, you'll need to change the table name and columns to match yours. I didn't want to change my table to match yours.

CREATE CLUSTERED INDEX [IX_ZipCodes_Lat_Lon] ON [dbo].[ZipCodes]

(

[Lat] ASC,

[Lon] ASC

)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

GO

CREATE NONCLUSTERED INDEX [IX_ZipCodes_ZipCode] ON [dbo].[ZipCodes]

(

[ZipCode] ASC

)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

And, for those that may be wondering... NO! I wasn't the one who created that table-name as a Plural. :sick:

Up next is the code to solve most of the rest of the problem and it does so very, very quickly. The details are all in the code except for one. You might not want to join this with another table because I think it'll make for some rather bad performing code. You can try it but I'd direct the output of this code to a Temp Table (using SELECT/INTO) and then join that to the other tables that you want.

--=================================================================================================

-- This section simulates parameters for a stored proc or iTVF

--=================================================================================================

--===== Declare the parameters for a stored proc or iTVF

DECLARE @pZip CHAR(5), --Change the data-type to match your table

@pRadius INT

;

--===== Preset the parameters as if they were passed to a stored proc or iTVF

SELECT @pZip = '48326',

@pRadius = 15 --Miles assumed because of the 3958.75586574 constant in the function.

; --As a sidebar, you really should document these things in the code.

--=================================================================================================

-- Setup and presets

--=================================================================================================

--===== Declare some obviously-named local variables

DECLARE @StartLat FLOAT,

@StartLon FLOAT,

@MaxNorthLat FLOAT,

@MaxSouthLat FLOAT,

@DegreesPerMile FLOAT

;

--===== Precalculate the Lat/Lon for the ZipCode we're looking for and then

-- figure out the Latitude optimization from that using the search radius

SELECT @DegreesPerMile = 360/24859.82, --24859.82 = Circumference of the Earth

@StartLat = Lat, --Change Lat to the same as in your table

@StartLon = Lon, --Change Lon to the same as in your table

@MaxNorthLat = Lat + (@pRadius*@DegreesPerMile), --Change Lat to the same as in your table

@MaxSouthLat = Lat - (@pRadius*@DegreesPerMile) --Change Lon to the same as in your table

FROM dbo.ZipCodes --Change this to your table name

WHERE ZipCode = @pZip

;

--=================================================================================================

-- This finds all the Zip Codes and actual distances between the provided Zip Code and the

-- Zip Codes in the Zip Code table. Most people make the mistake of trying to calculate and

-- use the Distance in the Join Criteria which guarantees the best you'll do is a Clustered

-- Index Scan which, in truth, is nothing more than a full table scan. I did it the other

-- way around so I could get the Clustered Index Seek on the indexes I built. That's another

-- story. The Clustered Index should NOT be on the ZipCode... it should be on Lat/Lon and

-- then have a NonClustered Index on ZipCode. It makes for Index Seeks on just about any

-- query you could run.

--

-- If you have the indexes I recommended in place, this is NASTY fast.

--=================================================================================================

--===== Using the Max/Min Latitude optimization, return a list of ZipCodes within @Radius of the

-- given ZipCode (@pZip)

WITH

cteFindDistances AS

( --=== Find the optimized band of ZipCodes and calculate their distance using the hi-speed iTVF.

SELECT d.Distance, z.ZipCode

FROM dbo.ZipCodes z

CROSS APPLY dbo.fnDistance(@StartLat, @StartLon, z.Lat, z.Lon) d

WHERE z.Lat >= @MaxSouthLat

AND z.Lat <= @MaxNorthLat

) --=== Return only those within the desired radius.

SELECT Distance, ZipCode

FROM cteFindDistances

WHERE Distance <= @pRadius

ORDER BY Distance

;

That works on all US 5 digit ZipCodes (43,602 in this version) in 36 milliseconds on a 9 year old, single 1.8Ghz CPU with just 1 Gig of RAM. Imagine how fast it'll run on your server.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)