May 7, 2010 at 2:14 pm
We have a proposal to build a lookup table that contains Zip Code pairs and the distance between them. Several attempts to calculate the distances (using lat and long) on the fly for reporting have caused too much of a wait (per the business unit) for the end user. My fear is that creating this lookup table of 180 million rows or so will not help the situation and will use up more resources than it's worth.
The table would be simple: Zip1, Zip2, Distance.
Some of the more common uses would be to find all values of Zip2 within 10 miles of a given Zip1, etc.
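To make it concrete, here's roughly what I have in mind (a sketch only; data types and the sample zip are just placeholders):

CREATE TABLE dbo.ZipDistance (
    Zip1     CHAR(5)      NOT NULL,
    Zip2     CHAR(5)      NOT NULL,
    Distance DECIMAL(9,2) NOT NULL,  -- statute miles
    CONSTRAINT PK_ZipDistance PRIMARY KEY CLUSTERED (Zip1, Zip2)
);

-- A typical lookup: all zip codes within 10 miles of a given zip
SELECT Zip2, Distance
FROM dbo.ZipDistance
WHERE Zip1 = '44101'
  AND Distance <= 10;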
We have thought of subdividing zip codes by geography, but this will always lead to exclusions across boundaries. We've tried writing the lookup using CLR but haven't seen any performance increase.
Thanks for any input or pointers on where to look for this sort of situation. I have this sinking feeling I'm missing something obvious but I've been staring at it too long.
Thanks
May 7, 2010 at 2:25 pm
I strongly vote against that proposal!
Alternatively, I would recommend tuning the on-the-fly solution.
Just because the current solution has performance issues does not imply in any way that there is no other (faster) solution available.
I don't think I can be of any help in terms of the "180Mill solution" but I will try to help you speed up the on-the-fly calculation. 😉
Please provide the table def (including indexes and sample data) and a description of the reporting scenario. (If the report always holds all zip codes and the distance between each of them, I might question the report as well, though...)
May 7, 2010 at 2:31 pm
Oh, I need to add one more question: how accurately do you need to calculate the distance? Or, in other words: what is the largest distance you're dealing with?
May 7, 2010 at 2:43 pm
Cory Blythe (5/7/2010)
The table would be simple: Zip1, Zip2, Distance. Some of the more common uses would be to find all values of Zip2 within 10 miles of a given Zip1, etc.
Once the table gets populated and there is an index on (Zip1, Distance), I do not see any particular issue.
Does the business requirement ask you to resolve distances larger than 10 miles? 50 miles? 250 miles? 1,000 miles?
If not, the table will be much smaller, i.e. storing only the rows within NN miles of each zip code.
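Something like this is what I mean (a sketch against the table proposed above; the 50-mile cap is just an example of whatever maximum the business agrees to):

-- Supports "all Zip2 within N miles of Zip1" as a narrow range seek
CREATE NONCLUSTERED INDEX IX_ZipDistance_Zip1_Distance
    ON dbo.ZipDistance (Zip1, Distance);

-- Enforce the cap at load time so only nearby pairs ever exist,
-- e.g. INSERT ... WHERE Distance <= 50

SELECT Zip2, Distance
FROM dbo.ZipDistance
WHERE Zip1 = '44101'
  AND Distance <= 10
ORDER BY Distance;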
_____________________________________
Pablo (Paul) Berzukov
Author of Understanding Database Administration available at Amazon and other bookstores.
Disclaimer: Advice is provided to the best of my knowledge, but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions in a non-production test environment, the advisor should not be held liable or responsible for any actions taken based on the given advice.
May 7, 2010 at 3:29 pm
PaulB-TheOneAndOnly (5/7/2010)
Cory Blythe (5/7/2010)
The table would be simple: Zip1, Zip2, Distance. Some of the more common uses would be to find all values of Zip2 within 10 miles of a given Zip1, etc.
Once the table gets populated and there is an index on (Zip1, Distance), I do not see any particular issue.
Does the business requirement ask you to resolve distances larger than 10 miles? 50 miles? 250 miles? 1,000 miles?
If not, the table will be much smaller, i.e. storing only the rows within NN miles of each zip code.
That goes in the direction I was thinking of - kind of, at least (in terms of reducing the number of rows to deal with).
I would have used the original table: based on the zip code and the required distance, I'd use a linear equation to limit the number of possible matches (basically, building a square with a side length of 2 x (distance + "safety"), centered on the zip code).
The result set would be stored in an indexed temp table, and then I'd perform an approximate distance calculation, or use a CLR function, to find the zips within the circle (or on the sphere).
Advantage: I wouldn't have to add rows to the distance table if the limit exceeds the distances currently stored. Therefore, this solution is more robust.
Disadvantage: due to the approximate distance calculation, accuracy decreases as the distance increases. And it's most probably slower than the "populated table" solution, as long as the distance in question is covered by that table.
Depending on the business requirement, I'd probably use a combination of both methods: use the table data if they exist, and if not, reduce the number of zip codes used for the calculation. That would allow deciding later on whether the table needs to get expanded or whether the request was simply a typo (e.g. 10,000 = ten thousand miles vs. 10.000 = ten miles)...
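A rough sketch of the two-step idea (a dbo.ZipCode table with Zip/Latitude/Longitude columns is assumed; 69 statute miles per degree and the COS() adjustment for longitude are the usual flat-earth approximations, and 1.2 is an arbitrary "safety" factor):

DECLARE @Miles FLOAT = 10, @Lat FLOAT, @Lon FLOAT;
SELECT @Lat = Latitude, @Lon = Longitude
FROM dbo.ZipCode
WHERE Zip = '44101';

-- Step 1: cheap, sargable square prefilter with a safety margin
DECLARE @DegLat FLOAT = (@Miles * 1.2) / 69.0;                        -- ~69 miles per degree of latitude
DECLARE @DegLon FLOAT = (@Miles * 1.2) / (69.0 * COS(RADIANS(@Lat))); -- degrees of longitude shrink with latitude

SELECT Zip, Latitude, Longitude
INTO #Candidates
FROM dbo.ZipCode
WHERE Latitude  BETWEEN @Lat - @DegLat AND @Lat + @DegLat
  AND Longitude BETWEEN @Lon - @DegLon AND @Lon + @DegLon;

CREATE CLUSTERED INDEX IX_Candidates ON #Candidates (Zip);

-- Step 2: approximate (flat-earth) distance on the few survivors only
SELECT Zip
FROM #Candidates
WHERE SQRT(SQUARE((Latitude  - @Lat) * 69.0)
         + SQUARE((Longitude - @Lon) * 69.0 * COS(RADIANS(@Lat)))) <= @Miles;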
May 8, 2010 at 11:32 am
Cory Blythe (5/7/2010)
Some of the more common uses would be to find all values of Zip2 within 10 miles of a given Zip1, etc.
One of the very useful performance tweaks here is to simply calculate the Min/Max Lat/Lon from the desired distance and then apply the radial distance formula to the much more limited set that provides, in order to "hone in". Some folks wisely limit the Min/Max to Lat only, because lines of Lon are not parallel and require a fair bit more calculation as you approach the poles. In other words, there are relatively few zip codes in a single 20-mile (10-mile radius) swath cut across the United States, which greatly reduces the number of zip codes you have to look up.
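Here's a sketch of that idea (column names are assumed; the radial check uses the haversine formula with 3959 as the Earth's mean radius in miles):

DECLARE @Miles FLOAT = 10, @Lat FLOAT, @Lon FLOAT;
SELECT @Lat = Latitude, @Lon = Longitude
FROM dbo.ZipCode
WHERE Zip = '10001';

SELECT z.Zip
FROM dbo.ZipCode AS z
WHERE z.Latitude BETWEEN @Lat - @Miles / 69.0 AND @Lat + @Miles / 69.0  -- the cheap swath cut
  AND 3959.0 * 2 * ASIN(SQRT(                                          -- exact radial check on the survivors
        SQUARE(SIN(RADIANS(z.Latitude - @Lat) / 2))
      + COS(RADIANS(@Lat)) * COS(RADIANS(z.Latitude))
      * SQUARE(SIN(RADIANS(z.Longitude - @Lon) / 2)))) <= @Miles;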
--Jeff Moden
Change is inevitable... Change for the better is not.
May 18, 2010 at 1:24 pm
I just got back to this project after being pulled off for one of many fire drills.
I'd like to say thanks for all the great suggestions, questions, and input.
By asking the business unit to define their maximum distance (to reduce the table size), and by creating the suggested index after the table was populated, we've got a viable solution with a response-time improvement of a factor of 10.
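For anyone who lands here later, the shape of what we ended up with is roughly this (a sketch; the zip/lat/long source table is assumed, and 50 miles stands in for whatever maximum your business unit agrees to):

INSERT INTO dbo.ZipDistance (Zip1, Zip2, Distance)
SELECT a.Zip, b.Zip, d.Miles
FROM dbo.ZipCode a
JOIN dbo.ZipCode b
  ON b.Zip <> a.Zip
 AND b.Latitude BETWEEN a.Latitude - 50.0 / 69.0 AND a.Latitude + 50.0 / 69.0  -- latitude band prefilter
CROSS APPLY (SELECT 3959.0 * 2 * ASIN(SQRT(                                    -- haversine, miles
                 SQUARE(SIN(RADIANS(b.Latitude - a.Latitude) / 2))
               + COS(RADIANS(a.Latitude)) * COS(RADIANS(b.Latitude))
               * SQUARE(SIN(RADIANS(b.Longitude - a.Longitude) / 2)))) AS Miles) d
WHERE d.Miles <= 50;  -- the business-defined cap is what keeps the table small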
Thanks again for all the great input.
May 19, 2010 at 3:11 pm
One factoid, and two (and a half) options.
The first 3 digits of a zip code are assigned to a specific geographic area (generally). The size of the area depends on the density of zip codes: e.g. New York, NY will have several 3-digit zones, whereas Alaska likely has only one.
If you went crazy, you could create 1,000 partitions, which could improve performance greatly.
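A sketch of that partitioning idea (only the first few boundaries are shown; a real script would generate all of them, and note that table partitioning requires Enterprise Edition):

CREATE PARTITION FUNCTION pfZipPrefix (CHAR(3))
    AS RANGE RIGHT FOR VALUES ('001', '002', '003' /* ...continue through '999'... */);

CREATE PARTITION SCHEME psZipPrefix
    AS PARTITION pfZipPrefix ALL TO ([PRIMARY]);

CREATE TABLE dbo.ZipDistancePartitioned (
    Zip1       CHAR(5)      NOT NULL,
    Zip2       CHAR(5)      NOT NULL,
    Distance   DECIMAL(9,2) NOT NULL,
    Zip1Prefix AS CAST(LEFT(Zip1, 3) AS CHAR(3)) PERSISTED NOT NULL  -- the 3-digit zone
) ON psZipPrefix (Zip1Prefix);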
(Pre comment)
You should only be storing the pairs that will match your search criteria.
Storing the distance between Seattle and Washington, D.C. does not make sense, as you will never hit that row. (Assumption)
(Solution 1)
You can get a significant improvement by creating a reduced map: e.g. all suburbs in Cleveland are connected to one point (I would look at using 10% of the total distance).
Then you store the distance between pairs of those points (e.g. Cleveland to Chicago), and only deal with the ones that are less than your search value (+/- 20%).
You then calculate, based on lat/long, the distance between the individual zip codes.
This is basically the same way that Google Maps (etc.) directions work: get you out of your house to a bigger street, continue until you are on a big enough street to get near your destination, and then reverse the process.
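A sketch of how that reduced map might be laid out (all object names here are mine, purely illustrative):

-- Each zip is assigned to a hub point (e.g. the metro center it belongs to)
CREATE TABLE dbo.ZipHub (
    Zip        CHAR(5)      NOT NULL PRIMARY KEY,
    HubId      INT          NOT NULL,
    MilesToHub DECIMAL(9,2) NOT NULL  -- the short local leg
);

-- Only hub-to-hub distances are precomputed (far fewer than zip-to-zip)
CREATE TABLE dbo.HubDistance (
    Hub1     INT          NOT NULL,
    Hub2     INT          NOT NULL,
    Distance DECIMAL(9,2) NOT NULL,
    CONSTRAINT PK_HubDistance PRIMARY KEY (Hub1, Hub2)
);

-- Approximate zip-to-zip distance = local leg + hub leg + local leg
SELECT a.Zip, b.Zip,
       a.MilesToHub + hd.Distance + b.MilesToHub AS ApproxMiles
FROM dbo.ZipHub a
JOIN dbo.HubDistance hd ON hd.Hub1 = a.HubId
JOIN dbo.ZipHub b       ON b.HubId = hd.Hub2
WHERE a.Zip = '44101';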
(Solution 2) Another option to reduce the size is to precalculate which zip codes are within each range and only store (ZipA, ZipB) pairs for each cutoff.
So you have:
Table A-B(x) for 10 Mi
A,B1
A,B2
Table A-B(x) for 100 Mi
etc.
Note that all of these tables will perform better; however, the 10 Mi table will perform better than the 100 Mi table, because of fewer matches.
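In table terms, one of those cutoff tables might look like this (a sketch, sourcing from a full pair table like the one discussed earlier):

-- One narrow pair table per cutoff; no distance column needed
CREATE TABLE dbo.ZipWithin10Mi (
    ZipA CHAR(5) NOT NULL,
    ZipB CHAR(5) NOT NULL,
    CONSTRAINT PK_ZipWithin10Mi PRIMARY KEY CLUSTERED (ZipA, ZipB)
);

INSERT INTO dbo.ZipWithin10Mi (ZipA, ZipB)
SELECT Zip1, Zip2
FROM dbo.ZipDistance
WHERE Distance <= 10;

-- The lookup becomes a pure index seek with no range predicate
SELECT ZipB
FROM dbo.ZipWithin10Mi
WHERE ZipA = '44101';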
(Solution 2.5)
This is one of the rare times when de-normalization can be a good thing. Because zip codes are always formatted the same, you can consider violating a fundamental law and letting a tuple be multivalued...
If you store only one row for each source zipcode, and a list of all destination zips, the worst your table will have is 100,000 rows.
The processing time to split that list back up is small, and overall this is likely significantly better than keeping a purely normalized table.
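A sketch of the denormalized shape (the splitter is the catch: STRING_SPLIT only exists in SQL Server 2016+, so on older versions you'd substitute your own split function):

CREATE TABLE dbo.ZipNeighbors (
    Zip          CHAR(5)      NOT NULL PRIMARY KEY,
    NeighborList VARCHAR(MAX) NOT NULL  -- comma-separated zips within the cutoff
);

-- Splitting the list back out at query time
SELECT s.value AS NeighborZip
FROM dbo.ZipNeighbors zn
CROSS APPLY STRING_SPLIT(zn.NeighborList, ',') s
WHERE zn.Zip = '44101';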
May 19, 2010 at 3:43 pm
If this is something you are still working on, there is a fairly extensive discussion of the techniques for defining the square around the circle on this thread:
Great Circle Distance Function - Haversine Formula