May 17, 2005 at 11:16 am
I have a table of about 30k employees. For each employee I have their ZIP and ZIP+4. I have a second table (Ranged) that has a ZIP, ZIP+4_Start, ZIP+4_Stop and Congressional District. Every employee needs to be assigned a Congressional district from the Ranged table. The employee's district is determined by the two ZIP's matching and the employee's ZIP+4 being between ZIP+4_Start and ZIP+4_Stop in the Ranged table. I've been using this query:
UPDATE dbo.EMPLOYEE
SET CongDist = (Select CongDist from dbo.Ranged
Where dbo.Employee.hzip = dbo.Ranged.ZIP AND
(dbo.Employee.hzip4 between dbo.Ranged.Start AND
dbo.Ranged.Stop))
From dbo.Employee, dbo.Ranged
But it runs like a dog. The execution plan shows all sorts of index scans, hashes, aggregates, etc. Much more complex than I would have thought.
The Range table has millions of rows (1 for every ZIP/ZIP+4 to District combination). I put a clustered index on the Ranged table using the ZIP, ZIP+4_Start and ZIP+4_Stop. That helped a bit, but it's still very slow.
Can anyone think of a better approach (this one was my best shot, it's better than the CURSOR my predecessor was using to run through the Employees table and lookup each employee zip etc in the Ranged table. Yeesh!)
TIA
May 17, 2005 at 11:29 am
You could try this:
UPDATE dbo.EMPLOYEE SET
CongDist = dbo.Ranged.CongDist
FROM dbo.Ranged
INNER JOIN dbo.Employee ON( dbo.Employee.hzip = dbo.Ranged.ZIP
AND( dbo.Employee.hzip4 BETWEEN dbo.Ranged.Start AND dbo.Ranged.Stop)
INDEX = NameOfYourClusteredIndex
I wasn't born stupid - I had to study.
May 17, 2005 at 11:39 am
Do you ALWAYS have to perform this on the entire table? IF not you could shorten the amount of employees by using:
UPDATE E1 SET CongDist = R1.CongDist
FROM dbo.Ranged R1
INNER JOIN (SELECT hzip, hzip4
FROM dbo.Employee
WHERE CongDist IS NULL) E1
ON E1.hzip = R1.ZIP
AND E1.hzip4 BETWEEN R1.Start R1.Stop
OR
UPDATE E1 SET CongDist = R1.CongDist
FROM (SELECT hzip, hzip4
FROM dbo.Employee
WHERE CongDist IS NULL) E1
INNER JOIN dbo.Ranged R1
ON E1.hzip = R1.ZIP
AND E1.hzip4 BETWEEN R1.Start R1.Stop
Whichever table has the least rows should be the FROM (at least that is what I was told once)....
Good Hunting!
AJ Ahrens
webmaster@kritter.net
May 17, 2005 at 12:02 pm
Yup. The table is periodically rebuilt (i.e. for congressional redistricting) and then we need to run it on the whole table.
I tried:
UPDATE dbo.EMPLOYEE SET
CongDist = dbo.Ranged.CongDist
FROM dbo.Ranged
INNER JOIN dbo.Employee ON( dbo.Employee.hzip = dbo.Ranged.ZIP
AND( dbo.Employee.hzip4 BETWEEN dbo.Ranged.Start AND dbo.Ranged.Stop)
INDEX = IX_Zip
and
UPDATE dbo.EMPLOYEE SET
CongDist = dbo.Ranged.CongDist
FROM dbo.Ranged
INNER JOIN dbo.Employee ON( dbo.Employee.hzip = dbo.Ranged.ZIP
AND( dbo.Employee.hzip4 BETWEEN dbo.Ranged.Start AND dbo.Ranged.Stop)
INDEX = IX_Zip )
and
UPDATE dbo.EMPLOYEE SET
CongDist = dbo.Ranged.CongDist
FROM dbo.Ranged
INNER JOIN dbo.Employee ON( dbo.Employee.hzip = dbo.Ranged.ZIP
AND( dbo.Employee.hzip4 BETWEEN dbo.Ranged.Start AND dbo.Ranged.Stop))
INDEX = IX_Zip
Got Server: Msg 156, Level 15, State 1, Line 6
Incorrect syntax near the keyword 'INDEX'.
on every one of them...
May 17, 2005 at 12:05 pm
Try not using the index hint first to see if the performance is better...
May 17, 2005 at 12:26 pm
Definately try it without the Index hint. You could also try putting parenthesis around the (INDEX = IX_Zip).
I wasn't born stupid - I had to study.
May 17, 2005 at 12:34 pm
Three seconds faster than the original query...
May 17, 2005 at 12:37 pm
How much time does it take now?
May 17, 2005 at 12:41 pm
1:12 (old) vs 1:09 (new). It doesn't seem like it's a lot but I have to run a total of 9 queries just like it for different target tables all about the same size. The total job runs 10+ minutes. I was hoping to get it down by at least a few minutes.
I could probably speed it up by playing with INDEX and NOLOCK hints if I could figure out the syntax, but the execution plan shows it's using the IX_Zip index (so why put in a hint) and if I turn off locking I could screw up users of the application.
If that's what it is then that's what it is...
May 17, 2005 at 1:04 pm
1 min to update 1m rows doesn't seem like a long time to me but I've never really worked with tables that big. Maybe you could merge some of the updates so that you only have to run 4-5 queries instead of 10.
May 17, 2005 at 1:07 pm
Actually it's 1 min to update 30k rows using the 1+M row table as the source for the Update command. Looking at it I might be able to combine a couple of the queries. I'll try that and see if it extends the time by much.
May 17, 2005 at 1:13 pm
It sounds like you do not need the Index hint. Do you have and index on Employee.hzip?
Between can be a difficulty especially since a zip code cannot be numeric. But..., since you are already using an equality on the zip, you may want to do a sub-select (maybe an INNER JOIN within parenthesis) on just the longer zip code records which match the Ranged.Zip. Once you have these records isolated, it should may speed up the BETWEEN.
You might want to try playing with that kind of viewpoint as well...
UPDATE dbo.Employee SET
CongDist = R.CongDist
FROM dbo.Ranged R
INNER JOIN( SELECT R2.Start, R2.Stop FROM dbo.Ranged R2 WHERE dbo.Employee.hzip = R2.Zip) Range
WHERE dbo.Employee.hzip4 BETWEEN Range.Start AND Range.Stop
I wasn't born stupid - I had to study.
May 18, 2005 at 3:06 am
Farrell zip codes are numbers although their function is nominative rather than relational. They are used to descripe areas covered by a Post Office. The first number represents a group of states, the second and third umbers define a region within those states and the 4th and 5th number represent more specific areas with in that region. The zip + 4 is a further division of the area described by the first 5 digits. A 9 digit (zip + 4) code is a subset of a 5 digit zip code.
You will not need 2 col in your db one for 5 digit zip codes and one for 9 digit zip codes as you can always extract the first 5 digits in you want to describe a broader geographical area.
You should be able to use the coding schema of the post office to refine your second table. For instance a congressional distirct would cover all areas served by zip codes where the first three numbers are 322 (Northeast Florida).
Mike
May 18, 2005 at 4:20 am
Another thought on zip codes is that since they are nominative you can not assume that consective zip codes are geographically connected. If a new post office were to open next door to me then my zip could change from 32207 to 32277 which is the next non used number in the Geographic area described by the zip code 322 (North east florida). Trying to determine the congressional district for an employee by zip codes which can change be changed by the Post Office with no assurance that the new code will not split a congressional distirct may not be sound. Additionally the boundries of a congressional district can be changed. If I wanted to know the congressional district of an employee Added by edit. Its always bad form to not offer a solution when commenting on what could be flaws in the current process.
Mike
May 18, 2005 at 10:44 am
Hopefully to close the debate on the soundness of determining Congressional districts from Zip Codes, we are provided a database of Zip Code to District matching by a data vendor. They provide quarterly updates (or more frequently if something is radically changed) to ensure the mapping is accurate.
No, I don't really need to keep the ZIP and ZIP+4 in separate columns, but that's the way the vendor's data comes. THey could have used:
Start End District
061040000 061040139 05
But instead chose to go with:
ZIP Start End District
06104 0000 0000 05
It seems to me that it is more efficient for me to store my ZIP+4 in a separate column for my Employees, especially since our HR department doesn't collect it so I have to use another software package to generate it from their address (we have more problems with that than with ZIP->District). If I stored the employee's ZIP as 061040130 then I'd have to use RIGHT(ZIP,4) everywhere I wanted to compare to the vendor's Ranged table and that would slow down the query.
I suppose I could restructure the vendor's table as part of my DTS job to load it, but why bother? I don't think it would significantly improve the performance.
Viewing 15 posts - 1 through 15 (of 22 total)
You must be logged in to reply to this topic. Login to reply