Better way to do this?

Question

Post reply

Better way to do this?

rschaeferhig

SSCrazy Eights

Points: 8776
More actions
May 17, 2005 at 11:16 am

#86387

I have a table of about 30k employees. For each employee I have their ZIP and ZIP+4. I have a second table (Ranged) that has a ZIP, ZIP+4_Start, ZIP+4_Stop and Congressional District. Every employee needs to be assigned a Congressional district from the Ranged table. The employee's district is determined by the two ZIP's matching and the employee's ZIP+4 being between ZIP+4_Start and ZIP+4_Stop in the Ranged table. I've been using this query:
UPDATE dbo.EMPLOYEE
SET CongDist = (Select CongDist from dbo.Ranged
    Where dbo.Employee.hzip = dbo.Ranged.ZIP AND
          (dbo.Employee.hzip4 between dbo.Ranged.Start AND
           dbo.Ranged.Stop))
From dbo.Employee, dbo.Ranged
But it runs like a dog. The execution plan shows all sorts of index scans, hashes, aggregates, etc. Much more complex than I would have thought.
The Range table has millions of rows (1 for every ZIP/ZIP+4 to District combination). I put a clustered index on the Ranged table using the ZIP, ZIP+4_Start and ZIP+4_Stop. That helped a bit, but it's still very slow.
Can anyone think of a better approach (this one was my best shot, it's better than the CURSOR my predecessor was using to run through the Employees table and lookup each employee zip etc in the Ranged table. Yeesh!)
TIA

Viewing 15 posts - 1 through 15 (of 22 total)

You must be logged in to reply to this topic. Login to reply

Farrell Keough SSCoach Points: 17414 More actions · Answer 1

You could try this:

UPDATE dbo.EMPLOYEE SET

CongDist = dbo.Ranged.CongDist

FROM dbo.Ranged

INNER JOIN dbo.Employee ON( dbo.Employee.hzip = dbo.Ranged.ZIP

AND( dbo.Employee.hzip4 BETWEEN dbo.Ranged.Start AND dbo.Ranged.Stop)

INDEX = NameOfYourClusteredIndex

I wasn't born stupid - I had to study.

AJ Ahrens SSC-Insane Points: 20676 More actions · Answer 2

Do you ALWAYS have to perform this on the entire table? IF not you could shorten the amount of employees by using:

UPDATE E1 SET CongDist = R1.CongDist

FROM dbo.Ranged R1

INNER JOIN (SELECT hzip, hzip4

FROM dbo.Employee

WHERE CongDist IS NULL) E1

ON E1.hzip = R1.ZIP

AND E1.hzip4 BETWEEN R1.Start R1.Stop

OR

UPDATE E1 SET CongDist = R1.CongDist

FROM (SELECT hzip, hzip4

FROM dbo.Employee

WHERE CongDist IS NULL) E1

INNER JOIN dbo.Ranged R1

ON E1.hzip = R1.ZIP

AND E1.hzip4 BETWEEN R1.Start R1.Stop

Whichever table has the least rows should be the FROM (at least that is what I was told once)....

Good Hunting!

AJ Ahrens

webmaster@kritter.net

rschaeferhig SSCrazy Eights Points: 8776 More actions · Answer 3

Yup. The table is periodically rebuilt (i.e. for congressional redistricting) and then we need to run it on the whole table.

I tried:

UPDATE dbo.EMPLOYEE SET

CongDist = dbo.Ranged.CongDist

FROM dbo.Ranged

INNER JOIN dbo.Employee ON( dbo.Employee.hzip = dbo.Ranged.ZIP

AND( dbo.Employee.hzip4 BETWEEN dbo.Ranged.Start AND dbo.Ranged.Stop)

INDEX = IX_Zip

and

UPDATE dbo.EMPLOYEE SET

CongDist = dbo.Ranged.CongDist

FROM dbo.Ranged

INNER JOIN dbo.Employee ON( dbo.Employee.hzip = dbo.Ranged.ZIP

AND( dbo.Employee.hzip4 BETWEEN dbo.Ranged.Start AND dbo.Ranged.Stop)

INDEX = IX_Zip )

and

UPDATE dbo.EMPLOYEE SET

CongDist = dbo.Ranged.CongDist

FROM dbo.Ranged

INNER JOIN dbo.Employee ON( dbo.Employee.hzip = dbo.Ranged.ZIP

AND( dbo.Employee.hzip4 BETWEEN dbo.Ranged.Start AND dbo.Ranged.Stop))

INDEX = IX_Zip

Got Server: Msg 156, Level 15, State 1, Line 6

Incorrect syntax near the keyword 'INDEX'.

on every one of them...

Ninja's_RGR'us SSC Guru Points: 294069 More actions · Answer 4

Try not using the index hint first to see if the performance is better...

Farrell Keough SSCoach Points: 17414 More actions · Answer 5

Definately try it without the Index hint. You could also try putting parenthesis around the (INDEX = IX_Zip).

I wasn't born stupid - I had to study.

rschaeferhig SSCrazy Eights Points: 8776 More actions · Answer 6

rschaeferhig

SSCrazy Eights

Points: 8776

May 17, 2005 at 12:34 pm

#559349

Three seconds faster than the original query...

Ninja's_RGR'us SSC Guru Points: 294069 More actions · Answer 7

Ninja's_RGR'us

SSC Guru

Points: 294069

May 17, 2005 at 12:37 pm

#559353

How much time does it take now?

rschaeferhig SSCrazy Eights Points: 8776 More actions · Answer 8

1:12 (old) vs 1:09 (new). It doesn't seem like it's a lot but I have to run a total of 9 queries just like it for different target tables all about the same size. The total job runs 10+ minutes. I was hoping to get it down by at least a few minutes.

I could probably speed it up by playing with INDEX and NOLOCK hints if I could figure out the syntax, but the execution plan shows it's using the IX_Zip index (so why put in a hint) and if I turn off locking I could screw up users of the application.

If that's what it is then that's what it is...

Ninja's_RGR'us SSC Guru Points: 294069 More actions · Answer 9

1 min to update 1m rows doesn't seem like a long time to me but I've never really worked with tables that big. Maybe you could merge some of the updates so that you only have to run 4-5 queries instead of 10.

rschaeferhig SSCrazy Eights Points: 8776 More actions · Answer 10

Actually it's 1 min to update 30k rows using the 1+M row table as the source for the Update command. Looking at it I might be able to combine a couple of the queries. I'll try that and see if it extends the time by much.

Farrell Keough SSCoach Points: 17414 More actions · Answer 11

It sounds like you do not need the Index hint. Do you have and index on Employee.hzip?

Between can be a difficulty especially since a zip code cannot be numeric. But..., since you are already using an equality on the zip, you may want to do a sub-select (maybe an INNER JOIN within parenthesis) on just the longer zip code records which match the Ranged.Zip. Once you have these records isolated, it should may speed up the BETWEEN.

You might want to try playing with that kind of viewpoint as well...

UPDATE dbo.Employee SET

CongDist = R.CongDist

FROM dbo.Ranged R

INNER JOIN( SELECT R2.Start, R2.Stop FROM dbo.Ranged R2 WHERE dbo.Employee.hzip = R2.Zip) Range

WHERE dbo.Employee.hzip4 BETWEEN Range.Start AND Range.Stop

I wasn't born stupid - I had to study.

Michael Du Bois SSCertifiable Points: 5163 More actions · Answer 12

Farrell zip codes are numbers although their function is nominative rather than relational. They are used to descripe areas covered by a Post Office. The first number represents a group of states, the second and third umbers define a region within those states and the 4th and 5th number represent more specific areas with in that region. The zip + 4 is a further division of the area described by the first 5 digits. A 9 digit (zip + 4) code is a subset of a 5 digit zip code.

You will not need 2 col in your db one for 5 digit zip codes and one for 9 digit zip codes as you can always extract the first 5 digits in you want to describe a broader geographical area.

You should be able to use the coding schema of the post office to refine your second table. For instance a congressional distirct would cover all areas served by zip codes where the first three numbers are 322 (Northeast Florida).

Mike

Michael Du Bois SSCertifiable Points: 5163 More actions · Answer 13

Another thought on zip codes is that since they are nominative you can not assume that consective zip codes are geographically connected. If a new post office were to open next door to me then my zip could change from 32207 to 32277 which is the next non used number in the Geographic area described by the zip code 322 (North east florida). Trying to determine the congressional district for an employee by zip codes which can change be changed by the Post Office with no assurance that the new code will not split a congressional distirct may not be sound. Additionally the boundries of a congressional district can be changed. If I wanted to know the congressional district of an employee Added by edit. Its always bad form to not offer a solution when commenting on what could be flaws in the current process.

Mike

rschaeferhig SSCrazy Eights Points: 8776 More actions · Answer 14

Hopefully to close the debate on the soundness of determining Congressional districts from Zip Codes, we are provided a database of Zip Code to District matching by a data vendor. They provide quarterly updates (or more frequently if something is radically changed) to ensure the mapping is accurate.

No, I don't really need to keep the ZIP and ZIP+4 in separate columns, but that's the way the vendor's data comes. THey could have used:

Start End District

061040000 061040139 05

But instead chose to go with:

ZIP Start End District

06104 0000 0000 05

It seems to me that it is more efficient for me to store my ZIP+4 in a separate column for my Employees, especially since our HR department doesn't collect it so I have to use another software package to generate it from their address (we have more problems with that than with ZIP->District). If I stored the employee's ZIP as 061040130 then I'd have to use RIGHT(ZIP,4) everywhere I wanted to compare to the vendor's Ranged table and that would slow down the query.

I suppose I could restructure the vendor's table as part of my DTS job to load it, but why bother? I don't think it would significantly improve the performance.