Better way to do this?

  • I have a table of about 30k employees. For each employee I have their ZIP and ZIP+4. I have a second table (Ranged) that has a ZIP, ZIP+4_Start, ZIP+4_Stop and Congressional District. Every employee needs to be assigned a Congressional district from the Ranged table. The employee's district is determined by the two ZIP's matching and the employee's ZIP+4 being between ZIP+4_Start and ZIP+4_Stop in the Ranged table. I've been using this query:

    UPDATE dbo.EMPLOYEE

     SET CongDist = (Select CongDist from dbo.Ranged

        Where dbo.Employee.hzip = dbo.Ranged.ZIP AND

              (dbo.Employee.hzip4 between dbo.Ranged.Start AND

               dbo.Ranged.Stop))

    From dbo.Employee, dbo.Ranged

    But it runs like a dog. The execution plan shows all sorts of index scans, hashes, aggregates, etc. Much more complex than I would have thought.

    The Range table has millions of rows (1 for every ZIP/ZIP+4 to District combination). I put a clustered index on the Ranged table using the ZIP, ZIP+4_Start and ZIP+4_Stop. That helped a bit, but it's still very slow.

    Can anyone think of a better approach (this one was my best shot, it's better than the CURSOR my predecessor was using to run through the Employees table and lookup each employee zip etc in the Ranged table. Yeesh!)

    TIA 

     

  • You could try this: 

    UPDATE dbo.EMPLOYEE SET

              CongDist = dbo.Ranged.CongDist

    FROM dbo.Ranged

         INNER JOIN dbo.Employee ON( dbo.Employee.hzip = dbo.Ranged.ZIP

              AND( dbo.Employee.hzip4 BETWEEN dbo.Ranged.Start AND dbo.Ranged.Stop)

    INDEX = NameOfYourClusteredIndex

    I wasn't born stupid - I had to study.

  • Do you ALWAYS have to perform this on the entire table?  IF not you could shorten the amount of employees by using:

    UPDATE E1 SET CongDist = R1.CongDist

      FROM dbo.Ranged R1

        INNER JOIN (SELECT hzip, hzip4

                      FROM dbo.Employee

                    WHERE CongDist IS NULL) E1

          ON E1.hzip = R1.ZIP

            AND E1.hzip4 BETWEEN R1.Start R1.Stop

    OR

    UPDATE E1 SET CongDist = R1.CongDist

      FROM (SELECT hzip, hzip4

              FROM dbo.Employee

            WHERE CongDist IS NULL) E1

        INNER JOIN dbo.Ranged R1

          ON E1.hzip = R1.ZIP

            AND E1.hzip4 BETWEEN R1.Start R1.Stop

    Whichever table has the least rows should be the FROM (at least that is what I was told once)....



    Good Hunting!

    AJ Ahrens


    webmaster@kritter.net

  • Yup. The table is periodically rebuilt (i.e. for congressional redistricting) and then we need to run it on the whole table.

    I tried:

    UPDATE dbo.EMPLOYEE SET

              CongDist = dbo.Ranged.CongDist

    FROM dbo.Ranged

         INNER JOIN dbo.Employee ON( dbo.Employee.hzip = dbo.Ranged.ZIP

              AND( dbo.Employee.hzip4 BETWEEN dbo.Ranged.Start AND dbo.Ranged.Stop)

    INDEX = IX_Zip

    and

    UPDATE dbo.EMPLOYEE SET

              CongDist = dbo.Ranged.CongDist

    FROM dbo.Ranged

         INNER JOIN dbo.Employee ON( dbo.Employee.hzip = dbo.Ranged.ZIP

              AND( dbo.Employee.hzip4 BETWEEN dbo.Ranged.Start AND dbo.Ranged.Stop)

    INDEX = IX_Zip )

    and

    UPDATE dbo.EMPLOYEE SET

              CongDist = dbo.Ranged.CongDist

    FROM dbo.Ranged

         INNER JOIN dbo.Employee ON( dbo.Employee.hzip = dbo.Ranged.ZIP

              AND( dbo.Employee.hzip4 BETWEEN dbo.Ranged.Start AND dbo.Ranged.Stop))

    INDEX = IX_Zip

    Got Server: Msg 156, Level 15, State 1, Line 6

    Incorrect syntax near the keyword 'INDEX'.

    on every one of them...

     

  • Try not using the index hint first to see if the performance is better...

  • Definately try it without the Index hint.  You could also try putting parenthesis around the (INDEX = IX_Zip).

    I wasn't born stupid - I had to study.

  • Three seconds faster than the original query...

     

  • How much time does it take now?

  • 1:12 (old) vs 1:09 (new). It doesn't seem like it's a lot but I have to run a total of 9 queries just like it for different target tables all about the same size. The total job runs 10+ minutes. I was hoping to get it down by at least a few minutes.

    I could probably speed it up by playing with INDEX and NOLOCK hints if I could figure out the syntax, but the execution plan shows it's using the IX_Zip index (so why put in a hint) and if I turn off locking I could screw up users of the application.

    If that's what it is then that's what it is...

     

  • 1 min to update 1m rows doesn't seem like a long time to me but I've never really worked with tables that big. Maybe you could merge some of the updates so that you only have to run 4-5 queries instead of 10.

  • Actually it's 1 min to update 30k rows using the 1+M row table as the source for the Update command. Looking at it I might be able to combine a couple of the queries. I'll try that and see if it extends the time by much.

     

  • It sounds like you do not need the Index hint.  Do you have and index on Employee.hzip? 

    Between can be a difficulty especially since a zip code cannot be numeric.  But..., since you are already using an equality on the zip, you may want to do a sub-select (maybe an INNER JOIN within parenthesis) on just the longer zip code records which match the Ranged.Zip.  Once you have these records isolated, it should may speed up the BETWEEN. 

    You might want to try playing with that kind of viewpoint as well...

     

    UPDATE dbo.Employee SET

              CongDist = R.CongDist

    FROM dbo.Ranged R

         INNER JOIN( SELECT R2.Start, R2.Stop FROM dbo.Ranged R2 WHERE dbo.Employee.hzip = R2.Zip) Range

    WHERE dbo.Employee.hzip4 BETWEEN Range.Start AND Range.Stop

     

    I wasn't born stupid - I had to study.

  • Farrell zip codes are numbers although their function is nominative rather than relational. They are used to descripe areas covered by a Post Office. The first number represents a group of states, the second and third umbers define a region within those states and the 4th and 5th number represent more specific areas with in that region. The zip + 4 is a further division of the area described by the first 5 digits. A 9 digit (zip + 4) code is a subset of a 5 digit zip code.

    You will not need 2 col in your db one for 5 digit zip codes and one for 9 digit zip codes as you can always extract the first 5 digits in you want to describe a broader geographical area.

    You should be able to use the coding schema of the post office to refine your second table. For instance a congressional distirct would cover all areas served by zip codes where the first three numbers are 322 (Northeast Florida).

    Mike

  • Another thought on zip codes is that since they are nominative you can not assume that consective zip codes are geographically connected. If a new post office were to open next door to me then my zip could change from 32207 to 32277 which is the next non used number in the Geographic area described by the zip code 322 (North east florida). Trying to determine the congressional district for an employee by zip codes which can change be changed by the Post Office with no assurance that the new code will not split a congressional distirct may not be sound. Additionally the boundries of a congressional district can be changed. If I wanted to know the congressional district of an employee Added by edit. Its  always bad form to not offer a solution when commenting on what could be flaws in the current process.

    Mike

  • Hopefully to close the debate on the soundness of determining Congressional districts from Zip Codes, we are provided a database of Zip Code to District matching by a data vendor. They provide quarterly updates (or more frequently if something is radically changed) to ensure the mapping is accurate.

    No, I don't really need to keep the ZIP and ZIP+4 in separate columns, but that's the way the vendor's data comes. THey could have used:

             Start               End                   District

             061040000       061040139          05

    But instead chose to go with:

             ZIP                Start            End         District

             06104             0000            0000       05

    It seems to me that it is more efficient for me to store my ZIP+4 in a separate column for my Employees, especially since our HR department doesn't collect it so I have to use another software package to generate it from their address (we have more problems with that than with ZIP->District). If I stored the employee's ZIP as 061040130 then I'd have to use RIGHT(ZIP,4) everywhere I wanted to compare to the vendor's Ranged table and that would slow down the query.

    I suppose I could restructure the vendor's table as part of my DTS job to load it, but why bother? I don't think it would significantly improve the performance.

     

     

Viewing 15 posts - 1 through 15 (of 22 total)

You must be logged in to reply to this topic. Login to reply