Identify the missing element [Query]

  • Hi all, I have a new challenge for you!

    I have written a simple routine to do some address checking between sources. The business owner wanted the check done to the address as a whole and to report the address' that did not return a match. The comparison works perfectly fine.

    Well, now the customer has come back wanting to know what element(s) of the address make the match fail. I'm having a booger of a time coming up with an efficient solution.

    Here's the basic address compare:

    SELECT *

    FROM Full_Address_Map fam

    WHERE fam.StreetName + ' ' + fam.RoadType + ' ' + fam.CityName + ' ' + fam.StateName + ' ' + fam.Country COLLATE Cyrillic_General_CI_AI NOT IN

    (

    SELECT s.base_name + ' ' + s.road_type + ' ' + p.name + ' ' + v.name + ' ' + v.country_name COLLATE Cyrillic_General_CI_AI

    FROM tbl_street s

    INNER JOIN tbl_place p

    ON s.parent_id = p.place_id

    INNER JOIN state_list_view v

    ON s.state_id = v.place_id

    WHERE s.base_name COLLATE Cyrillic_General_CI_AI = fam.StreetName COLLATE Cyrillic_General_CI_AI

    AND s.road_type COLLATE Cyrillic_General_CI_AI = fam.RoadType COLLATE Cyrillic_General_CI_AI

    AND p.name COLLATE Cyrillic_General_CI_AI = fam.CityName COLLATE Cyrillic_General_CI_AI

    AND v.name COLLATE Cyrillic_General_CI_AI = fam.StateName COLLATE Cyrillic_General_CI_AI

    AND v.country_name COLLATE Cyrillic_General_CI_AI = fam.Country COLLATE Cyrillic_General_CI_AI

    )

    You can see that we are building the address by sticking the 5 columns together with a space in between to create a single string. We are then doing the same thing with the other source data and simply comparing the string. This worked perfectly and was nice and simple... until the customer decided he wanted to see the element(s) causing the issue.

    For example, let's say there is an address of '123 Main ST Dallas Texas United States' in one source, but it's '123 Main Dallas Texas United States' in the other. The customer wants to know that it is the road type (ST) that caused the mismatch.

    Now, I wrote a bunch of queries to try and single out each element in order to determine if it is the one causing the issue or not. It seems to work partially, but I don't believe it is the best approach and it doesn't work if more than one column is the culprit.

    Any ideas? I'm not opposed to changing the way I do the initial address compare, if that will prove to be the better approach.

  • I don't have access to a pc at the moment otherwise I would give a better example...

    You could accomplish this using a splitter and splitting the strings using the space as a delimiter; note the splitter in my signature line.

    I will put an example together when I am at a pc (likely not tonight)

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Ok, I had a few minutes to kill. There will be a little more work to get this to work with what you have done thus far.

    The Function:

    CREATE FUNCTION dbo.stringDiff(@s1 varchar(100), @s2 varchar(100))

    RETURNS TABLE AS

    RETURN

    WITH

    s1 AS

    (

    SELECT s = 1, *

    FROM dbo.DelimitedSplit8K(@s1,' ') s1

    ),

    s2 AS

    (

    SELECT s = 2, *

    FROM dbo.DelimitedSplit8K(@s2,' ') s1

    )

    SELECT TOP 1

    s1 = @s1,

    s2 = @s2,

    Diff_Position = s1.ItemNumber,

    s1_txt = isnull(s1.Item,''),

    s2_txt = isnull(s2.Item,'')

    FROM s1

    LEFT JOIN s2 ON s1.Item = s2.Item

    WHERE isnull(s1.Item,'') <> isnull(s2.Item,'')

    ORDER BY s1.ItemNumber;

    GO

    Use:

    DECLARE @s1 varchar(100) = '123 Main ST Dallas Texas United States',

    @s2 varchar(100) = '123 Main Dallas Texas United States';

    SELECT * FROM dbo.stringDiff(@s1, @s2);

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • -- If you change your correlated NOT IN subquery to a standard NOT EXISTS subquery like this:

    SELECT *

    FROM Full_Address_Map fam

    WHERE NOT EXISTS (

    SELECT 1

    FROM tbl_street s

    INNER JOIN tbl_place p

    ON s.parent_id = p.place_id

    INNER JOIN state_list_view v

    ON s.state_id = v.place_id

    WHERE s.base_name COLLATE Cyrillic_General_CI_AI = fam.StreetName COLLATE Cyrillic_General_CI_AI

    AND s.road_type COLLATE Cyrillic_General_CI_AI = fam.RoadType COLLATE Cyrillic_General_CI_AI

    AND p.name COLLATE Cyrillic_General_CI_AI = fam.CityName COLLATE Cyrillic_General_CI_AI

    AND v.name COLLATE Cyrillic_General_CI_AI = fam.StateName COLLATE Cyrillic_General_CI_AI

    AND v.country_name COLLATE Cyrillic_General_CI_AI = fam.Country COLLATE Cyrillic_General_CI_AI

    )

    -- which is very nearly the logical equivalent, then you can easily eliminate predicates from the WHERE clause.

    -- You will probably have to decide which columns you can safely mis-match on, or you could end up with tens of thousands of

    -- false positives: for instance, mismatches on StreetName + RoadType would give everybody in the same city/state/country.

    -- Google "fuzzy matching" to learn about how this exercise is usually done.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply