Query just hangs!

  • Same company same address = duplicate

    No company but with same address = duplicate

    Unfortunately, what happens if you get:

    different company / same address

    and

    no company / same address,

    then which of the rows in the first line do you dedupe or match the rows in the second line with???

    If you never match "no company" rows with "has company" rows, then dedupe the two separately (ie., if no company, then put a unique string for company name ie., "no company at all" or something along those lines). I'm partial to a "group by" solution myself, obviously thats going to imply a sort somewhere along the lines (and could be simplifying the problem beyond your expectations or still running too long / use too much tempdb, but I like that as opposed to the triangular join).

  • by my reconning there are no duplicates in your sample data. NOTE: quickie, unverified code

    ;with a as (select organisationname, address1, row_number() OVER (partition by organisationname, address1 ORDER BY ID) as rownum

    from #scc)

    select *

    from a

    where rownum <> 1

    select organisationname, address1

    from #scc

    order by organisationname, address1

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • I'm afraid you're asking for code that would make these two:

    21ST CENTURY SERIVES

    21ST CENTURY SERVICES

    equal, not because they are, but because they SHOULD be. Is that true? The business rules to cover all the cases of different spellings, transpositions, and missing elements in your data would be far more complex than I can imagine. If this is what you're asking, TSQL might not be your best avenue to a solution. You might want to contract with a service like D&B that can rationalize these for you. If I've misunderstood, I'm sorry for distracting everyone.


    And then again, I might be wrong ...
    David Webb

  • I haven't checked for all of the columns but the original query had a "Hierarchy" column in it, which is missing from the test data. Considering that the triangular join on that column may be one of the larger problems, I'd recommend regenerating the test table and test data so we can actually have a go at it.

    Also, do you mean to compare a.Address1 with b.Address2 for sure?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • David Webb-CDS (1/27/2014)


    I'm afraid you're asking for code that would make these two:

    21ST CENTURY SERIVES

    21ST CENTURY SERVICES

    equal, not because they are, but because they SHOULD be. Is that true? The business rules to cover all the cases of different spellings, transpositions, and missing elements in your data would be far more complex than I can imagine. If this is what you're asking, TSQL might not be your best avenue to a solution. You might want to contract with a service like D&B that can rationalize these for you. If I've misunderstood, I'm sorry for distracting everyone.

    You could very well be correct! I note there are also (rather expensive) third-party products that will help with this type of fuzzy matching too. But they are a near-must when you have ugly inputs such as this. I see those types of systems a LOT in the healthcare, insurance and government clients I have.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Abu Dina (1/27/2014)


    It's really about finding duplicate entries within the record set. A duplicate in this case is at site/business level.

    Same company same address = duplicate

    No company but with same address = duplicate

    Order the result set by company name and the duplicates will seem obvious.

    In that case it should be as simple as:

    DENSE_RANK() OVER (Order By Organisationname, Address1, Address2, Address3, Address5, Town, County, Postcode, Country) AS SiteID

    Same number assigned for companies with the same name and address

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • David Webb-CDS (1/27/2014)


    I'm afraid you're asking for code that would make these two:

    21ST CENTURY SERIVES

    21ST CENTURY SERVICES

    equal, not because they are, but because they SHOULD be. Is that true? The business rules to cover all the cases of different spellings, transpositions, and missing elements in your data would be far more complex than I can imagine. If this is what you're asking, TSQL might not be your best avenue to a solution. You might want to contract with a service like D&B that can rationalize these for you. If I've misunderstood, I'm sorry for distracting everyone.

    The above company names would be considered a match. You are right about the complexities of this exercise. I try to use TSQL where I'm performing set based operations. When it actually comes down to comparing the company names, I use CLR functions to try and work out the spelling/transpositions/missing words etc....

    ---------------------------------------------------------

    It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
    David Edwards - Media lens[/url]

    Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
    Howard Zinn

  • Abu Dina (1/28/2014)


    David Webb-CDS (1/27/2014)


    I'm afraid you're asking for code that would make these two:

    21ST CENTURY SERIVES

    21ST CENTURY SERVICES

    equal, not because they are, but because they SHOULD be. Is that true? The business rules to cover all the cases of different spellings, transpositions, and missing elements in your data would be far more complex than I can imagine. If this is what you're asking, TSQL might not be your best avenue to a solution. You might want to contract with a service like D&B that can rationalize these for you. If I've misunderstood, I'm sorry for distracting everyone.

    The above company names would be considered a match.

    In that case, T-SQL is not the right solution. Try SSIS with the fuzzy grouping operator to match similar names?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • ChrisM@Work (1/27/2014)


    Check the actual plan of each to ensure you're getting an index scan on one side and seeks on the other and check the nested loops join operator carefully - better still, post the two plans here so we can play ๐Ÿ™‚

    Yup.. clearly you've been through this exercise before so you know all about the performance of queries when de-duping large record sets.

    ---------------------------------------------------------

    It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
    David Edwards - Media lens[/url]

    Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
    Howard Zinn

  • GilaMonster (1/28/2014)


    Abu Dina (1/28/2014)


    David Webb-CDS (1/27/2014)


    I'm afraid you're asking for code that would make these two:

    21ST CENTURY SERIVES

    21ST CENTURY SERVICES

    equal, not because they are, but because they SHOULD be. Is that true? The business rules to cover all the cases of different spellings, transpositions, and missing elements in your data would be far more complex than I can imagine. If this is what you're asking, TSQL might not be your best avenue to a solution. You might want to contract with a service like D&B that can rationalize these for you. If I've misunderstood, I'm sorry for distracting everyone.

    The above company names would be considered a match.

    In that case, T-SQL is not the right solution. Try SSIS with the fuzzy grouping operator to match similar names?

    SSIS might be great de-duping smallish record sets. When you get to de-dupe 50-100 million record sets then it's simply not up to the task. So far the best performance I've had is a combination of TSQL and C# CLR TVFs

    ---------------------------------------------------------

    It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
    David Edwards - Media lens[/url]

    Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
    Howard Zinn

  • Jeff Moden (1/27/2014)


    I haven't checked for all of the columns but the original query had a "Hierarchy" column in it, which is missing from the test data. Considering that the triangular join on that column may be one of the larger problems, I'd recommend regenerating the test table and test data so we can actually have a go at it.

    Also, do you mean to compare a.Address1 with b.Address2 for sure?

    Good point Jeff but I was interested to see how they would de-dupe the 200 records. I kind of forgot about the triangular join. In the end I did as Chris M suggested and split the queries to eliminate the OR. I also forced a join hint and used an INNER LOOP JOIN instead.

    Query performance improved significantly! (with the appropriate indexes of course!

    ... and yes the a.address1 = b.address2 is right, in the past I've missed genuine duplicates because I didn't perform cross matching like the above.

    ---------------------------------------------------------

    It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
    David Edwards - Media lens[/url]

    Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
    Howard Zinn

  • Gail's suggestion using DENSE_RANK() is well worth a try when you get time.

    โ€œWrite the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.โ€ - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • ChrisM@Work (1/28/2014)


    Gail's suggestion using DENSE_RANK() is well worth a try when you get time.

    Is it?

    So what happens if I have two records with the same Organisationname but the address1 of the first record matches the address2 of the second?

    Is it possible to have multiple PARTITION BYs using the same rank?

    ---------------------------------------------------------

    It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
    David Edwards - Media lens[/url]

    Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
    Howard Zinn

  • How about 2 passes, first to sort out address similarities and then second to dedupe including Organisation

    Far away is close at hand in the images of elsewhere.
    Anon.

  • Abu Dina (1/29/2014)


    ChrisM@Work (1/28/2014)


    Gail's suggestion using DENSE_RANK() is well worth a try when you get time.

    Is it?

    So what happens if I have two records with the same Organisationname but the address1 of the first record matches the address2 of the second?

    This is why we asked again and again for the expected results, so that we could see what you consider a duplicate and what you don't consider a duplicate before we spent time writing code that doesn't do what you want.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass

Viewing 15 posts - 16 through 30 (of 31 total)

You must be logged in to reply to this topic. Login to reply