January 27, 2014 at 2:59 pm
Same company same address = duplicate
No company but with same address = duplicate
Unfortunately, what happens if you get:
different company / same address
and
no company / same address,
then which of the rows in the first line do you dedupe or match the rows in the second line with???
If you never match "no company" rows with "has company" rows, then dedupe the two separately (ie., if no company, then put a unique string for company name ie., "no company at all" or something along those lines). I'm partial to a "group by" solution myself, obviously thats going to imply a sort somewhere along the lines (and could be simplifying the problem beyond your expectations or still running too long / use too much tempdb, but I like that as opposed to the triangular join).
January 27, 2014 at 4:36 pm
by my reconning there are no duplicates in your sample data. NOTE: quickie, unverified code
;with a as (select organisationname, address1, row_number() OVER (partition by organisationname, address1 ORDER BY ID) as rownum
from #scc)
select *
from a
where rownum <> 1
select organisationname, address1
from #scc
order by organisationname, address1
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
January 27, 2014 at 4:55 pm
I'm afraid you're asking for code that would make these two:
21ST CENTURY SERIVES
21ST CENTURY SERVICES
equal, not because they are, but because they SHOULD be. Is that true? The business rules to cover all the cases of different spellings, transpositions, and missing elements in your data would be far more complex than I can imagine. If this is what you're asking, TSQL might not be your best avenue to a solution. You might want to contract with a service like D&B that can rationalize these for you. If I've misunderstood, I'm sorry for distracting everyone.
January 27, 2014 at 5:25 pm
I haven't checked for all of the columns but the original query had a "Hierarchy" column in it, which is missing from the test data. Considering that the triangular join on that column may be one of the larger problems, I'd recommend regenerating the test table and test data so we can actually have a go at it.
Also, do you mean to compare a.Address1 with b.Address2 for sure?
--Jeff Moden
Change is inevitable... Change for the better is not.
January 27, 2014 at 6:34 pm
David Webb-CDS (1/27/2014)
I'm afraid you're asking for code that would make these two:21ST CENTURY SERIVES
21ST CENTURY SERVICES
equal, not because they are, but because they SHOULD be. Is that true? The business rules to cover all the cases of different spellings, transpositions, and missing elements in your data would be far more complex than I can imagine. If this is what you're asking, TSQL might not be your best avenue to a solution. You might want to contract with a service like D&B that can rationalize these for you. If I've misunderstood, I'm sorry for distracting everyone.
You could very well be correct! I note there are also (rather expensive) third-party products that will help with this type of fuzzy matching too. But they are a near-must when you have ugly inputs such as this. I see those types of systems a LOT in the healthcare, insurance and government clients I have.
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
January 27, 2014 at 10:30 pm
Abu Dina (1/27/2014)
It's really about finding duplicate entries within the record set. A duplicate in this case is at site/business level.Same company same address = duplicate
No company but with same address = duplicate
Order the result set by company name and the duplicates will seem obvious.
In that case it should be as simple as:
DENSE_RANK() OVER (Order By Organisationname, Address1, Address2, Address3, Address5, Town, County, Postcode, Country) AS SiteID
Same number assigned for companies with the same name and address
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
January 28, 2014 at 1:50 am
David Webb-CDS (1/27/2014)
I'm afraid you're asking for code that would make these two:21ST CENTURY SERIVES
21ST CENTURY SERVICES
equal, not because they are, but because they SHOULD be. Is that true? The business rules to cover all the cases of different spellings, transpositions, and missing elements in your data would be far more complex than I can imagine. If this is what you're asking, TSQL might not be your best avenue to a solution. You might want to contract with a service like D&B that can rationalize these for you. If I've misunderstood, I'm sorry for distracting everyone.
The above company names would be considered a match. You are right about the complexities of this exercise. I try to use TSQL where I'm performing set based operations. When it actually comes down to comparing the company names, I use CLR functions to try and work out the spelling/transpositions/missing words etc....
---------------------------------------------------------
It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens[/url]
Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
January 28, 2014 at 1:55 am
Abu Dina (1/28/2014)
David Webb-CDS (1/27/2014)
I'm afraid you're asking for code that would make these two:21ST CENTURY SERIVES
21ST CENTURY SERVICES
equal, not because they are, but because they SHOULD be. Is that true? The business rules to cover all the cases of different spellings, transpositions, and missing elements in your data would be far more complex than I can imagine. If this is what you're asking, TSQL might not be your best avenue to a solution. You might want to contract with a service like D&B that can rationalize these for you. If I've misunderstood, I'm sorry for distracting everyone.
The above company names would be considered a match.
In that case, T-SQL is not the right solution. Try SSIS with the fuzzy grouping operator to match similar names?
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
January 28, 2014 at 2:17 am
ChrisM@Work (1/27/2014)
Check the actual plan of each to ensure you're getting an index scan on one side and seeks on the other and check the nested loops join operator carefully - better still, post the two plans here so we can play ๐
Yup.. clearly you've been through this exercise before so you know all about the performance of queries when de-duping large record sets.
---------------------------------------------------------
It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens[/url]
Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
January 28, 2014 at 2:21 am
GilaMonster (1/28/2014)
Abu Dina (1/28/2014)
David Webb-CDS (1/27/2014)
I'm afraid you're asking for code that would make these two:21ST CENTURY SERIVES
21ST CENTURY SERVICES
equal, not because they are, but because they SHOULD be. Is that true? The business rules to cover all the cases of different spellings, transpositions, and missing elements in your data would be far more complex than I can imagine. If this is what you're asking, TSQL might not be your best avenue to a solution. You might want to contract with a service like D&B that can rationalize these for you. If I've misunderstood, I'm sorry for distracting everyone.
The above company names would be considered a match.
In that case, T-SQL is not the right solution. Try SSIS with the fuzzy grouping operator to match similar names?
SSIS might be great de-duping smallish record sets. When you get to de-dupe 50-100 million record sets then it's simply not up to the task. So far the best performance I've had is a combination of TSQL and C# CLR TVFs
---------------------------------------------------------
It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens[/url]
Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
January 28, 2014 at 2:25 am
Jeff Moden (1/27/2014)
I haven't checked for all of the columns but the original query had a "Hierarchy" column in it, which is missing from the test data. Considering that the triangular join on that column may be one of the larger problems, I'd recommend regenerating the test table and test data so we can actually have a go at it.Also, do you mean to compare a.Address1 with b.Address2 for sure?
Good point Jeff but I was interested to see how they would de-dupe the 200 records. I kind of forgot about the triangular join. In the end I did as Chris M suggested and split the queries to eliminate the OR. I also forced a join hint and used an INNER LOOP JOIN instead.
Query performance improved significantly! (with the appropriate indexes of course!
... and yes the a.address1 = b.address2 is right, in the past I've missed genuine duplicates because I didn't perform cross matching like the above.
---------------------------------------------------------
It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens[/url]
Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
January 28, 2014 at 3:01 am
Gail's suggestion using DENSE_RANK() is well worth a try when you get time.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
January 29, 2014 at 3:56 am
ChrisM@Work (1/28/2014)
Gail's suggestion using DENSE_RANK() is well worth a try when you get time.
Is it?
So what happens if I have two records with the same Organisationname but the address1 of the first record matches the address2 of the second?
Is it possible to have multiple PARTITION BYs using the same rank?
---------------------------------------------------------
It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens[/url]
Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
January 29, 2014 at 4:02 am
How about 2 passes, first to sort out address similarities and then second to dedupe including Organisation
Far away is close at hand in the images of elsewhere.
Anon.
January 29, 2014 at 4:05 am
Abu Dina (1/29/2014)
ChrisM@Work (1/28/2014)
Gail's suggestion using DENSE_RANK() is well worth a try when you get time.Is it?
So what happens if I have two records with the same Organisationname but the address1 of the first record matches the address2 of the second?
This is why we asked again and again for the expected results, so that we could see what you consider a duplicate and what you don't consider a duplicate before we spent time writing code that doesn't do what you want.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
Viewing 15 posts - 16 through 30 (of 31 total)
You must be logged in to reply to this topic. Login to reply