Query just hangs!

Question

Post reply

Query just hangs!

Viewing 15 posts - 16 through 30 (of 31 total)

You must be logged in to reply to this topic. Login to reply

x SSC-Insane Points: 23660 More actions · Answer 1

Same company same address = duplicate
No company but with same address = duplicate

Unfortunately, what happens if you get:

different company / same address

and

no company / same address,

then which of the rows in the first line do you dedupe or match the rows in the second line with???

If you never match "no company" rows with "has company" rows, then dedupe the two separately (ie., if no company, then put a unique string for company name ie., "no company at all" or something along those lines). I'm partial to a "group by" solution myself, obviously thats going to imply a sort somewhere along the lines (and could be simplifying the problem beyond your expectations or still running too long / use too much tempdb, but I like that as opposed to the triangular join).

TheSQLGuru SSC Guru Points: 134017 More actions · Answer 2

by my reconning there are no duplicates in your sample data. NOTE: quickie, unverified code

;with a as (select organisationname, address1, row_number() OVER (partition by organisationname, address1 ORDER BY ID) as rownum

from #scc)

select *

from a

where rownum <> 1

select organisationname, address1

from #scc

order by organisationname, address1

Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service

David Webb-CDS SSCoach Points: 17398 More actions · Answer 3

I'm afraid you're asking for code that would make these two:

21ST CENTURY SERIVES

21ST CENTURY SERVICES

equal, not because they are, but because they SHOULD be. Is that true? The business rules to cover all the cases of different spellings, transpositions, and missing elements in your data would be far more complex than I can imagine. If this is what you're asking, TSQL might not be your best avenue to a solution. You might want to contract with a service like D&B that can rationalize these for you. If I've misunderstood, I'm sorry for distracting everyone.

And then again, I might be wrong ...
David Webb

Jeff Moden SSC Guru Points: 1004406 More actions · Answer 4

I haven't checked for all of the columns but the original query had a "Hierarchy" column in it, which is missing from the test data. Considering that the triangular join on that column may be one of the larger problems, I'd recommend regenerating the test table and test data so we can actually have a go at it.

Also, do you mean to compare a.Address1 with b.Address2 for sure?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

TheSQLGuru SSC Guru Points: 134017 More actions · Answer 5

David Webb-CDS (1/27/2014)
I'm afraid you're asking for code that would make these two:
21ST CENTURY SERIVES
21ST CENTURY SERVICES
equal, not because they are, but because they SHOULD be. Is that true? The business rules to cover all the cases of different spellings, transpositions, and missing elements in your data would be far more complex than I can imagine. If this is what you're asking, TSQL might not be your best avenue to a solution. You might want to contract with a service like D&B that can rationalize these for you. If I've misunderstood, I'm sorry for distracting everyone.

You could very well be correct! I note there are also (rather expensive) third-party products that will help with this type of fuzzy matching too. But they are a near-must when you have ugly inputs such as this. I see those types of systems a LOT in the healthcare, insurance and government clients I have.

Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 6

Abu Dina (1/27/2014)
It's really about finding duplicate entries within the record set. A duplicate in this case is at site/business level.
Same company same address = duplicate
No company but with same address = duplicate
Order the result set by company name and the duplicates will seem obvious.

In that case it should be as simple as:

DENSE_RANK() OVER (Order By Organisationname, Address1, Address2, Address3, Address5, Town, County, Postcode, Country) AS SiteID

Same number assigned for companies with the same name and address

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Abu Dina SSChampion Points: 14155 More actions · Answer 7

David Webb-CDS (1/27/2014)
I'm afraid you're asking for code that would make these two:
21ST CENTURY SERIVES
21ST CENTURY SERVICES
equal, not because they are, but because they SHOULD be. Is that true? The business rules to cover all the cases of different spellings, transpositions, and missing elements in your data would be far more complex than I can imagine. If this is what you're asking, TSQL might not be your best avenue to a solution. You might want to contract with a service like D&B that can rationalize these for you. If I've misunderstood, I'm sorry for distracting everyone.

The above company names would be considered a match. You are right about the complexities of this exercise. I try to use TSQL where I'm performing set based operations. When it actually comes down to comparing the company names, I use CLR functions to try and work out the spelling/transpositions/missing words etc....

---------------------------------------------------------

It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens[/url]

Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 8

Abu Dina (1/28/2014)
David Webb-CDS (1/27/2014)
I'm afraid you're asking for code that would make these two:
21ST CENTURY SERIVES
21ST CENTURY SERVICES
equal, not because they are, but because they SHOULD be. Is that true? The business rules to cover all the cases of different spellings, transpositions, and missing elements in your data would be far more complex than I can imagine. If this is what you're asking, TSQL might not be your best avenue to a solution. You might want to contract with a service like D&B that can rationalize these for you. If I've misunderstood, I'm sorry for distracting everyone.
The above company names would be considered a match.

In that case, T-SQL is not the right solution. Try SSIS with the fuzzy grouping operator to match similar names?

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Abu Dina SSChampion Points: 14155 More actions · Answer 9

ChrisM@Work (1/27/2014)
Check the actual plan of each to ensure you're getting an index scan on one side and seeks on the other and check the nested loops join operator carefully - better still, post the two plans here so we can play 🙂

Yup.. clearly you've been through this exercise before so you know all about the performance of queries when de-duping large record sets.

---------------------------------------------------------

It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens[/url]

Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn

Abu Dina SSChampion Points: 14155 More actions · Answer 10

GilaMonster (1/28/2014)
Abu Dina (1/28/2014)
David Webb-CDS (1/27/2014)
I'm afraid you're asking for code that would make these two:
21ST CENTURY SERIVES
21ST CENTURY SERVICES
equal, not because they are, but because they SHOULD be. Is that true? The business rules to cover all the cases of different spellings, transpositions, and missing elements in your data would be far more complex than I can imagine. If this is what you're asking, TSQL might not be your best avenue to a solution. You might want to contract with a service like D&B that can rationalize these for you. If I've misunderstood, I'm sorry for distracting everyone.
The above company names would be considered a match.
In that case, T-SQL is not the right solution. Try SSIS with the fuzzy grouping operator to match similar names?

SSIS might be great de-duping smallish record sets. When you get to de-dupe 50-100 million record sets then it's simply not up to the task. So far the best performance I've had is a combination of TSQL and C# CLR TVFs

---------------------------------------------------------

It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens[/url]

Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn

Abu Dina SSChampion Points: 14155 More actions · Answer 11

Jeff Moden (1/27/2014)
I haven't checked for all of the columns but the original query had a "Hierarchy" column in it, which is missing from the test data. Considering that the triangular join on that column may be one of the larger problems, I'd recommend regenerating the test table and test data so we can actually have a go at it.
Also, do you mean to compare a.Address1 with b.Address2 for sure?

Good point Jeff but I was interested to see how they would de-dupe the 200 records. I kind of forgot about the triangular join. In the end I did as Chris M suggested and split the queries to eliminate the OR. I also forced a join hint and used an INNER LOOP JOIN instead.

Query performance improved significantly! (with the appropriate indexes of course!

... and yes the a.address1 = b.address2 is right, in the past I've missed genuine duplicates because I didn't perform cross matching like the above.

---------------------------------------------------------

It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens[/url]

Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 12

Gail's suggestion using DENSE_RANK() is well worth a try when you get time.

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Abu Dina SSChampion Points: 14155 More actions · Answer 13

ChrisM@Work (1/28/2014)
Gail's suggestion using DENSE_RANK() is well worth a try when you get time.

Is it?

So what happens if I have two records with the same Organisationname but the address1 of the first record matches the address2 of the second?

Is it possible to have multiple PARTITION BYs using the same rank?

---------------------------------------------------------

It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens[/url]

Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn

David Burrows SSC Guru Points: 65143 More actions · Answer 14

How about 2 passes, first to sort out address similarities and then second to dedupe including Organisation

Far away is close at hand in the images of elsewhere.
Anon.

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 15

Abu Dina (1/29/2014)
ChrisM@Work (1/28/2014)
Gail's suggestion using DENSE_RANK() is well worth a try when you get time.
Is it?
So what happens if I have two records with the same Organisationname but the address1 of the first record matches the address2 of the second?

This is why we asked again and again for the expected results, so that we could see what you consider a duplicate and what you don't consider a duplicate before we spent time writing code that doesn't do what you want.

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass