December 19, 2019 at 7:24 pm
Hi,
I might be asking too much here (in terms of code), but I need to find duplicates, but dynamically. Meaning, let's say I have a DB with 50 tables. I need SQL to scan every table and do a HAVING COUNT(*) > 1, without me ever entering the table or column names. I know this involves the metadata views, INFORMATION_SCHEMA type stuff, but beyond that I don't know. I know I need to loop through the tables but haven't a clue how, as I'm not advanced at SQL. We're talking about duplicate rows, not individual values in individual columns.
Thanks!
December 19, 2019 at 7:48 pm
SELECT CONCAT('
SELECT ',x.c, '
FROM ',QUOTENAME(t.TABLE_SCHEMA), '.', QUOTENAME(t.TABLE_NAME),'
GROUP BY ', x.c, '
HAVING COUNT(*) > 1;')
FROM INFORMATION_SCHEMA.TABLES t
CROSS APPLY (VALUES(STUFF((SELECT ',' + QUOTENAME(c.COLUMN_NAME)
FROM INFORMATION_SCHEMA.COLUMNS c
WHERE c.TABLE_SCHEMA = t.TABLE_SCHEMA
AND c.TABLE_NAME = t.TABLE_NAME
ORDER BY c.ORDINAL_POSITION ASC
FOR XML PATH (''), TYPE).value('.', 'nvarchar(max)'),1,1,''))) x(C)
GO
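Note that the script above only generates the statements; you still have to run its output. A sketch of one way to execute everything in a single pass is to collect the generated SQL into a variable for sp_executesql. The row-by-row variable concatenation is a common but formally undocumented T-SQL idiom, and the TABLE_TYPE filter is an addition here, since INFORMATION_SCHEMA.TABLES also lists views:

DECLARE @sql nvarchar(max) = N'';

-- Same generator as above, but accumulating every statement into @sql.
SELECT @sql = @sql + CONCAT('
SELECT ', x.c, '
FROM ', QUOTENAME(t.TABLE_SCHEMA), '.', QUOTENAME(t.TABLE_NAME), '
GROUP BY ', x.c, '
HAVING COUNT(*) > 1;')
FROM INFORMATION_SCHEMA.TABLES t
CROSS APPLY (VALUES(STUFF((SELECT ',' + QUOTENAME(c.COLUMN_NAME)
                           FROM INFORMATION_SCHEMA.COLUMNS c
                           WHERE c.TABLE_SCHEMA = t.TABLE_SCHEMA
                             AND c.TABLE_NAME = t.TABLE_NAME
                           ORDER BY c.ORDINAL_POSITION ASC
                           FOR XML PATH (''), TYPE).value('.', 'nvarchar(max)'),1,1,''))) x(c)
WHERE t.TABLE_TYPE = 'BASE TABLE'; -- skip views

EXEC sys.sp_executesql @sql; -- one result set per table (empty when no duplicates)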
December 19, 2019 at 7:50 pm
Why do you need to do this? Any table with a unique index on it, even a non-clustered one, will not have duplicates in it. The same goes for any table with a PK, because the PK is enforced with a unique index. I agree that the PK or unique index would make even otherwise-duplicate rows unique, and those could be candidates for de-duping, but why do you think you need to check for duplicates in every table of a database?
--Jeff Moden
Change is inevitable... Change for the better is not.
December 19, 2019 at 8:20 pm
Well, the PK alone doesn't indicate real-world duplicates. You can create an index column of sorts, but that doesn't solve potential duplicates, and even where there is a truly unique PK, that still doesn't prevent duplicates. I have a warehouse that ingests data from over 20 sources on a daily basis and need to make sure the data coming in doesn't have duplicates.
December 19, 2019 at 8:23 pm
So it's for staging tables on your database, not your application tables?
December 19, 2019 at 8:23 pm
Hi, thanks so much. Just one question: what does the x.c in the SELECT statement refer to?
December 19, 2019 at 8:24 pm
You would think so, and maybe it will be, but for now, because this hasn't been implemented yet (stupidly), it is on prod tables. There probably aren't many duplicates, if any, but the check makes it official and removes any speculation.
December 19, 2019 at 8:38 pm
So I ran that and it worked, but I'm not sure how to interpret the results. First, is there a way to simply spit out the metadata, such that when it finds a duplicate it gives me the table name, the count, and the record numbers?
Something like this:
Table X | 14-22 | 8
And this is a sample record of the output from your code (it looks like it got cut off):
SELECT [assetTag],[brandName],[contractDEnd],[contractRef],[dtDepreciation],[dtInvent],[dtLastScan],[enrichedTcpIpHostname],[itServiceArea],[locBarCode],[locFullName],[locFacility],[modelFullName],[modelName],[natureName],[operatingSystem],[portfolioCode],[regionCode],[room],[seAssignment],[serverType],[tcpIpDomain],[tcpIpHostname],[translatedAssignment],[viClusterName],[REF_DATE],[ETL_DATE],[RunSourceID] FROM [dbo].[HPAM_CB_SERVERS] GROUP BY [assetTag],[brandName],[contractDEnd],[contractRef],[dtDepreciation],[dtInvent],[dtLastScan],[enrichedTcpIpHostname],HAVING COUNT(*) > 1;
December 19, 2019 at 11:13 pm
SELECT CONCAT('
SELECT ''', QUOTENAME(t.TABLE_SCHEMA), '.', QUOTENAME(t.TABLE_NAME), ''' [TABLE_NAME], COUNT(*) COUNT
FROM (SELECT ',x.c, '
FROM ',QUOTENAME(t.TABLE_SCHEMA), '.', QUOTENAME(t.TABLE_NAME),'
GROUP BY ', x.c, '
HAVING COUNT(*) > 1) y;') SQL
FROM INFORMATION_SCHEMA.TABLES t
CROSS APPLY (VALUES (STUFF((SELECT ',' + QUOTENAME(c.COLUMN_NAME)
FROM INFORMATION_SCHEMA.COLUMNS c
WHERE c.TABLE_SCHEMA = t.TABLE_SCHEMA
AND c.TABLE_NAME = t.TABLE_NAME
ORDER BY c.ORDINAL_POSITION ASC
FOR XML PATH (''), TYPE).value('.', 'nvarchar(max)'),1,1,''))) x(C)
GO
I don't know what "14-22" means in your request.
You will get it better formatted if you go into the menu option Query/Options and select "Retain CR/LF on copy or save"
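To get a single result set with the table name and duplicate count (rather than one result set per table), a sketch of one approach, with the #dups temp table invented here for illustration, is to have each generated statement INSERT into a temp table and then query it once at the end:

IF OBJECT_ID('tempdb..#dups') IS NOT NULL DROP TABLE #dups;
CREATE TABLE #dups (TABLE_NAME nvarchar(300) NOT NULL, DUP_GROUPS int NOT NULL);

DECLARE @sql nvarchar(max) = N'';

SELECT @sql = @sql + CONCAT('
INSERT #dups (TABLE_NAME, DUP_GROUPS)
SELECT ''', QUOTENAME(t.TABLE_SCHEMA), '.', QUOTENAME(t.TABLE_NAME), ''', COUNT(*)
FROM (SELECT ', x.c, '
FROM ', QUOTENAME(t.TABLE_SCHEMA), '.', QUOTENAME(t.TABLE_NAME), '
GROUP BY ', x.c, '
HAVING COUNT(*) > 1) y;')
FROM INFORMATION_SCHEMA.TABLES t
CROSS APPLY (VALUES (STUFF((SELECT ',' + QUOTENAME(c.COLUMN_NAME)
                            FROM INFORMATION_SCHEMA.COLUMNS c
                            WHERE c.TABLE_SCHEMA = t.TABLE_SCHEMA
                              AND c.TABLE_NAME = t.TABLE_NAME
                            ORDER BY c.ORDINAL_POSITION ASC
                            FOR XML PATH (''), TYPE).value('.', 'nvarchar(max)'),1,1,''))) x(c)
WHERE t.TABLE_TYPE = 'BASE TABLE';

EXEC sys.sp_executesql @sql;

-- DUP_GROUPS counts duplicate *groups*, not total duplicate rows.
SELECT TABLE_NAME, DUP_GROUPS
FROM #dups
WHERE DUP_GROUPS > 0
ORDER BY DUP_GROUPS DESC;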
December 19, 2019 at 11:16 pm
Well, the PK alone doesn't indicate real-world duplicates. You can create an index column of sorts, but that doesn't solve potential duplicates, and even where there is a truly unique PK, that still doesn't prevent duplicates. I have a warehouse that ingests data from over 20 sources on a daily basis and need to make sure the data coming in doesn't have duplicates.
Yep... I pointed out that the PK alone, especially if it's a surrogate key, doesn't do squat in the real world of duplicates, and that's why I asked the question of why you want to do this. It sounds like you have a great reason to do what you've asked about, but because you have a valid reason, you've also indicated a very real problem that ultimately needs to be repaired. That's what I was really getting at. You shouldn't have to do what you're asking to do.
Of course, based on your comment above, you already know that and that's what I wanted to be sure of.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 19, 2019 at 11:22 pm
SELECT CONCAT('
SELECT ''', QUOTENAME(t.TABLE_SCHEMA), '.', QUOTENAME(t.TABLE_NAME), ''' [TABLE_NAME], COUNT(*) COUNT
FROM (SELECT ',x.c, '
FROM ',QUOTENAME(t.TABLE_SCHEMA), '.', QUOTENAME(t.TABLE_NAME),'
GROUP BY ', x.c, '
HAVING COUNT(*) > 1) y;') SQL
FROM INFORMATION_SCHEMA.TABLES t
CROSS APPLY (VALUES (STUFF((SELECT ',' + QUOTENAME(c.COLUMN_NAME)
FROM INFORMATION_SCHEMA.COLUMNS c
WHERE c.TABLE_SCHEMA = t.TABLE_SCHEMA
AND c.TABLE_NAME = t.TABLE_NAME
ORDER BY c.ORDINAL_POSITION ASC
FOR XML PATH (''), TYPE).value('.', 'nvarchar(max)'),1,1,''))) x(C)
GO
I don't know what "14-22" means in your request.
You will get it better formatted if you go into the menu option Query/Options and select "Retain CR/LF on copy or save"
That looks pretty good. We went one step further (you're more than halfway there) and it turned out to be nasty fast.
Instead of making a monster wide column through concatenation, consider using the same technique to make a hashbytes column and then compare the hashbytes.
Of course, that recommendation also comes with the warning that if the hashbytes are different, then the rows are guaranteed to be different. However (and you know this, but I have to say it out loud for others that may read this), if the hashbytes are the same, there is no guarantee that the rows are identical because of the collisions that can occur within any hashing algorithm. After you've examined the true differences with hashbytes, then you have to do the kind of check you're doing to prove they're identical.
The extra bit of complexity may not be worth it if all you have are short and narrow tables but it's a god-send of performance for the big stuff.
--Jeff Moden
Change is inevitable... Change for the better is not.
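A minimal sketch of the hashbytes comparison described above, assuming SQL Server 2017+ (for CONCAT_WS; 2016 lifted the old 8000-byte input limit on HASHBYTES) and a hypothetical dbo.SomeWideTable:

-- Hash each row once, then compare the short hashes instead of the wide
-- concatenated column.
SELECT row_hash, COUNT(*) AS rows_sharing_hash
FROM (SELECT HASHBYTES('SHA2_256',
                       CONCAT_WS(N'|', assetTag, brandName, serverType)) AS row_hash
      FROM dbo.SomeWideTable) h
GROUP BY row_hash
HAVING COUNT(*) > 1;
-- Caveat: CONCAT_WS silently skips NULL arguments, so NULL handling needs
-- care (see the question at the end of the thread).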
December 20, 2019 at 12:29 am
Of course, that recommendation also comes with the warning that if the hashbytes are different, then the rows are guaranteed to be different. However (and you know this, but I have to say it out loud for others that may read this), if the hashbytes are the same, there is no guarantee that the rows are identical because of the collisions that can occur within any hashing algorithm. After you've examined the true differences with hashbytes, then you have to do the kind of check you're doing to prove they're identical.
If you use SHA2_256 or SHA2_512 the probability of a hash collision for any two different rows in a table is as good as zero. For example, if you are using SHA2_256 and have 4.8×10^(29) rows in your table then there is a probability of 10^(-18) that there will be at least one clash. These numbers are way outside the realm of the possible. If you are using SHA2_512 then it's even more unlikely, you need to have 1.6×10^(68) rows in your table for there to be a 10^(-18) probability of there being at least one clash.
https://en.wikipedia.org/wiki/Birthday_problem#Probability_table
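For reference, those figures follow from the standard birthday-bound approximation: with k rows and an n-bit hash, the chance of at least one collision is roughly

p ≈ 1 − e^(−k^2 / 2^(n+1)) ≈ k^2 / 2^(n+1)

With n = 256 and k = 4.8×10^(29):

p ≈ (4.8×10^(29))^2 / 2^(257) ≈ 2.3×10^(59) / 2.3×10^(77) = 10^(−18)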
December 20, 2019 at 1:42 pm
Jeff Moden wrote: Of course, that recommendation also comes with the warning that if the hashbytes are different, then the rows are guaranteed to be different. However (and you know this, but I have to say it out loud for others that may read this), if the hashbytes are the same, there is no guarantee that the rows are identical because of the collisions that can occur within any hashing algorithm. After you've examined the true differences with hashbytes, then you have to do the kind of check you're doing to prove they're identical.
If you use SHA2_256 or SHA2_512 the probability of a hash collision for any two different rows in a table is as good as zero. For example, if you are using SHA2_256 and have 4.8×10^(29) rows in your table then there is a probability of 10^(-18) that there will be at least one clash. These numbers are way outside the realm of the possible. If you are using SHA2_512 then it's even more unlikely, you need to have 1.6×10^(68) rows in your table for there to be a 10^(-18) probability of there being at least one clash.
https://en.wikipedia.org/wiki/Birthday_problem#Probability_table
Yes. Totally agreed about the probability. If someone wants to trust that ultra-low probability, it'll probably never be wrong, but... it's not a guarantee, and I just want people to know that.
To wit, I also use unique constraints on random GUID columns that are supposed to contain unique values in SQL Server, even though the probability of a dupe is also incredibly low, because they're actually Type 4 UUIDs, which are not guaranteed to be globally unique.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 20, 2019 at 3:49 pm
Jonathan AC Roberts wrote: Jeff Moden wrote: Of course, that recommendation also comes with the warning that if the hashbytes are different, then the rows are guaranteed to be different. However (and you know this, but I have to say it out loud for others that may read this), if the hashbytes are the same, there is no guarantee that the rows are identical because of the collisions that can occur within any hashing algorithm. After you've examined the true differences with hashbytes, then you have to do the kind of check you're doing to prove they're identical.
If you use SHA2_256 or SHA2_512 the probability of a hash collision for any two different rows in a table is as good as zero. For example, if you are using SHA2_256 and have 4.8×10^(29) rows in your table then there is a probability of 10^(-18) that there will be at least one clash. These numbers are way outside the realm of the possible. If you are using SHA2_512 then it's even more unlikely, you need to have 1.6×10^(68) rows in your table for there to be a 10^(-18) probability of there being at least one clash.
https://en.wikipedia.org/wiki/Birthday_problem#Probability_table
Yes. Totally agreed about the probability. If someone wants to trust that ultra-low probability, it'll probably never be wrong, but... it's not a guarantee, and I just want people to know that.
To wit, I also use unique constraints on random GUID columns that are supposed to contain unique values in SQL Server, even though the probability of a dupe is also incredibly low, because they're actually Type 4 UUIDs, which are not guaranteed to be globally unique.
How do you hash all the columns together in a row? What do you do about null values in columns?
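One common pattern (a sketch only, with hypothetical table and column names, since the thread doesn't answer this directly): convert every column to text explicitly, map NULL to a sentinel the data can't contain, and put a delimiter between columns so values can't bleed into each other, e.g. ('ab', 'c') versus ('a', 'bc'):

SELECT HASHBYTES('SHA2_256',
           CONCAT(ISNULL(CONVERT(nvarchar(max), assetTag), N'{null}'), N'|',
                  ISNULL(CONVERT(nvarchar(max), brandName), N'{null}'), N'|',
                  -- give date/time (and float) columns an explicit style so the
                  -- text form is deterministic, e.g. ISO 8601 = style 126
                  ISNULL(CONVERT(nvarchar(30), dtLastScan, 126), N'{null}'))) AS row_hash
FROM dbo.SomeWideTable;

The sentinel keeps NULL distinct from an empty string, and the delimiter keeps adjacent columns from running together; without both, different rows can produce identical input strings before the hash is even computed.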