August 10, 2011 at 1:31 am
What I am looking for is a simple way of querying whether an entire row is duplicated,
i.e.
for this table, are there two rows where every single column value is identical between those two rows?
Is there any simple syntax where the user does not have to explicitly list all of the table's columns?
I know of this tightly coupled, laborious way:
select a,b,c,d,...............................................zzz, count(*)
from tbltest
group by a,b,c,d,.......................................zzz
The reason I ask for something without column names is that I want to create a loop driven by object_id, and it cannot be tightly coupled to each table if it is to run against the entire database.
Or, at the least, I'd like to just swap in a new table name without having to reconstruct the query for each table in the entire DB.
About 250 tables in total.
August 10, 2011 at 1:38 am
August 10, 2011 at 1:59 am
Thanks for the response.
Yep, as I showed in my query, that is how it is done: a particular query for a particular table.
What I am looking for is a way to query a table without having to hard-code column names into the query and laboriously list them all.
This should allow the user to replace only the table name (as could be done in a loop).
Specifically, I would like to avoid rebuilding the SELECT and GROUP BY each time for a new table.
Imagine writing 250 unique queries for 250 unique tables just to see which of them contain duplicate rows.
That is what I want to avoid.
Some tables have 501 columns 🙁
August 10, 2011 at 2:15 am
sp_msforeachtable '
with cte as (
    select row_number() over (partition by CHECKSUM(*) order by CHECKSUM(*)) as rowcount1, *
    from ?
)
select * from cte
where rowcount1 > 1'
I think this should work for you; however, there are limitations to using CHECKSUM on tables with a very large number of rows. Please look this up in MSDN before you run this. Also, depending on the table and data, there could be a performance hit.
August 10, 2011 at 7:31 am
August 10, 2011 at 6:14 pm
Gosh... I wouldn't use CHECKSUM for such a thing. It's nothing but a simple Exclusive OR and getting duplicate CHECKSUM values for different data is much more likely than folks would think. My recommendation would be to build a little dynamic SQL that would write the classic GROUP BY method for finding duplicate rows for each table.
As a sidebar, I obviously don't know the condition of your tables but, at the risk of sounding a bit like Celko: unless it's a staging table, you have a much bigger problem if you have identical rows in your tables.
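To illustrate how easily CHECKSUM collides (a quick sketch; the literal values are just examples): CHECKSUM on strings respects the column collation, so under the default case-insensitive collation two visibly different strings hash identically.

```sql
-- Under a case-insensitive collation, these two different strings
-- produce the same CHECKSUM value, so a CHECKSUM(*)-based duplicate
-- check would wrongly treat the rows as identical:
SELECT CHECKSUM(N'duplicate') AS c1,
       CHECKSUM(N'DUPLICATE') AS c2;
```

BINARY_CHECKSUM avoids the collation issue but is still a weak hash with easy collisions, which is why the GROUP BY-on-all-columns approach is the safe one.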
--Jeff Moden
Change is inevitable... Change for the better is not.
August 10, 2011 at 8:10 pm
It's very easy to construct an SQL string containing the column list from system views and execute it.
For example, if your database were called "playpen", this stored procedure would do the job (call it as
exec playpen.dbo.finddups 'sch', 'table' to list the rows occurring multiple times in the table playpen.sch.table):
use playpen
go
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.finddups') AND type in (N'P', N'PC'))
DROP PROC dbo.finddups
go
create procedure dbo.finddups
@schema sysname,
@table sysname
as
declare @sql varchar(max) = ''
select @sql = @sql+','+C.name from playpen.sys.columns C
inner join
(select name, object_id, schema_id from playpen.sys.tables where name = @table) T
on T.object_id = C.object_id
inner join (select name, schema_id from playpen.sys.schemas where name = @schema) S
on S.schema_id = T.schema_id
option (MAXDOP 1)
set @sql = 'select count(*), * from '+@schema+'.'+@table+
' group by '+SUBSTRING(@sql,2,len(@sql))+' having count(*) > 1'
exec(@sql)
go
Obviously if you wanted to handle all tables in one call the table name and schema would be output instead of input parameters, and the thing could loop over sys.tables. The code above is not what you would use - but it shows how to do it.
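A minimal sketch of that loop (the cursor name is illustrative, and it assumes dbo.finddups from above has been created):

```sql
-- Loop over every user table and call finddups for each one.
declare @schema sysname, @table sysname;
declare tabs cursor local fast_forward for
    select s.name, t.name
    from sys.tables t
    inner join sys.schemas s on s.schema_id = t.schema_id;
open tabs;
fetch next from tabs into @schema, @table;
while @@FETCH_STATUS = 0
begin
    print 'Checking ' + @schema + '.' + @table;
    exec dbo.finddups @schema, @table;
    fetch next from tabs into @schema, @table;
end
close tabs;
deallocate tabs;
```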
Tom
August 10, 2011 at 9:01 pm
My only concern would be if any of the tables contained columns defined as text or ntext.
August 10, 2011 at 9:31 pm
Lynn Pettis (8/10/2011)
My only concern would be if any of the tables contained columns defined as text or ntext.
Well spotted, Lynn! If there are such columns, spotting duplicates is much more difficult.
In fact it's quite nasty. One can exclude those columns from the grouping and eliminate some of the rows from the possible-duplicates list that way (anything unique on the remaining columns is unique on the whole set). But CHECKSUM (or BINARY_CHECKSUM) can't be used to eliminate more rows because they don't support text/ntext, so I guess the next step would be to use HASHBYTES on initial substrings of the long columns and hope that eliminates some more rows, and then write code to compare the things byte by agonising byte to eliminate any remaining false positives. I wouldn't want to have to do that. No indeed. Too much like hard work, and I can imagine that running it would be tedious, somewhere between watching paint dry and watching grass grow, if the earlier steps had eliminated almost all rows as possible duplicates.
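A rough sketch of that HASHBYTES step (the table and column names here are made up; SUBSTRING on a text column returns varchar, which HASHBYTES accepts):

```sql
-- Hypothetical table with a legacy text column.
-- Group on the hash of the first 8000 characters of the text column
-- alongside the ordinary columns. Matching hashes are only *candidate*
-- duplicates and would still need a byte-by-byte comparison.
select col1, col2,
       HASHBYTES('SHA1', SUBSTRING(bigtextcol, 1, 8000)) as texthash,
       count(*) as cnt
from dbo.sometable
group by col1, col2, HASHBYTES('SHA1', SUBSTRING(bigtextcol, 1, 8000))
having count(*) > 1;
```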
Tom
August 11, 2011 at 2:26 am
Thanks Tom
This is the direction I wanted to go in.
At worst, I would then replace table name and run your dynamic query together with information schema.
At best, I can use your dynamic query with a sp_MSforeachtable.
Off to play around with your suggestion!! :w00t:
August 11, 2011 at 2:54 am
a method i've always used and works for me is this:
with t (rn, col1, col2) as
(
select
rn= row_number() over (partition by col1, col2, etc order by col1, col2, etc),
col1, col2, etc
from tablename
)
select * from t where rn > 1
--delete t where rn > 1
This is especially useful if you don't have any unique identifier on the row and need to remove any duplicates.
_________________________________________________________________________________
SQLGeordie
Web:- Jarrin Consultancy
Blog:- www.chrisjarrintaylor.co.uk
Twitter:- @SQLGeordie
August 11, 2011 at 3:02 am
Thanks Tom
Only really had to add
SET @sql = ''
select @sql = @sql+','+'['+C.name+']' from sys.columns C
to take care of columns with spaces in their names, and now I'm good to go.
I have checked the results against tables which I had already drudged through manually.
So, for anyone who needs this, here is Tom's solution, which should work immediately.
You need to pass the schema and table name as the parameters.
Use MyDataBase_Here
Go
create procedure dbo.finddups
@schema sysname,
@table sysname
as
declare @sql varchar(max)
SET @sql = ''
select @sql = @sql+','+'['+C.name+']' from sys.columns C
inner join
(select name, object_id, schema_id from sys.tables where name = @table) T
on T.object_id = C.object_id
inner join (select name, schema_id from sys.schemas where name = @schema) S
on S.schema_id = T.schema_id
option (MAXDOP 1)
set @sql = 'select count(*),* from '+@schema+'.'+@table+
' group by '+SUBSTRING(@sql,2,len(@sql))+' having count(*) > 1'
exec(@sql)
GO
August 11, 2011 at 3:53 am
In addition to Tom's solution, here is a modification that excludes the table's identity column from the comparison, so rows that differ only in the identity value are still reported as duplicates.
create procedure dbo.finddups
@schema sysname,
@table sysname
as
declare @sql varchar(max)
SET @sql = ''
select @sql = @sql+','+'['+C.name+']' from sys.columns C
inner join
(select name, object_id, schema_id from sys.tables where name = @table) T
on T.object_id = C.object_id AND c.is_identity = 0
inner join (select name, schema_id from sys.schemas where name = @schema) S
on S.schema_id = T.schema_id
option (MAXDOP 1)
set @sql = 'select count(*),'+SUBSTRING(@sql,2,len(@sql))+' from '+@schema+'.'+@table+
' group by '+SUBSTRING(@sql,2,len(@sql))+' having count(*) > 1'
exec(@sql)
GO
August 11, 2011 at 8:16 am
ChrisTaylor (8/11/2011)
a method i've always used and works for me is this:
with t (rn, col1, col2) as
(
select
rn= row_number() over (partition by col1, col2, etc order by col1, col2, etc),
col1, col2, etc
from tablename
)
select * from t where rn > 1
--delete t where rn > 1
This is especially useful if you don't have any unique identifier on the row and need to remove any duplicates.
I would try this with ORDER BY (SELECT NULL) instead of ordering by each column. I wonder a) whether you CAN do that and b) whether it makes a difference. I would expect it would, but I could be wrong, or the benefit may be minimal.
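For reference, the variant described above looks like this (column and table names are placeholders). ORDER BY (SELECT NULL) is legal inside an OVER clause and signals that no particular ordering is required within each partition:

```sql
-- Same duplicate-detection CTE, but without forcing a sort order
-- within each partition of identical rows.
with t as
(
    select rn = row_number() over (partition by col1, col2
                                   order by (select null)),
           col1, col2
    from dbo.tablename
)
select * from t where rn > 1;
--delete t where rn > 1;
```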
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
August 11, 2011 at 3:28 pm
MadTester (8/11/2011)
In addition to Tom's solution, there is another modification here to return rows where the Identity for the table has been excluded from the comparison.
Nice change. You could also do something similar to exclude any columns that are in a key constraint.
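A sketch of that idea, assuming you want to exclude any column that participates in a primary key or unique constraint (untested; the filter is written as a NOT EXISTS against sys.index_columns):

```sql
-- Inside finddups, build the column list while skipping identity
-- columns and any column that belongs to a primary key or unique
-- constraint on the table:
select @sql = @sql + ',' + '[' + C.name + ']'
from sys.columns C
inner join sys.tables T
    on T.object_id = C.object_id and T.name = @table
inner join sys.schemas S
    on S.schema_id = T.schema_id and S.name = @schema
where C.is_identity = 0
  and not exists (
      select 1
      from sys.indexes i
      inner join sys.index_columns ic
          on ic.object_id = i.object_id and ic.index_id = i.index_id
      where i.object_id = C.object_id
        and ic.column_id = C.column_id
        and (i.is_primary_key = 1 or i.is_unique_constraint = 1)
  );
```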