January 13, 2016 at 8:52 am
Hi,
I need to identify possible duplicate (not exactly duplicate row) data without primary key:
Table A:
Col1 Col2 Col3 Col4 Col5 Col6
123 ABC XYZ 12-1-2015 76 76.78
123 ABCX XYZ 12-1-2015 76 76.78
So, row 2 is kind of duplicate and need to flag it.
Possible scenario, Any of the column might have some extra characters and need to flag as duplicate rows. In above case Row 2 might be duplicate since Col2 has "ABCX" compare to Col1 "ABC".
Please let me know how to achieve this.
Thanks in advance.
January 13, 2016 at 8:58 am
monilps (1/13/2016)
Hi,I need to identify possible duplicate (not exactly duplicate row) data without primary key:
Table A:
Col1 Col2 Col3 Col4 Col5 Col6
123 ABC XYZ 12-1-2015 76 76.78
123 ABCX XYZ 12-1-2015 76 76.78
So, row 2 is kind of duplicate and need to flag it.
Possible scenario, Any of the column might have some extra characters and need to flag as duplicate rows. In above case Row 2 might be duplicate since Col2 has "ABCX" compare to Col1 "ABC".
Please let me know how to achieve this.
Thanks in advance.
So ... you need to tell us the rule for finding these duplicates. Is it:
Cols 1,3,4,5,6 exactly the same and first three characters of col2 the same?
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
January 13, 2016 at 9:09 am
Yes, So For Row 1 "ABC" is the data and for Row 2 "ABCX" is the data where "X" is additional.
NOTE: For Row 2 to flag as duplicate it should have Row 1 data plus few additional characters.
January 13, 2016 at 9:13 am
Phil is right. If this is any column, you might have a lot of work.
In general, you can use a count() on a column, with that column as a group by, to find duplicates.
CREATE TABLE a (col1 INT, col2 VARCHAR(20), col3 VARCHAR(20))
;
GO
INSERT a
VALUES
(123, 'ABC', 'XYZ'),
(123, 'ABCX', 'XYZ')
GO
SELECT col2, COUNT(col2)
FROM a
GROUP BY col2
HAVING COUNT(col2) > 1
GO
SELECT col3, COUNT(col3)
FROM a
GROUP BY col3
HAVING COUNT(col3) > 1
GO
DROP TABLE dbo.a
However, if you're looking for the beginning values in a column, that's tricky. Unless you have some idea of how many characters are duplicate, you are asking for a lot of checking.
January 13, 2016 at 9:24 am
Yeah, this is for any column.
January 13, 2016 at 9:34 am
monilps (1/13/2016)
Yeah, this is for any column.
OK, this answer contradicts the earlier one.
So ... you want to identify rows where another row exists which is 'similar'.
'Similar' here means that all of the columns bar 1 are identical. The column which is not identical matches on the first three characters. But this could be any of the columns.
Is that more accurate? If that's what you want, this gets very ugly if the number of columns is more than a handful.
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
January 13, 2016 at 9:52 am
For above example Col3 needs to check with Col3 only.
January 13, 2016 at 9:57 am
monilps (1/13/2016)
For above example Col3 needs to check with Col3 only.
I'm sorry, but your vague single-sentence responses are not providing the level of clarity required to answer such a complex requirement.
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
January 13, 2016 at 10:18 am
I know it is complex:
Please see below, hope it is more clear:
Headers Col1 Col2 Col3 Col4 ... Coln
Row1 123 ABC XYZ 12/1/2015 ... qaz
Row2 345 XCV NHT 12/2/2015 ... wsx
Row3 123 ABX XYZ 12/2/2015 ... DFG
Row4 345 XCV NHTXV 12/2/2015 ... wsx
Row5 123 ABCX XYZ 12/1/2015 ... qaz
NOTE: I need to flag Row 4 and Row 5 as possible duplicate.
In order to flag as duplicate all the data should be same except one column and that column should have same data plus additional data.
1. Need to find rows which have similar data
2. Each column data should compare with same column data.
Thanks.
January 13, 2016 at 10:20 am
Even though your data doesn't have a primary key, it should have one or more natural keys. You need to identify what those natural key might be and use those in identifying possible duplicates. It's impossible to identify potential natural keys given your vague data.
If a record has multiple natural keys, you might consider weighting the matches on each of those natural keys and then reviewing pairs of records with the highest total weighted matches.
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
January 13, 2016 at 10:39 am
Is the "base length" of the columns at least the same? That is, when you say the value is the same except chars are added, is the initial value always the same length, such a 3 bytes?
That is, could this be sample data:
col2
abc --all initial values are 3 chars?
def
ghi
defXY
But NOT this:
col2
jklm --4 chars and not 3
jklmXY
Although each column is separate, of course, so this would be fine:
col3
jklm --initial value always 4 chars for this column
opqr
stuv
jklmXY
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
January 13, 2016 at 10:48 am
This fits your example data. But is it too specific for your real-world data?
if OBJECT_ID('tempdb..#dupes') is not null
drop table #dupes;
create table #dupes ( Id int identity (1,1) primary key clustered
, Col1 int
, Col2 Varchar(10)
, Col3 Varchar(10)
, Col4 Date
, Col5 Varchar(10) )
insert #dupes ( Col1, Col2, Col3, Col4, Col5 )
values ( 123, 'ABC', 'XYZ', '20150112', 'qaz' )
, ( 345, 'XCV', 'NHT', '20150212', 'wsx' )
, ( 123, 'ABX', 'XYZ', '20150212', 'DFG' )
, ( 345, 'XCV', 'NHTXV', '20150212', 'wsx' )
, ( 123, 'ABCX', 'XYZ', '20150112', 'qaz' )
select Col1
, Col2 = left(Col2,3)
, Col3
, Col4
, Col5
from #dupes
group by Col1
, left(Col2,3)
, Col3
, Col4
, Col5
having count(1) > 1
union all
select Col1
, Col2
, Col3 = left(Col3,3)
, Col4
, Col5
from #dupes
group by Col1
, Col2
, left(Col3,3)
, Col4
, Col5
having count(1) > 1
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
January 13, 2016 at 11:35 am
Hey Phil,
The query is close by what I am looking for but two things:
- Query is only resulting back original rows which has similar duplicate data.
- What if data is more than 3 characters in the real world which we need to compare.
Thanks for your help.
January 13, 2016 at 11:49 am
monilps (1/13/2016)
Hey Phil,The query is close by what I am looking for but two things:
- Query is only resulting back original rows which has similar duplicate data.
- What if data is more than 3 characters in the real world which we need to compare.
Thanks for your help.
- What do you want to see?
- No problem. Modify the LEFT(string, [number of characters]) construction, where [number of characters] is the number you wish to compare.
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
January 13, 2016 at 12:08 pm
Header Col1 Col2 Col3 Col4 ... Coln IsDuplicate
Row 1 123 ABC XYZ 12/1/2015 ... qaz No
Row 2 345 XCVE NHTX 12/2/2015 ... wsx No
Row 3 123 ABX XYZ 12/2/2015 ... DFG No
Row 4 345 XCVE NHTXVX 12/2/2015 ... wsx Yes
Row 5 123 ABCX XYZ 12/1/2015 ... qaz Yes
Viewing 15 posts - 1 through 15 (of 19 total)
You must be logged in to reply to this topic. Login to reply