filter duplicate students via T-SQL

Question

filter duplicate students via T-SQL

kevin_nikolai

SSCarpal Tunnel

Points: 4343
More actions
April 19, 2013 at 6:52 am

#276692

/*
I need to filter for duplicate students:
---------------------------------------
For same student, if one or both instance/s of IDNo is null then do the following:
1.
if both instances of Tel3 are identical for same Student (Alison), if they are
use it to populate field Sort with the value, 1111.
2.
if both instances of Tel3 are not identical for same Student (Carmen), check if both instances of Tel1 are identical for same Student, if they are use it to populate field Sort with the value, 8888.
3.
if both instances of Tel3 and Tel1 are not identical for same Student (Amber), check if an instance of Tel3 & Tel1 are identical for same Student, if they are use it to populate field Sort with the value, 7777.
Since there is no match between 1st row and 2nd row (except for the name) we can't say it's the
same user for sure, so don't populate the field Sort with any value where row = ( 'Amber' as Student, NULL as IDNo, '5555' as Tel3, '6666' as Tel1 ).
*/
SELECT 'Alison' as Student, NULL as IDNo, '1111' as Tel3, '4444' as Tel1, NULL as Sort UNION ALL
SELECT 'Alison' as Student, '80405088' as IDNo, '1111' as Tel3, '4444' as Tel1, NULL as Sort UNION ALL
SELECT 'Amber' as Student, NULL as IDNo, '5555' as Tel3, '6666' as Tel1, NULL as Sort UNION ALL
SELECT 'Amber' as Student, NULL as IDNo, '7777' as Tel3, '7777' as Tel1, NULL as Sort UNION ALL
SELECT 'Carmen' as Student, NULL as IDNo, '8888' as Tel3, '8888' as Tel1, NULL as Sort UNION ALL
SELECT 'Carmen' as Student, '20083085' as IDNo, '9999' as Tel3, '8888' as Tel1, NULL as Sort
***
Apologies for duplicate posting, when trying to post initially I got DNS errors from Internet Explorer, so re-posted, again DNS errors, etc. After the 3rd attempt my post was successful, only to notice more entries posted.
***

Viewing 15 posts - 1 through 15 (of 23 total)

You must be logged in to reply to this topic. Login to reply

Sean Lange SSC Guru Points: 286590 More actions · Answer 1

This would be a lot easier with ddl (create table statements) and sample data (insert statements) along with the desired output based on the sample data. Please take a few minutes and read the first article in my signature for best practices when posting questions.

_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

sdhanpaul Right there with Babe Points: 787 More actions · Answer 2

SELECT EmpPin, trxdate, trxtime, COUNT(*) AS dupes

FROM dbo.rawtrx

GROUP BY EmpPin, trxdate, trxtime

HAVING (COUNT(*) > 1)

play around with this query

sdhanpaul Right there with Babe Points: 787 More actions · Answer 3

okay, after reading further your questions and scenarios... i doubt that my last post would work...

how many times do you have the same student recurring? do you ever have a situation like this?

kyle null 1111 1111

kyle null 1111 1112

kyle 111 1111 1111

kevin_nikolai SSCarpal Tunnel Points: 4343 More actions · Answer 4

sdhanpaul,

Prior to removing duplicate instance of user, I first want to get list of duplicate users, to do that I need to filter them and make sure they really are duplicates.

I can filter them by IDNo - unfortunately more than half of the users with same name (firstname + surname) don't have IDNo, so I am forced to check TelephoneNo3 and TelephoneNo1 fields to try and identify if they really are the same user/s. As in my example above, the first row for user Alison is not identical to the second row for user Alison, same applied to the other users. So using COUNT(*) and HAVING (COUNT(*) > 1) clause won't do the job.

sdhanpaul Right there with Babe Points: 787 More actions · Answer 5

how about if you join the fields?

Try this:

WITH CTE (emppin,trxdate, trxtime, DuplicateCount)

AS

(

SELECT emppin,trxdate, trxtime,

ROW_NUMBER() OVER(PARTITION BY emppin,trxdate, trxtime ORDER BY emppin) AS DuplicateCount

FROM dbo.rawtrx

)

DELETE

FROM CTE

WHERE DuplicateCount > 1

you need to replace the fields etc

this however deletes the duplicates... im really not much of a bottle feeder so you need to do the modifications.

KoldCoffee SSCoach Points: 18971 More actions · Answer 6

I would like to follow this, but I don't understand the sentence "if they are use it to populate field Sort with the value". I don't see the Sort column populated at all for any of the examples.

Did the replies here already help you?

Sean Lange SSC Guru Points: 286590 More actions · Answer 7

Notice how people just keep taking random attempts at helping you. If you would take a few minutes and post ddl, sample data and desired output you would have a tested answer quickly. This really isn't too difficult but without something to work with it is just guessing.

_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

Jan Van der Eecken SSCrazy Eights Points: 9029 More actions · Answer 8

Apart from what Sean said (i.e. the missing DDL, which is basically table create statements plus some INSERT statements of sample data to create your scenario), I assume you also have a Tel2 column in that table? Not a good idea, since it makes your database design incredibly inflexible. Read up on Normalization (start with this Wikipedia article or read through Tom Thomson's really great series on Normalization on this site. By the way, is this an exam question?

--------------------------------------------------------------------------
A little knowledge is a dangerous thing (Alexander Pope)
In order for us to help you as efficiently as possible, please read this before posting (courtesy of Jeff Moden)[/url]

Sergiy SSC Guru Points: 110208 More actions · Answer 9

First, you need to have some kind of ID on the records in the table, otherwise each record will be always matched to itself.

Assuming, you have such ID here is a test table (which should be really provided by you in the initial post):

--DROP TABLE #Student

SELECT IDENTITY(int, 1,1) RowID, *

INTO #Student

FROM (

SELECT 'Alison' as Student, NULL as IDNo, '1111' as Tel3, '4444' as Tel1, NULL as Sort UNION ALL

SELECT 'Alison' as Student, '80405088' as IDNo, '1111' as Tel3, '4444' as Tel1, NULL as Sort UNION ALL

SELECT 'Amber' as Student, NULL as IDNo, '5555' as Tel3, '6666' as Tel1, NULL as Sort UNION ALL

SELECT 'Amber' as Student, NULL as IDNo, '7777' as Tel3, '7777' as Tel1, NULL as Sort UNION ALL

SELECT 'Carmen' as Student, NULL as IDNo, '8888' as Tel3, '8888' as Tel1, NULL as Sort UNION ALL

SELECT 'Carmen' as Student, '20083085' as IDNo, '9999' as Tel3, '8888' as Tel1, NULL as Sort

) DT (Student, IDNo, Tel3, Tel1, Sort)

And here is the query returning matching pairs with appropriate Sort value:

SELECT T1.Student, T1.RowID, T1.IDNo, T1.Tel3, T1.Tel1,

CASE

WHEN T2.Tel3 = T1.Tel3 THEN 1111

WHEN T2.Tel1 = T1.Tel1 THEN 8888

WHEN T2.Tel1 = T1.Tel3 OR T2.Tel3 = T1.Tel1 THEN 7777

END Sort

, T2.RowID,T2.IDNo, T2.Tel3, T2.Tel1

FROM #Student T1

INNER JOIN #Student T2 ON T2.Student = T1.Student AND T2.RowID <> T1.RowID AND T2.IDNo IS NULL AND (

T2.Tel3 = T1.Tel3

OR T2.Tel1 = T1.Tel1

OR T2.Tel1 = T1.Tel3 OR T2.Tel3 = T1.Tel1

)

Assuming that it's the lines with IDNo which must be marked as duplicates, I apply the Sort status to T2 entries:

UPDATE T2

SET Sort = CASE

WHEN T2.Tel3 = T1.Tel3 THEN 1111

WHEN T2.Tel1 = T1.Tel1 THEN 8888

WHEN T2.Tel1 = T1.Tel3 OR T2.Tel3 = T1.Tel1 THEN 7777

END

FROM ..

Is it close to what you need?

_____________
Code for TallyGenerator

kevin_nikolai SSCarpal Tunnel Points: 4343 More actions · Answer 10

Hi, perhaps I did not explain myself clearly, so I am trying to fix that.

I need to search a table with user records, find only those users that occur more than once.

I could just use the username (Student) to find the duplicates, but unfortunately half of their IDNo's are NULL, so I it's you cannot say they are duplicates. The only other way to check if they are duplicates is to check Tel3 and Tel1 (to try and check if they are duplicates).

Since are few forum users are asking for ddl, below I have provided it. It works to a certain extent.

And NO, this is not for an exam, interview questionnaire, study topic, etc.

CREATE TABLE #Student (

[Student] [varchar](30) NULL,

[IDNo] [varchar](13) NULL,

[Tel3] [varchar](12) NULL,

[Tel1] [varchar](12) NULL,

[Sort] [varchar](12) NULL

)

INSERT INTO #Student

SELECT 'Alison Stew' as Student, '405088' as IDNo, '+77721377200' as Tel3, '+77721377200' as Tel1, NULL

INSERT INTO #Student

SELECT 'Alison Stew' as Student, NULL as IDNo, '+77721377200' as Tel3, '+77721377200' as Tel1, NULL

INSERT INTO #Student

SELECT 'Amber Egno' as Student, NULL as IDNo, '+7677170700' as Tel3, '+76772476573' as Tel1, NULL

INSERT INTO #Student

SELECT 'Amber Egno' as Student, NULL as IDNo, '+76772476573' as Tel3, '+76772476573' as Tel1, NULL

INSERT INTO #Student

SELECT 'Carmen Schw' as Student, '083085' as IDNo, '+77711498898' as Tel3, '+77117642110' as Tel1, NULL

INSERT INTO #Student

SELECT 'Carmen Schw' as Student, NULL as IDNo, '+77117642110' as Tel3, '+77117642110' as Tel1, NULL

-------------------------------------------------------------------------------------------------

SELECT

[Student]

,IDNo

,Tel1

,Tel3

INTO #FilterDupes

FROM #Student

-------------------------------------------------------------------------------------------------

SELECT

[Student]

--,IDNo

--,Tel1

--,Tel3

,count([Student]) as Dupes

FROM #FilterDupes

GROUP BY

[Student]

--,IDNo

--,Tel1

--,Tel3

HAVING COUNT([Student]) > 1

DROP TABLE #Student

DROP TABLE #FilterDupes

Sergiy SSC Guru Points: 110208 More actions · Answer 11

Now, what is wrong with the approach I posted?

Where does it fail to provide the result you need?

_____________
Code for TallyGenerator

kevin_nikolai SSCarpal Tunnel Points: 4343 More actions · Answer 12

Hi Sergiy, firstly thanks to you and everyone else who responded to my post.

I think I should rather have named the title of my post:

How to find duplicates based on 3 fields ? (being IDNo, Tel1, Tel3 - for Student)

Your script/sql query worked ok in some instances, but in some cases excluded some students, due to Tel3 / Tel1 / IDNo. What makes writing a query for such an issue difficult is that sometimes all 3 (IDNo, Tel1, Tel3) have values, in other cases only 2 of the 3 have values, in other only 1, in other none have a value. To add more pain to the issue, first instance of IDNo / Tel1 / Tel3 for student Jack might be identical to 2nd instance of Jack, the 3rd instance they might all differ, 4th instance some might differ.

Perhaps I should just use a cte -

Student, Tel1 for 1st query.

Then Student, Tel3 for 2nd query.

Then Student, IDNo for 3rd query.

Then join all 3 together, and hope it does the job.

Sergiy SSC Guru Points: 110208 More actions · Answer 13

hmm, it did not quite explain the problem.

At least to me.;-)

Can you please provide some data samples where the script fails and explain what outcome you'd expect from it.

Then I (or others) could try to figure out a solution.

_____________
Code for TallyGenerator

Mansfield SSCrazy Points: 2396 More actions · Answer 14

The factor preventing us from providing an accurate answer is because we need a set of expected results in a tabular format that the query should generate from the sample data. This is the only robust way to describe the required query logic.

Having said, that I'm including a possible answer using the provided DDL:

-- Set up test data

declare @Student table (

[Student] [varchar](30) NULL,

[IDNo] [varchar](13) NULL,

[Tel3] [varchar](12) NULL,

[Tel1] [varchar](12) NULL,

[Sort] [varchar](12) NULL

);

INSERT INTO @Student

SELECT 'Alison Stew' as Student, '405088' as IDNo, '+77721377200' as Tel3, '+77721377200' as Tel1, null;

INSERT INTO @Student

SELECT 'Alison Stew' as Student, NULL as IDNo, '+77721377200' as Tel3, '+77721377200' as Tel1, null;

INSERT INTO @Student

SELECT 'Amber Egno' as Student, NULL as IDNo, '+7677170700' as Tel3, '+76772476573' as Tel1, null;

INSERT INTO @Student

SELECT 'Amber Egno' as Student, NULL as IDNo, '+76772476573' as Tel3, '+76772476573' as Tel1, null;

INSERT INTO @Student

SELECT 'Carmen Schw' as Student, '083085' as IDNo, '+77711498898' as Tel3, '+77117642110' as Tel1, null;

INSERT INTO @Student

SELECT 'Carmen Schw' as Student, NULL as IDNo, '+77117642110' as Tel3, '+77117642110' as Tel1, null;

-- Query

with rules_per_student as (

selectStudent,

sum(case when IDNo is null then 1 else 0 end) as HasNullRows,

case when min(Tel3) = max(Tel3) then 1 else 0 end as IdenticalTel3Rows,

case when min(Tel1) = max(Tel1) then 1 else 0 end as IdenticalTel1Rows,

sum(case when Tel1 = Tel3 then 1 else 0 end) as HasTel1EqualTel3InRow

from @Student

group by Student

)

selects.Student,

s.IDNo,

s.Tel3,

s.Tel1,

case when r.HasNullRows > 0 then

case when IdenticalTel3Rows = 1 then '1111'

when IdenticalTel3Rows = 0 and IdenticalTel1Rows = 1 then '8888'

when IdenticalTel3Rows = 0 and IdenticalTel1Rows = 0 and HasTel1EqualTel3InRow > 0 then '7777'

end

else null end as Sort

from @Student s

join rules_per_student r on r.student = s.student