April 19, 2013 at 6:52 am
/*
I need to filter for duplicate students:
---------------------------------------
For same student, if one or both instance/s of IDNo is null then do the following:
1.
if both instances of Tel3 are identical for same Student (Alison), if they are
use it to populate field Sort with the value, 1111.
2.
if both instances of Tel3 are not identical for same Student (Carmen), check if both instances of Tel1 are identical for same Student, if they are use it to populate field Sort with the value, 8888.
3.
if both instances of Tel3 and Tel1 are not identical for same Student (Amber), check if an instance of Tel3 & Tel1 are identical for same Student, if they are use it to populate field Sort with the value, 7777.
Since there is no match between 1st row and 2nd row (except for the name) we can't say it's the
same user for sure, so don't populate the field Sort with any value where row = ( 'Amber' as Student, NULL as IDNo, '5555' as Tel3, '6666' as Tel1 ).
*/
SELECT 'Alison' as Student, NULL as IDNo, '1111' as Tel3, '4444' as Tel1, NULL as Sort UNION ALL
SELECT 'Alison' as Student, '80405088' as IDNo, '1111' as Tel3, '4444' as Tel1, NULL as Sort UNION ALL
SELECT 'Amber' as Student, NULL as IDNo, '5555' as Tel3, '6666' as Tel1, NULL as Sort UNION ALL
SELECT 'Amber' as Student, NULL as IDNo, '7777' as Tel3, '7777' as Tel1, NULL as Sort UNION ALL
SELECT 'Carmen' as Student, NULL as IDNo, '8888' as Tel3, '8888' as Tel1, NULL as Sort UNION ALL
SELECT 'Carmen' as Student, '20083085' as IDNo, '9999' as Tel3, '8888' as Tel1, NULL as Sort
***
Apologies for duplicate posting, when trying to post initially I got DNS errors from Internet Explorer, so re-posted, again DNS errors, etc. After the 3rd attempt my post was successful, only to notice more entries posted.
***
April 19, 2013 at 7:23 am
This would be a lot easier with ddl (create table statements) and sample data (insert statements) along with the desired output based on the sample data. Please take a few minutes and read the first article in my signature for best practices when posting questions.
_______________________________________________________________
Need help? Help us help you.
Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.
Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.
Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
April 19, 2013 at 7:49 pm
SELECT EmpPin, trxdate, trxtime, COUNT(*) AS dupes
FROM dbo.rawtrx
GROUP BY EmpPin, trxdate, trxtime
HAVING (COUNT(*) > 1)
play around with this query
April 19, 2013 at 7:55 pm
okay, after reading further your questions and scenarios... i doubt that my last post would work...
how many times do you have the same student recurring? do you ever have a situation like this?
kyle null 1111 1111
kyle null 1111 1112
kyle 111 1111 1111
April 20, 2013 at 4:41 am
sdhanpaul,
Prior to removing duplicate instance of user, I first want to get list of duplicate users, to do that I need to filter them and make sure they really are duplicates.
I can filter them by IDNo - unfortunately more than half of the users with same name (firstname + surname) don't have IDNo, so I am forced to check TelephoneNo3 and TelephoneNo1 fields to try and identify if they really are the same user/s. As in my example above, the first row for user Alison is not identical to the second row for user Alison, same applied to the other users. So using COUNT(*) and HAVING (COUNT(*) > 1) clause won't do the job.
April 21, 2013 at 5:26 pm
how about if you join the fields?
Try this:
WITH CTE (emppin,trxdate, trxtime, DuplicateCount)
AS
(
SELECT emppin,trxdate, trxtime,
ROW_NUMBER() OVER(PARTITION BY emppin,trxdate, trxtime ORDER BY emppin) AS DuplicateCount
FROM dbo.rawtrx
)
DELETE
FROM CTE
WHERE DuplicateCount > 1
you need to replace the fields etc
this however deletes the duplicates... im really not much of a bottle feeder so you need to do the modifications.
April 21, 2013 at 10:06 pm
I would like to follow this, but I don't understand the sentence "if they are use it to populate field Sort with the value". I don't see the Sort column populated at all for any of the examples.
Did the replies here already help you?
April 22, 2013 at 7:40 am
Notice how people just keep taking random attempts at helping you. If you would take a few minutes and post ddl, sample data and desired output you would have a tested answer quickly. This really isn't too difficult but without something to work with it is just guessing.
_______________________________________________________________
Need help? Help us help you.
Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.
Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.
Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
April 22, 2013 at 3:26 pm
Apart from what Sean said (i.e. the missing DDL, which is basically table create statements plus some INSERT statements of sample data to create your scenario), I assume you also have a Tel2 column in that table? Not a good idea, since it makes your database design incredibly inflexible. Read up on Normalization (start with this Wikipedia article or read through Tom Thomson's really great series on Normalization on this site. By the way, is this an exam question?
April 22, 2013 at 9:10 pm
First, you need to have some kind of ID on the records in the table, otherwise each record will be always matched to itself.
Assuming, you have such ID here is a test table (which should be really provided by you in the initial post):
--DROP TABLE #Student
SELECT IDENTITY(int, 1,1) RowID, *
INTO #Student
FROM (
SELECT 'Alison' as Student, NULL as IDNo, '1111' as Tel3, '4444' as Tel1, NULL as Sort UNION ALL
SELECT 'Alison' as Student, '80405088' as IDNo, '1111' as Tel3, '4444' as Tel1, NULL as Sort UNION ALL
SELECT 'Amber' as Student, NULL as IDNo, '5555' as Tel3, '6666' as Tel1, NULL as Sort UNION ALL
SELECT 'Amber' as Student, NULL as IDNo, '7777' as Tel3, '7777' as Tel1, NULL as Sort UNION ALL
SELECT 'Carmen' as Student, NULL as IDNo, '8888' as Tel3, '8888' as Tel1, NULL as Sort UNION ALL
SELECT 'Carmen' as Student, '20083085' as IDNo, '9999' as Tel3, '8888' as Tel1, NULL as Sort
) DT (Student, IDNo, Tel3, Tel1, Sort)
And here is the query returning matching pairs with appropriate Sort value:
SELECT T1.Student, T1.RowID, T1.IDNo, T1.Tel3, T1.Tel1,
CASE
WHEN T2.Tel3 = T1.Tel3 THEN 1111
WHEN T2.Tel1 = T1.Tel1 THEN 8888
WHEN T2.Tel1 = T1.Tel3 OR T2.Tel3 = T1.Tel1 THEN 7777
END Sort
, T2.RowID,T2.IDNo, T2.Tel3, T2.Tel1
FROM #Student T1
INNER JOIN #Student T2 ON T2.Student = T1.Student AND T2.RowID <> T1.RowID AND T2.IDNo IS NULL AND (
T2.Tel3 = T1.Tel3
OR T2.Tel1 = T1.Tel1
OR T2.Tel1 = T1.Tel3 OR T2.Tel3 = T1.Tel1
)
Assuming that it's the lines with IDNo which must be marked as duplicates, I apply the Sort status to T2 entries:
UPDATE T2
SET Sort = CASE
WHEN T2.Tel3 = T1.Tel3 THEN 1111
WHEN T2.Tel1 = T1.Tel1 THEN 8888
WHEN T2.Tel1 = T1.Tel3 OR T2.Tel3 = T1.Tel1 THEN 7777
END
FROM ..
Is it close to what you need?
_____________
Code for TallyGenerator
April 23, 2013 at 12:51 am
Hi, perhaps I did not explain myself clearly, so I am trying to fix that.
I need to search a table with user records, find only those users that occur more than once.
I could just use the username (Student) to find the duplicates, but unfortunately half of their IDNo's are NULL, so I it's you cannot say they are duplicates. The only other way to check if they are duplicates is to check Tel3 and Tel1 (to try and check if they are duplicates).
Since are few forum users are asking for ddl, below I have provided it. It works to a certain extent.
And NO, this is not for an exam, interview questionnaire, study topic, etc.
CREATE TABLE #Student (
[Student] [varchar](30) NULL,
[IDNo] [varchar](13) NULL,
[Tel3] [varchar](12) NULL,
[Tel1] [varchar](12) NULL,
[Sort] [varchar](12) NULL
)
INSERT INTO #Student
SELECT 'Alison Stew' as Student, '405088' as IDNo, '+77721377200' as Tel3, '+77721377200' as Tel1, NULL
INSERT INTO #Student
SELECT 'Alison Stew' as Student, NULL as IDNo, '+77721377200' as Tel3, '+77721377200' as Tel1, NULL
INSERT INTO #Student
SELECT 'Amber Egno' as Student, NULL as IDNo, '+7677170700' as Tel3, '+76772476573' as Tel1, NULL
INSERT INTO #Student
SELECT 'Amber Egno' as Student, NULL as IDNo, '+76772476573' as Tel3, '+76772476573' as Tel1, NULL
INSERT INTO #Student
SELECT 'Carmen Schw' as Student, '083085' as IDNo, '+77711498898' as Tel3, '+77117642110' as Tel1, NULL
INSERT INTO #Student
SELECT 'Carmen Schw' as Student, NULL as IDNo, '+77117642110' as Tel3, '+77117642110' as Tel1, NULL
-------------------------------------------------------------------------------------------------
SELECT
[Student]
,IDNo
,Tel1
,Tel3
INTO #FilterDupes
FROM #Student
-------------------------------------------------------------------------------------------------
SELECT
[Student]
--,IDNo
--,Tel1
--,Tel3
,count([Student]) as Dupes
FROM #FilterDupes
GROUP BY
[Student]
--,IDNo
--,Tel1
--,Tel3
HAVING COUNT([Student]) > 1
DROP TABLE #Student
DROP TABLE #FilterDupes
April 23, 2013 at 4:19 am
Now, what is wrong with the approach I posted?
Where does it fail to provide the result you need?
_____________
Code for TallyGenerator
April 23, 2013 at 4:08 pm
Hi Sergiy, firstly thanks to you and everyone else who responded to my post.
I think I should rather have named the title of my post:
How to find duplicates based on 3 fields ? (being IDNo, Tel1, Tel3 - for Student)
Your script/sql query worked ok in some instances, but in some cases excluded some students, due to Tel3 / Tel1 / IDNo. What makes writing a query for such an issue difficult is that sometimes all 3 (IDNo, Tel1, Tel3) have values, in other cases only 2 of the 3 have values, in other only 1, in other none have a value. To add more pain to the issue, first instance of IDNo / Tel1 / Tel3 for student Jack might be identical to 2nd instance of Jack, the 3rd instance they might all differ, 4th instance some might differ.
Perhaps I should just use a cte -
Student, Tel1 for 1st query.
Then Student, Tel3 for 2nd query.
Then Student, IDNo for 3rd query.
Then join all 3 together, and hope it does the job.
April 23, 2013 at 6:46 pm
hmm, it did not quite explain the problem.
At least to me.;-)
Can you please provide some data samples where the script fails and explain what outcome you'd expect from it.
Then I (or others) could try to figure out a solution.
_____________
Code for TallyGenerator
April 23, 2013 at 9:10 pm
The factor preventing us from providing an accurate answer is because we need a set of expected results in a tabular format that the query should generate from the sample data. This is the only robust way to describe the required query logic.
Having said, that I'm including a possible answer using the provided DDL:
-- Set up test data
declare @Student table (
[Student] [varchar](30) NULL,
[IDNo] [varchar](13) NULL,
[Tel3] [varchar](12) NULL,
[Tel1] [varchar](12) NULL,
[Sort] [varchar](12) NULL
);
INSERT INTO @Student
SELECT 'Alison Stew' as Student, '405088' as IDNo, '+77721377200' as Tel3, '+77721377200' as Tel1, null;
INSERT INTO @Student
SELECT 'Alison Stew' as Student, NULL as IDNo, '+77721377200' as Tel3, '+77721377200' as Tel1, null;
INSERT INTO @Student
SELECT 'Amber Egno' as Student, NULL as IDNo, '+7677170700' as Tel3, '+76772476573' as Tel1, null;
INSERT INTO @Student
SELECT 'Amber Egno' as Student, NULL as IDNo, '+76772476573' as Tel3, '+76772476573' as Tel1, null;
INSERT INTO @Student
SELECT 'Carmen Schw' as Student, '083085' as IDNo, '+77711498898' as Tel3, '+77117642110' as Tel1, null;
INSERT INTO @Student
SELECT 'Carmen Schw' as Student, NULL as IDNo, '+77117642110' as Tel3, '+77117642110' as Tel1, null;
-- Query
with rules_per_student as (
selectStudent,
sum(case when IDNo is null then 1 else 0 end) as HasNullRows,
case when min(Tel3) = max(Tel3) then 1 else 0 end as IdenticalTel3Rows,
case when min(Tel1) = max(Tel1) then 1 else 0 end as IdenticalTel1Rows,
sum(case when Tel1 = Tel3 then 1 else 0 end) as HasTel1EqualTel3InRow
from @Student
group by Student
)
selects.Student,
s.IDNo,
s.Tel3,
s.Tel1,
case when r.HasNullRows > 0 then
case when IdenticalTel3Rows = 1 then '1111'
when IdenticalTel3Rows = 0 and IdenticalTel1Rows = 1 then '8888'
when IdenticalTel3Rows = 0 and IdenticalTel1Rows = 0 and HasTel1EqualTel3InRow > 0 then '7777'
end
else null end as Sort
from @Student s
join rules_per_student r on r.student = s.student
Viewing 15 posts - 1 through 15 (of 23 total)
You must be logged in to reply to this topic. Login to reply