October 26, 2016 at 7:35 am
I am currently looking at identifying duplicate names and looking for behavior patters. A duplicate in this case is very high level and there is risk involved as you will see.
Definition of a duplicate: The same first name, surname and date of birth. Obviously there are genuine different people that hit this criteria (this dataset is approx. 3 million records) but it does not matter at the moment.
I am looking at a data set for the previous week and running a query to identify new people that have been entered into the database. I use a sub query to get this and there are about 757 from last week.
I need to create a SQL script that can take there 757 and cross reference / compare it to the same table to see if any duplicates come up that match the criteria of same first name, surname and data of birth.
I have tried to give it ago but it seems to just show the same records side by side so need to exclude people with matching ID's as this would be pointless.
SELECT O.ow_id,
O.ow_forenames,
O.ow_surname,
O.ow_birthdate,
O2.dp_person,
O2.ow_forenames,
O2.ow_surname,
O2.ow_birthdate
FROM props.dp_owner O
join (SELECT E.dp_person,
O3.ow_forenames,
O3.ow_surname,
O3.ow_birthdate
FROM props.dp_entry E
join props.dp_owner O3
ON E.dp_person = O3.ow_id
WHERE E.dp_date_entered BETWEEN To_date('17-OCT-16') AND To_date(
'23-OCT-16')
AND E.dp_person IS NOT NULL
GROUP BY E.dp_person,
O3.ow_forenames,
O3.ow_surname,
O3.ow_birthdate) O2
ON ( O.ow_forenames = O2.ow_forenames
AND O.ow_surname = O2.ow_surname
AND O.ow_birthdate = O2.ow_birthdate )
Thanks in advance,
Paul
October 26, 2016 at 8:02 am
October 26, 2016 at 8:08 am
dramaqueen (10/26/2016)
I am currently looking at identifying duplicate names and looking for behavior patters. A duplicate in this case is very high level and there is risk involved as you will see.Definition of a duplicate: The same first name, surname and date of birth. Obviously there are genuine different people that hit this criteria (this dataset is approx. 3 million records) but it does not matter at the moment.
I am looking at a data set for the previous week and running a query to identify new people that have been entered into the database. I use a sub query to get this and there are about 757 from last week.
I need to create a SQL script that can take there 757 and cross reference / compare it to the same table to see if any duplicates come up that match the criteria of same first name, surname and data of birth.
I have tried to give it ago but it seems to just show the same records side by side so need to exclude people with matching ID's as this would be pointless.
SELECT O.ow_id,
O.ow_forenames,
O.ow_surname,
O.ow_birthdate,
O2.dp_person,
O2.ow_forenames,
O2.ow_surname,
O2.ow_birthdate
FROM props.dp_owner O
join (SELECT E.dp_person,
O3.ow_forenames,
O3.ow_surname,
O3.ow_birthdate
FROM props.dp_entry E
join props.dp_owner O3
ON E.dp_person = O3.ow_id
WHERE E.dp_date_entered BETWEEN To_date('17-OCT-16') AND To_date(
'23-OCT-16')
AND E.dp_person IS NOT NULL
GROUP BY E.dp_person,
O3.ow_forenames,
O3.ow_surname,
O3.ow_birthdate) O2
ON ( O.ow_forenames = O2.ow_forenames
AND O.ow_surname = O2.ow_surname
AND O.ow_birthdate = O2.ow_birthdate )
Thanks in advance,
Paul
So ... add a line to the end of the query to remove those:
WHERE o.ow_id <> o2.ow_id
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
October 26, 2016 at 8:29 am
Perfect thanks. It worked and was really simple.
Viewing 4 posts - 1 through 3 (of 3 total)
You must be logged in to reply to this topic. Login to reply