November 6, 2008 at 5:21 am
I have written the follow query in sql server to match nhs numbers to various patients in a table, zzWembley WIC 2nd QTR 2008 table
SELECT DISTINCT a.NHS_NUMBER, a.DATE_OF_BIRTH, a.POSTCODE, a.FORENAME,
a.SURNAME, b.[Name]
FROM NSTS_PATIENT a INNER JOIN [zzWembley WIC 2nd QTR 2008] b
ON a.DATE_OF_BIRTH = b.[Date Of Birth] AND a.POSTCODE = b.[Post Code]
Whilst this query matches some patients I think I may be missing a few records because the unique identifier for each patient is: [name][Date of Birth] and [Post code]. However I can not add name as a criteria for ON-Clause because of 2 reasons. Firstly the [name] field in one table combines both first and second name whilst the other has one field for forename and another field for surname. Secondly the names may be spelt incorrectly despite being the same patient. How can overcome this problem.
November 6, 2008 at 6:01 am
Do you not have any unique identification field for the patients? Usually in cases such as these, you would have a unique identifier such as PatientID, or PatientNumber that would be the same across tables.
If the answer to that is no, and your data is bad in the only fields that identify the patients, then you need to fix your data. Fixing data like this is usually done i several passes, and for sensitive data such as yours might have to get actual records checked. I'd start by grouping by postal code and DOB, and showing anything with more than one FirstName / Lastname combination over the tables. I'd then do WHERE NOT IN searches both ways on patient names, with and without linking criteria. Basically, do anything you can think of to possibly show where the data might be different. Don't ignore postal codes and DOB's once you fix all the names, because those being wrong will cause the same issues. If you can get this data all cleaned up, you should:
A. Put policies in place to prevent people from putting bad data into the tables
B. Add a unique identifier for each patient so that you have a way to link them even if the name is wrong (preferred).
C. If you can't do either of the first 2, set up all of the checks you did to find all the bad data in the first place in a task which runs every day and sends you an email, or inserts bad lines into a table, or sets a flag on the table... something to know when a row is possibly bad so it can be corrected before your data gets heavily corrupted again.
November 6, 2008 at 7:41 am
It is just a once of thing that i was given to do. I imported the data from an excel file. The reference table contains the correct data i.e names and nhs numbers. However I didn't quite get what you meant by :
grouping by postal code and DOB, and showing anything with more than one FirstName / Lastname combination over the tables.
I'd then do WHERE NOT IN searches both ways on patient names, with and without linking criteria.
Could you just explain that with a brief example. Sorry to be a pain in the arse.
November 6, 2008 at 8:02 am
I'm referring to basic data cleansing. For example
The first query will show you that Michael Jordan has a name Mismatch with Mike Jordan based on DOB and Zip Code. This may be a name mismatch, or it may simply be another patient with matching DOB and Zip codes. The frequency of the latter should be low. The second query will show you which patients have names / zip codes in common with different DOB's.
[font="Courier New"]------------------- SETUP ---------------------------------
DECLARE @A TABLE(
Name VARCHAR(40),
Zip INT,
DOB DATETIME)
DECLARE @B TABLE(
FirstName VARCHAR(20),
LastName VARCHAR(20),
Zip INT,
DOB DATETIME)
INSERT INTO @A(Name, Zip, DOB)
SELECT 'Michael Jordan',12345,'12/05/68' UNION ALL
SELECT 'Magic Johnson',12543,'12/05/69' UNION ALL
SELECT 'Larry Bird',12243,'12/05/70'
INSERT INTO @B(FirstName, LastName, Zip, DOB)
SELECT 'Mike','Jordan',12345,'12/05/68' UNION ALL
SELECT 'Magic','Johnson',12543,'12/05/69' UNION ALL
SELECT 'Larry','Bird',12243,'12/05/71'
-----------------------------------------------------------
------------- Find Mismatched Names -----------------------
SELECT A.Name, B.Name
FROM @A A
INNER JOIN ( SELECT FirstName + ' ' + LastName Name, Zip, DOB
FROM @B) B
ON A.Zip = B.Zip AND A.DOB = B.DOB
WHERE A.Name <> B.Name
GROUP BY A.Name, B.Name
-------------------- RESULTS ------------------------------
--A.NameB.Name
--Michael JordanMike Jordan
-----------------------------------------------------------
------------- Find Mismatched DOB's -----------------------
SELECT A.Name, A.DOB, B.Name, B.DOB
FROM @A A
INNER JOIN ( SELECT FirstName + ' ' + LastName Name, Zip, DOB
FROM @B) B
ON A.Zip = B.Zip AND A.Name = B.Name
WHERE A.DOB <> B.DOB
GROUP BY A.Name, B.Name, A.DOB, B.DOB
-------------------- RESULTS ------------------------------
--A.NameA.DOBB.NameB.DOB
--Larry Bird1970-12-05 00:00:00.000Larry Bird1971-12-05 00:00:00.000
-----------------------------------------------------------
[/font]
November 6, 2008 at 8:09 am
Thank you very much. I think i should be able to work it from there. That is makes a lot of sense.
November 6, 2008 at 8:12 am
November 6, 2008 at 9:55 am
Just one more thing. I was just playing around with script similar to yours for finding mismatched names. when I use the GROUP BY clause I get 1491 records. When I leave it out I get 1495 records. I just wanted to know why the GROUP BY clause did not return the extra 4 records. What specfically is the GROUP BY clause doing.
November 6, 2008 at 11:46 am
Group by is normally used for aggregates. To be honest, I don't need it in those queries, I just wrote them in a hurry and had originally planned on doing a count. I deleted the count at the last minute and forgot to remove the group by. In your instance, group by is basically removing your duplicates.
Example of how it works:
Customer Sales
1 10.00
1 20.00
2 25.00
2 10.00
SELECT Customer, SUM(Sales)
FROM MyTable
GROUP BY Customer
RESULT
Customer Sales
1 30.00
2 35.00
Viewing 8 posts - 1 through 7 (of 7 total)
You must be logged in to reply to this topic. Login to reply