June 2, 2015 at 9:09 am
Hi, hope I can get some assistance with this problem as I'm trying to avoid a large amount of manual data manipulation.
Here's the background: a legacy system that has (well, let's call apples apples) pretty much no method of enforcing data integrity, which has allowed a fairly decent amount of garbage data to be inserted into some tables. I'm pulling the [Individuals] table from within this legacy system and inserting it into a production system, into the table schema currently in place to track [Individuals] there.
Problem: Inserting the information is easy; what I need assistance with is deduplicating the records in the staging table that the legacy [Individuals] table has been dumped into in production, prior to insertion. (I'd prefer to do this programmatically with SQL or SSIS, so that I can alter it later to allow for updating existing records and inserting new ones.)
Staging Table Schema:
;
CREATE TABLE [dbo].[stage_Individuals](
[SysID] [int] NULL, --Unique, though it's not an index intended to identify the [Individuals]
[JJISID] [nvarchar](10) NULL,
[NameLast] [nvarchar](30) NULL,
[NameFirst] [nvarchar](30) NULL,
[NameMiddle] [nvarchar](30) NULL,
[NameSuffix] [nvarchar](10) NULL,
[PrimaryRace] [nvarchar](6) NULL,
[Ethncity] [nvarchar](6) NULL,
[SSN] [nvarchar](11) NULL,
[DOB] [datetime] NULL,
[DrLicNumber] [nvarchar](20) NULL,
[DrLicState] [int] NULL,
[HeightInches] [int] NULL,
[Feet] [int] NULL,
[Inches] [int] NULL,
[Weight] [int] NULL,
[HairColor] [nvarchar](6) NULL,
[Complexion] [nvarchar](6) NULL,
[EyeColor] [nvarchar](6) NULL
);
Scenario: There are records that duplicate the JJISID, though this value is supposed to be unique for every individual. The SysID is just a clustered index (I'm assuming) within the legacy system and will most likely be dropped when inserted into the production [Individuals] table. There are records that are missing their JJISID, which isn't supposed to happen either, but that have valid information in SSN/DOB/Name/etc. that can be merged into the correct record that has a JJISID assigned. There is really no data conformity: some records have NULLs for everything except the JJISID, while others have all of the [Individuals] information but no JJISID.
Currently I am running the following SQL just to get a list of the records that have a duplicate JJISID (I have others that partition by Name/DOB/etc. and will adapt whatever I come up with to be used for those as well):
;
select j.*
from (select ROW_NUMBER() OVER (PARTITION BY JJISID ORDER BY JJISID) as RowNum,
             stage_Individuals.*,
             COUNT(*) OVER (partition by jjisid) as cnt
      from stage_Individuals) as j
where cnt > 1 and j.JJISID is not null
Now, with SQL Server 2012 or later I could use LAG and LEAD with the RowNum value to do my data manipulation, but that won't work because this environment is on SQL Server 2008.
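(The closest I can think of on 2008 is faking LAG/LEAD with a self-join on RowNum - just an untested sketch of the idea, nothing tuned:)
-- Untested sketch: emulate LAG/LEAD on SQL Server 2008 by self-joining
-- the numbered rows on RowNum within the same JJISID group.
WITH numbered AS
(
    SELECT ROW_NUMBER() OVER (PARTITION BY JJISID ORDER BY SysID) AS RowNum,
           stage_Individuals.*
    FROM dbo.stage_Individuals
    WHERE JJISID IS NOT NULL
)
SELECT cur.JJISID,
       cur.NameFirst,
       prv.NameFirst AS NameFirst_Prev, -- what LAG(NameFirst) would give
       nxt.NameFirst AS NameFirst_Next  -- what LEAD(NameFirst) would give
FROM numbered AS cur
LEFT JOIN numbered AS prv
       ON prv.JJISID = cur.JJISID AND prv.RowNum = cur.RowNum - 1
LEFT JOIN numbered AS nxt
       ON nxt.JJISID = cur.JJISID AND nxt.RowNum = cur.RowNum + 1;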
I found a very helpful post at http://www.sqlservercentral.com/Forums/Topic866172-338-1.aspx
with the following as a potential solution:
GSquared (3/16/2010)
Here's a query that seems to do what you need. Try it, let me know if it works. Performance on it will be a problem, but I can't fine-tune that. You'll need to look at various methods for getting this kind of data from the table and work out which variation will be best for your data. Without access to the actual table, I can't do that.
;
WITH CTE
AS (SELECT master_id,
MIN(ID) AS first_id,
MAX(Account_Expiry) AS latest_expiry
FROM #People
GROUP BY master_id)
SELECT P1.master_id,
P1.[First],
P1.[Last],
CASE WHEN EXISTS ( SELECT 1
FROM #People P3
WHERE master_id = P1.master_id
AND Likes_fish = 1 ) THEN 1
ELSE 0
END AS Likes_Fish,
CASE WHEN EXISTS ( SELECT 1
FROM #People P3
WHERE master_id = P1.master_id
AND Likes_meat = 1 ) THEN 1
ELSE 0
END AS Likes_Meat,
CASE WHEN EXISTS ( SELECT 1
FROM #People P3
WHERE master_id = P1.master_id
AND Drinks_Alcohol = 1 ) THEN 1
ELSE 0
END AS Drinks_Alcohol,
(SELECT TOP 1
Gender
FROM #People P6
WHERE Master_id = P1.Master_id
AND Gender IN (N'M', N'F')) AS Gender,
CTE.latest_expiry,
(SELECT TOP 1
Account_Number
FROM #People P7
WHERE master_id = P1.Master_id
AND Account_Expiry = CTE.latest_expiry) AS Account_Number
FROM #People P1
INNER JOIN CTE
ON P1.master_id = CTE.master_id
AND P1.ID = CTE.first_id;
Unfortunately, I don't think that will accomplish what I'm looking for - I have some records that are duplicated six times, and I want to keep the non-NULL values within each of them.
Basically what I'm looking for is to update any column with a NULL value to the corresponding duplicate [Individuals] record's value for that column.
**EDIT - Example: Record 1 has a JJISID with NULL NameFirst & NameLast, BUT Record 2 has the same JJISID and values for NameFirst & NameLast. I'm wanting to propagate the NameFirst & NameLast from Record 2 into Record 1.
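To make the shape of it concrete, something like this is what I'm picturing - just a sketch with a couple of columns shown, and MAX standing in for "any non-NULL value in the group":
-- Rough sketch of the intent only: fill NULL columns from other rows
-- sharing the same JJISID (column list abbreviated).
UPDATE s
SET NameFirst = COALESCE(s.NameFirst, d.NameFirst),
    NameLast  = COALESCE(s.NameLast,  d.NameLast),
    DOB       = COALESCE(s.DOB,       d.DOB)
FROM dbo.stage_Individuals AS s
INNER JOIN
(
    SELECT JJISID,
           MAX(NameFirst) AS NameFirst,
           MAX(NameLast)  AS NameLast,
           MAX(DOB)       AS DOB
    FROM dbo.stage_Individuals
    WHERE JJISID IS NOT NULL
    GROUP BY JJISID
) AS d
    ON d.JJISID = s.JJISID;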
Any idears?
June 2, 2015 at 10:16 am
Maybe something along the lines below. You could save the results into another table to come back and update the main table with the consolidated data.
You could also add:
COUNT(DISTINCT CASE WHEN <column_name> > '' THEN <column_name> END)
for any column(s) for which you need to know how many different values there were.
SELECT
    JJISID,
    MAX(SysID) AS SysID,
    MIN(SysID) AS SysID_Min, --to test for multiple values
    MAX(NameLast) AS NameLast,
    MAX(DOB) AS DOB,
    MIN(DOB) AS DOB_Min --to test for multiple values
--INTO #consolidated_Individuals
FROM dbo.stage_Individuals
GROUP BY
    JJISID
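Then, very roughly, something like this to fold the consolidated values back in (assuming the SELECT above was run with the INTO #consolidated_Individuals uncommented; column list abbreviated):
--Rough follow-up sketch: fill NULL staging columns from the consolidated rows.
UPDATE si
SET NameLast = COALESCE(si.NameLast, c.NameLast),
    DOB      = COALESCE(si.DOB,      c.DOB)
FROM dbo.stage_Individuals si
INNER JOIN #consolidated_Individuals c ON c.JJISID = si.JJISID;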
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
June 2, 2015 at 10:54 am
Hadn't thought of COUNT(DISTINCT CASE WHEN <column_name> > '' THEN <column_name> END) for the count of values within a possible column - I like that!
I was thinking of using MIN & MAX already; I just don't like how they handle text values, so I was thinking I'd have to use some LEN-based aggregate to pass along the values with a longer LEN than the corresponding values in their duplicate records... but I'm having a hard time figuring out how to do so. Of course in this case 'M' would collate as the MIN value when compared to MAX 'Michael', but if we run into an instance where the middle initial is 'N' on one duplicate and 'M' on the other, and I actually want to keep 'M', then using MAX for all records to pass along NameMiddle would return 'N' instead of 'M'.
June 2, 2015 at 11:59 am
Dougieson (6/2/2015)
Hadn't thought of COUNT(DISTINCT CASE WHEN <column_name> > '' THEN <column_name> END) for the count of values within a possible column, I like that!I was thinking of using MIN & MAX already, I just don't like how it handles TEXT values so I was thinking I'll have to use a LEN aggregate function to pass along those values with a longer LEN than their duplicate records with a corresponding value in that column...but I'm having a hard time figuring out how to do so. Of course in this case M would collate as the MIN value when compared to MAX Michael, but if we run into an instance where Middle Initials is 'N' on one duplicate and 'M' on the other and I actually wanted to keep 'M', then using MAX for all records to pass along NameMiddle would cause 'N' to be passed instead of 'M'.
But how do you know you want to keep "M" rather than "N"? That seems like a specific override, which you'd have to do by hand in any case. You could also do a MAX(LEN(<column_name>)) if/as needed.
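For example, roughly (untested sketch; a correlated TOP 1 to pull the longest non-blank NameMiddle per JJISID):
--Untested sketch: grab the longest non-blank NameMiddle in each JJISID group.
SELECT j.JJISID,
       (SELECT TOP 1 si.NameMiddle
        FROM dbo.stage_Individuals si
        WHERE si.JJISID = j.JJISID
          AND si.NameMiddle > ''
        ORDER BY LEN(si.NameMiddle) DESC) AS NameMiddle_Longest
FROM (SELECT DISTINCT JJISID
      FROM dbo.stage_Individuals
      WHERE JJISID IS NOT NULL) AS j;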
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
June 2, 2015 at 1:14 pm
ScottPletcher (6/2/2015)
Dougieson (6/2/2015)
Hadn't thought of COUNT(DISTINCT CASE WHEN <column_name> > '' THEN <column_name> END) for the count of values within a possible column, I like that!I was thinking of using MIN & MAX already, I just don't like how it handles TEXT values so I was thinking I'll have to use a LEN aggregate function to pass along those values with a longer LEN than their duplicate records with a corresponding value in that column...but I'm having a hard time figuring out how to do so. Of course in this case M would collate as the MIN value when compared to MAX Michael, but if we run into an instance where Middle Initials is 'N' on one duplicate and 'M' on the other and I actually wanted to keep 'M', then using MAX for all records to pass along NameMiddle would cause 'N' to be passed instead of 'M'.
But how do you know you want to keep "M" rather than "N"? That seems like a specific override, which you'd have to do by hand in any case. You could also do a MAX(LEN(<column_name>)) if/as needed.
Exactly, I was trying to come up with a method of consolidating the information without eliminating the values that I want to keep as unique record sets - so on the consolidated table there would still be records I'd have to manually edit/delete.
Thinking about it, I could group by both columns to come up with a composite unique value....hmmm something to think on.
Fortunately, I THINK, MIN/MAX won't be a problem on the text columns I was concerned about - filtering through them so far, it appears that the only values I want to pass through will collate out to MAX in each instance, so I can use it... still not pretty though.
On a lighter note, my favorite garbage data so far has been DrLicNumber='Yes' or the even better DrLicNumber='Has One'
June 3, 2015 at 2:06 am
I've worked on a similar process; we had to identify clients across multiple 'unique' identifiers who were actually the same person and where some records had more data populated than others. The method I used was to write the details of each matching pair to a separate table along with a 'confidence' score in the match. You'd have to base this on your knowledge of the data, but I gave high scores to matches on National Insurance number, phone number and email address; lower scores to matches just on name (although the score increased with each forename, ie. 3 out of 3 forenames is way better than 1 out of 1) - same with initials. Someone still has to eyeball the results, but you might be able to pick a confidence level above which you always assume the match is correct.
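Mapped onto your staging table, the shape of it would be something like this (a rough sketch only - the matching columns and weights are placeholders you'd have to tune to your data):
--Rough illustration only: score candidate pairs into a review table.
--The join conditions and weights below are made up for the example.
SELECT a.SysID AS SysID_A,
       b.SysID AS SysID_B,
       CASE WHEN a.SSN = b.SSN THEN 50 ELSE 0 END
     + CASE WHEN a.DOB = b.DOB THEN 20 ELSE 0 END
     + CASE WHEN a.NameLast = b.NameLast AND a.NameFirst = b.NameFirst THEN 30 ELSE 0 END
       AS MatchScore
INTO #candidate_matches
FROM dbo.stage_Individuals a
INNER JOIN dbo.stage_Individuals b
        ON b.SysID > a.SysID            -- each pair considered once
       AND (a.SSN = b.SSN
            OR (a.NameLast = b.NameLast AND a.DOB = b.DOB));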
June 3, 2015 at 6:19 am
Chris Wooding (6/3/2015)
I've worked on a similar process; we had to identify clients across multiple 'unique' identifiers who were actually the same person and where some records had more data populated than others. The method I used was to write the details of each matching pair to a separate table along with a 'confidence' score in the match. You'd have to base this on your knowledge of the data, but I gave high scores to matches on National Insurance number, phone number and email address; lower scores to matches just on name (although the score increased with each forename, ie. 3 out of 3 forenames is way better than 1 out of 1) - same with initials. Someone still has to eyeball the results, but you might be able to pick a confidence level above which you always assume the match is correct.
Oooooohhhhh, I definitely like that idea, and it's fairly straightforward to implement. Hadn't thought of doing something along these lines; when I've combined data previously I've always assumed 'unique = unique, duplicate = combine' and then gone through the uniques manually to ensure no duplication. With this method I could do 'unique = unique, duplicate = combine, semi-duplicate = mark and review'.