Reconcile employees with multiple IDs into Unique IDs

  • Hi,

    I have a scenario where my employee table has multiple IDs. The background to this scenario is that due to a change in the reporting systems employees who overlapped during the change ended up with more than one ID as in the table below

    CREATE TABLE #input 
    (
    Emp_codevarchar(300),
    Employee_Namevarchar(300),
    SSN varchar(300)
    );

    INSERT INTO #input (Emp_code,Employee_Name,SSN ) VALUES ('A00001', 'Employee Name 1', 'SSN1 ');
    INSERT INTO #input (Emp_code,Employee_Name,SSN ) VALUES ('A00002', 'Employee Name 2', 'SSN2 ');
    INSERT INTO #input (Emp_code,Employee_Name,SSN ) VALUES ('A00003', 'Employee Name 3', 'SSN3 ');
    INSERT INTO #input (Emp_code,Employee_Name,SSN ) VALUES ('A00004', 'Employee Name 4', 'SSN4 ');
    INSERT INTO #input (Emp_code,Employee_Name,SSN ) VALUES ('A00005', 'Employee Name 5', 'SSN5 ');
    INSERT INTO #input (Emp_code,Employee_Name,SSN ) VALUES ('A00006', 'Employee Name 6', 'SSN6 ');
    INSERT INTO #input (Emp_code,Employee_Name,SSN ) VALUES ('A00007', 'Employee Name 7', 'SSN7 ');
    INSERT INTO #input (Emp_code,Employee_Name,SSN ) VALUES ('B90001', 'Employee Name 1', 'SSN1 ');
    INSERT INTO #input (Emp_code,Employee_Name,SSN ) VALUES ('B90002', 'Employee Name 2', 'SSN2 ');
    INSERT INTO #input (Emp_code,Employee_Name,SSN ) VALUES ('B90003', 'Employee Name 3', 'SSN3 ');
    INSERT INTO #input (Emp_code,Employee_Name,SSN ) VALUES ('B90005', 'Employee Name 5', 'SSN5 ');
    INSERT INTO #input (Emp_code,Employee_Name,SSN ) VALUES ('B90006', 'Employee Name 6', 'SSN6 ');
    INSERT INTO #input (Emp_code,Employee_Name,SSN ) VALUES ('B90007', 'Employee Name 7', 'SSN7 ');

    As can be seen from the table included, Employee Name 1 has two employee codes - A00001 and B90001. However, every employee has a unique SSN.  In the list above, only Employee Name 4 has  one Employee Code, which means this employee didn't exist in the previous system

    I am trying to reconcile both sets of employees codes based on SSN such that there is a unique employee code which is a combination of the "New System"+"Old System" (where available)

    The output I am trying to achieve should look something like this

    CREATE TABLE #output 
    (
    Emp_codevarchar(300),
    Employee_Namevarchar(300),
    SSN varchar(300)
    );

    INSERT INTO #output (Emp_code,Employee_Name,SSN ) VALUES ('A00001_B90001', 'Employee Name 1', 'SSN1 ');
    INSERT INTO #output (Emp_code,Employee_Name,SSN ) VALUES ('A00002_B90002', 'Employee Name 2', 'SSN2 ');
    INSERT INTO #output (Emp_code,Employee_Name,SSN ) VALUES ('A00003_B90003', 'Employee Name 3', 'SSN3 ');
    INSERT INTO #output (Emp_code,Employee_Name,SSN ) VALUES ('A00004', 'Employee Name 4', 'SSN4 ');
    INSERT INTO #output (Emp_code,Employee_Name,SSN ) VALUES ('A00005_B90005', 'Employee Name 5', 'SSN5 ');
    INSERT INTO #output (Emp_code,Employee_Name,SSN ) VALUES ('A00006_B90006', 'Employee Name 6', 'SSN6 ');
    INSERT INTO #output (Emp_code,Employee_Name,SSN ) VALUES ('A00007_B90007', 'Employee Name 7', 'SSN7 ');

    I attempted this using Pivot table but getting stuck in splitting the SSN into multiple employee ids

    Any help would be greatly appreciated

    Thanks!

     

  • Hi ,

    you can try using below query

    Select Employee_Name,SSN,STRING_AGG(Emp_code,'_') within group (order by Emp_code) from #input

    group by Employee_Name,SSN

  •  

    SELECT
    MIN(Emp_code) + CASE WHEN COUNT(*) > 1 THEN '_' + MAX(Emp_code) ELSE '' END AS Emp_code,
    MAX(Employee_Name) AS Employee_Name,
    SSN
    FROM #input
    GROUP BY SSN
    ORDER BY SSN

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • Thanks for that.  The sample I included was just a subset of my table. The information I am working with is payroll data. The table contains employees with multiple pay dates and pay types such tax, leave, bonus, and a few other specific to the company. This adds an extra level of granularity because of which there can be multiple instances of an Employee ID

    I have tried to expand my dataset a bit more in the input table below if it helps

    CREATE TABLE #input 
    (
    Employee_ID varchar(300),
    SSN varchar(300),
    Pay_ID varchar(300),
    Date varchar(300),
    AmountPaid varchar(300),
    Col1 varchar(300),
    Col2 varchar(300),
    Col3 varchar(300),
    Col4 varchar(300),
    source varchar(300)
    );

    INSERT INTO #input (Employee_ID,SSN,Pay_ID,Date,AmountPaid,Col1,Col2,Col3,Col4,source )
    VALUES
    ('A00001', 'SSN1', 'A00001_20170910', '10/09/2017', '412', 'RandomText1', 'RandomText2', 'RandomText3', 'RandomText4', 'SYS1 '),
    ('A00001', 'SSN1', 'A00001_20170910', '10/09/2017', '164', 'RandomText2', 'RandomText3', 'RandomText4', 'RandomText5', 'SYS1 '),
    ('A00001', 'SSN1', 'A00001_20180114', '14/01/2018', '295', 'RandomText3', 'RandomText4', 'RandomText5', 'RandomText6', 'SYS1 '),
    ('B00001', 'SSN1', 'B00001_20201210', '10/12/2020', '146', 'RandomText4', 'RandomText5', 'RandomText6', 'RandomText7', 'SYS1 '),
    ('B00001', 'SSN1', 'B00001_20201210', '10/12/2020', '183', 'RandomText5', 'RandomText6', 'RandomText7', 'RandomText8', 'SYS1 '),
    ('B00001', 'SSN1', 'B00001_20201210', '10/12/2020', '211', 'RandomText6', 'RandomText7', 'RandomText8', 'RandomText9', 'SYS1 '),
    ('A00001', 'SSN1', 'A00001_20170521', '21/05/2017', '271', 'RandomText7', 'RandomText7', 'RandomText7', 'RandomText7', 'SYS2 '),
    ('A00001', 'SSN1', 'A00001_20170409', '9/04/2017', '310', 'RandomText8', 'RandomText8', 'RandomText8', 'RandomText8', 'SYS2 '),
    ('A00001', 'SSN1', 'A00001_20170409', '9/04/2017', '449', 'RandomText9', 'RandomText9', 'RandomText9', 'RandomText9', 'SYS2 '),
    ('A00001', 'SSN1', 'A00001_20170312', '12/03/2017', '385', 'RandomText10', 'RandomText10', 'RandomText10', 'RandomText10', 'SYS2 '),
    ('A00002', 'SSN2', 'A00002_20201126', '26/11/2020', '353', 'RandomText11', 'RandomText11', 'RandomText11', 'RandomText11', 'SYS1 '),
    ('A00002', 'SSN2', 'A00002_20201126', '26/11/2020', '305', 'RandomText12', 'RandomText12', 'RandomText12', 'RandomText12', 'SYS1 '),
    ('A00002', 'SSN2', 'A00002_20201126', '26/11/2020', '481', 'RandomText13', 'RandomText13', 'RandomText13', 'RandomText13', 'SYS1 '),
    ('B00002', 'SSN2', 'B00002_20160612', '12/06/2016', '461', 'RandomText14', 'RandomText14', 'RandomText14', 'RandomText14', 'SYS2 '),
    ('B00002', 'SSN2', 'B00002_20160612', '12/06/2016', '363', 'RandomText15', 'RandomText15', 'RandomText15', 'RandomText15', 'SYS2 '),
    ('B00002', 'SSN2', 'B00002_20160612', '12/06/2016', '499', 'RandomText16', 'RandomText16', 'RandomText16', 'RandomText16', 'SYS2 '),
    ('B00002', 'SSN2', 'B00002_20160626', '26/06/2016', '362', 'RandomText17', 'RandomText17', 'RandomText17', 'RandomText17', 'SYS2 ');

    When I tried the Sting_Agg code, the result I get is a column with as many occurrences of an ID separated by a delimiter, which is not what I need. The expected output should be

    CREATE TABLE #output 
    (
    Employee_IDvarchar(300),
    SSNvarchar(300),
    Pay_IDvarchar(300),
    Datevarchar(300),
    AmountPaidvarchar(300),
    Col1varchar(300),
    Col2varchar(300),
    Col3varchar(300),
    Col4varchar(300),
    source varchar(300)
    );

    INSERT INTO #output (Employee_ID,SSN,Pay_ID,Date,AmountPaid,Col1,Col2,Col3,Col4,source )
    VALUES
    ('A00001_B00001', 'SSN1', 'A00001_20170910', '10/09/2017', '412', 'RandomText1', 'RandomText2', 'RandomText3', 'RandomText4', 'SYS1 '),
    ('A00001_B00001', 'SSN1', 'A00001_20170910', '10/09/2017', '164', 'RandomText2', 'RandomText3', 'RandomText4', 'RandomText5', 'SYS1 '),
    ('A00001_B00001', 'SSN1', 'A00001_20180114', '14/01/2018', '295', 'RandomText3', 'RandomText4', 'RandomText5', 'RandomText6', 'SYS1 '),
    ('A00001_B00001', 'SSN1', 'B00001_20201210', '10/12/2020', '146', 'RandomText4', 'RandomText5', 'RandomText6', 'RandomText7', 'SYS1 '),
    ('A00001_B00001', 'SSN1', 'B00001_20201210', '10/12/2020', '183', 'RandomText5', 'RandomText6', 'RandomText7', 'RandomText8', 'SYS1 '),
    ('A00001_B00001', 'SSN1', 'B00001_20201210', '10/12/2020', '211', 'RandomText6', 'RandomText7', 'RandomText8', 'RandomText9', 'SYS1 '),
    ('A00001_B00001', 'SSN1', 'A00001_20170521', '21/05/2017', '271', 'RandomText7', 'RandomText7', 'RandomText7', 'RandomText7', 'SYS2 '),
    ('A00001_B00001', 'SSN1', 'A00001_20170409', '9/04/2017', '310', 'RandomText8', 'RandomText8', 'RandomText8', 'RandomText8', 'SYS2 '),
    ('A00001_B00001', 'SSN1', 'A00001_20170409', '9/04/2017', '449', 'RandomText9', 'RandomText9', 'RandomText9', 'RandomText9', 'SYS2 '),
    ('A00001_B00001', 'SSN1', 'A00001_20170312', '12/03/2017', '385', 'RandomText10', 'RandomText10', 'RandomText10', 'RandomText10', 'SYS2 '),
    ('A00002_B00002', 'SSN2', 'A00002_20201126', '26/11/2020', '353', 'RandomText11', 'RandomText11', 'RandomText11', 'RandomText11', 'SYS1 '),
    ('A00002_B00002', 'SSN2', 'A00002_20201126', '26/11/2020', '305', 'RandomText12', 'RandomText12', 'RandomText12', 'RandomText12', 'SYS1 '),
    ('A00002_B00002', 'SSN2', 'A00002_20201126', '26/11/2020', '481', 'RandomText13', 'RandomText13', 'RandomText13', 'RandomText13', 'SYS1 '),
    ('A00002_B00002', 'SSN2', 'B00002_20160612', '12/06/2016', '461', 'RandomText14', 'RandomText14', 'RandomText14', 'RandomText14', 'SYS2 '),
    ('A00002_B00002', 'SSN2', 'B00002_20160612', '12/06/2016', '363', 'RandomText15', 'RandomText15', 'RandomText15', 'RandomText15', 'SYS2 '),
    ('A00002_B00002', 'SSN2', 'B00002_20160612', '12/06/2016', '499', 'RandomText16', 'RandomText16', 'RandomText16', 'RandomText16', 'SYS2 '),
    ('A00002_B00002', 'SSN2', 'B00002_20160626', '26/06/2016', '362', 'RandomText17', 'RandomText17', 'RandomText17', 'RandomText17', 'SYS2 ');

    The number of rows remains the same in the input and the output. Because an SSN can have two ids associated with, I want to combine the two individual ids coming from the different systems and make it the new Employee ID

    Hope that helped

    Thanks

    • This reply was modified 2 years, 3 months ago by  masterelaichi.
  • How about something like this?

    WITH ssn_empID AS (
    SELECT DISTINCT
    Employee_ID, SSN
    FROM #input
    ),
    ssn_empIDagg AS (
    SELECT
    STRING_AGG(Employee_ID,'_') AS newEmployee_ID,
    SSN
    FROM ssn_empID
    GROUP BY SSN
    )
    SELECT
    se.newEmployee_ID,
    i.*
    FROM #input i
    JOIN ssn_empIDagg se ON se.ssn = i.ssn
  • hmm, I tried the above code. I end up getting a number ids concatenated together. For example, if ID A00001 has multiple rows on 3 paydates and Id B00001 has multiple rows for 2 paydates, I end up getting "A00001_A00001 _A00001_B00001 _B00001"

  • masterelaichi wrote:

    However, every employee has a unique SSN.

    For the love of all that is holy in SQL Server, please tell me that your SSNs are NOT in plain, 9 digit text! If they are, you folks really need to fix them and I mean NOW!  And, no... I'm not talking about hashing... I'm talking about real encryption with a salt and a master key, etc.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • The column in my dataset is not actually SSN. I just used it as example in this scenario to highlight a unique column

  • Hopefully, that's true.  The last person that told me that was at a company the I worked at temporarily.  They told me that AFTER I had already seen that it was untrue.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I think I managed to solve it using two CTEs and unioning them.  I then used another CTE on the combined table to remove any duplicates

    • This reply was modified 2 years, 2 months ago by  masterelaichi.
  • I'm pretty sure you don't need multiple CTE's to accomplish this task.   Given the data supplied in the original post, the following should work (note that varchar(300) is rather wasteful for the data supplied, so I brought that down to just 20):

    CREATE TABLE #input (
    Employee_ID varchar(20),
    SSN varchar(20),
    Pay_ID varchar(20),
    [Date] varchar(20),
    AmountPaid varchar(20),
    Col1 varchar(20),
    Col2 varchar(20),
    Col3 varchar(20),
    Col4 varchar(20),
    [source] varchar(20)
    );

    INSERT INTO #input (Employee_ID,SSN,Pay_ID,Date,AmountPaid,Col1,Col2,Col3,Col4,source )
    VALUES
    ('A00001', 'SSN1', 'A00001_20170910', '10/09/2017', '412', 'RandomText1', 'RandomText2', 'RandomText3', 'RandomText4', 'SYS1 '),
    ('A00001', 'SSN1', 'A00001_20170910', '10/09/2017', '164', 'RandomText2', 'RandomText3', 'RandomText4', 'RandomText5', 'SYS1 '),
    ('A00001', 'SSN1', 'A00001_20180114', '14/01/2018', '295', 'RandomText3', 'RandomText4', 'RandomText5', 'RandomText6', 'SYS1 '),
    ('B00001', 'SSN1', 'B00001_20201210', '10/12/2020', '146', 'RandomText4', 'RandomText5', 'RandomText6', 'RandomText7', 'SYS1 '),
    ('B00001', 'SSN1', 'B00001_20201210', '10/12/2020', '183', 'RandomText5', 'RandomText6', 'RandomText7', 'RandomText8', 'SYS1 '),
    ('B00001', 'SSN1', 'B00001_20201210', '10/12/2020', '211', 'RandomText6', 'RandomText7', 'RandomText8', 'RandomText9', 'SYS1 '),
    ('A00001', 'SSN1', 'A00001_20170521', '21/05/2017', '271', 'RandomText7', 'RandomText7', 'RandomText7', 'RandomText7', 'SYS2 '),
    ('A00001', 'SSN1', 'A00001_20170409', '9/04/2017', '310', 'RandomText8', 'RandomText8', 'RandomText8', 'RandomText8', 'SYS2 '),
    ('A00001', 'SSN1', 'A00001_20170409', '9/04/2017', '449', 'RandomText9', 'RandomText9', 'RandomText9', 'RandomText9', 'SYS2 '),
    ('A00001', 'SSN1', 'A00001_20170312', '12/03/2017', '385', 'RandomText10', 'RandomText10', 'RandomText10', 'RandomText10', 'SYS2 '),
    ('A00002', 'SSN2', 'A00002_20201126', '26/11/2020', '353', 'RandomText11', 'RandomText11', 'RandomText11', 'RandomText11', 'SYS1 '),
    ('A00002', 'SSN2', 'A00002_20201126', '26/11/2020', '305', 'RandomText12', 'RandomText12', 'RandomText12', 'RandomText12', 'SYS1 '),
    ('A00002', 'SSN2', 'A00002_20201126', '26/11/2020', '481', 'RandomText13', 'RandomText13', 'RandomText13', 'RandomText13', 'SYS1 '),
    ('B00002', 'SSN2', 'B00002_20160612', '12/06/2016', '461', 'RandomText14', 'RandomText14', 'RandomText14', 'RandomText14', 'SYS2 '),
    ('B00002', 'SSN2', 'B00002_20160612', '12/06/2016', '363', 'RandomText15', 'RandomText15', 'RandomText15', 'RandomText15', 'SYS2 '),
    ('B00002', 'SSN2', 'B00002_20160612', '12/06/2016', '499', 'RandomText16', 'RandomText16', 'RandomText16', 'RandomText16', 'SYS2 '),
    ('B00002', 'SSN2', 'B00002_20160626', '26/06/2016', '362', 'RandomText17', 'RandomText17', 'RandomText17', 'RandomText17', 'SYS2 ');

    WITH AGGREGATES AS (

    SELECT DISTINCT I.SSN, I.Employee_ID
    FROM #input AS I
    )
    SELECT
    A.SSN,
    STRING_AGG(A.Employee_ID, '_') WITHIN GROUP (ORDER BY A.Employee_ID) AS NEW_EMP_ID
    FROM AGGREGATES AS A
    GROUP BY A.SSN
    ORDER BY A.SSN;

    The results are:

    SSNNEW_EMP_ID
    SSN1A00001_B00001
    SSN2A00002_B00002

    • This reply was modified 2 years, 2 months ago by  sgmunson. Reason: o instead of 0 needed fixed
    • This reply was modified 2 years, 2 months ago by  sgmunson. Reason: Forgot to include results

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • The reason I used two CTEs is because I have another dataset in which the ID are in two different columns leading to duplicates

    With the sample above, I am not able to extract the complete payroll data with all the columns. Here is the code I tried

    WITH AGGREGATES AS (

    SELECT DISTINCT I.*
    FROM #input AS I
    )
    SELECT
    A.SSN,
    STRING_AGG(A.Employee_ID, '_') WITHIN GROUP (ORDER BY A.Employee_ID) AS NEW_EMP_ID
    FROM AGGREGATES AS A
    GROUP BY A.SSN
    ORDER BY A.SSN;


    Result
    SSNNEW_EMP_ID
    SSN1A00001_A00001_A00001_A00001_A00001_A00001_A00001_B00001_B00001_B00001
    SSN2A00002_A00002_A00002_B00002_B00002_B00002_B00002

     

     

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply