SSIS Custom dedupe/counter at data import - Need inspiration

  • Hi All,

    I have created an SSIS package to import data from an Excel file, then lookup additional info from an SQL datatable (based on multiple criteria), then export to a datatable destination.

    I would like to add another step to this process but i'm not sure what I would need to do this, possibly a custom script/SQL task.

    After the lookup task has completed I would like to dedupe records based on whether the reference has appeared already in the last 7 days (each row has a reference number and datetimestamp). The data file has records in date order, so I think what would work is a rolling equation that looks at those records in the last 7 days only, then either assigns "1" for unique or "0" for duplicates, in a seperate column (the earliest recorded reference would be the unique value).

    I'm not sure how I achieve this though, so I am looking for inspiration, or open to any other suggestions.

    Thanks in advance

  • piet_dj (4/24/2008)


    Hi All,

    I have created an SSIS package to import data from an Excel file, then lookup additional info from an SQL datatable (based on multiple criteria), then export to a datatable destination.

    I would like to add another step to this process but i'm not sure what I would need to do this, possibly a custom script/SQL task.

    After the lookup task has completed I would like to dedupe records based on whether the reference has appeared already in the last 7 days (each row has a reference number and datetimestamp). The data file has records in date order, so I think what would work is a rolling equation that looks at those records in the last 7 days only, then either assigns "1" for unique or "0" for duplicates, in a seperate column (the earliest recorded reference would be the unique value).

    I'm not sure how I achieve this though, so I am looking for inspiration, or open to any other suggestions.

    Thanks in advance

    From what I can gather from your request, it looks like the easiest way is to join the data table to a subquery that groups all records in the last 7 days by reference and filters out the unique records using the having clause. Maybe you could change the following to fit your needs?

    ALTER TABLE dataTable ADD is_unique bit DEFAULT 1

    UPDATEdataTable

    SET is_unique = 0

    FROMdataTable dt join (

    SELECTreference

    , count(*) as ref_count

    FROMdataTable

    WHEREdatetimestamp >= dateadd(d,-7,getdate())

    GROUP BY reference

    HAVING count(*) > 1) as rc on

    dt.reference = rc.reference

    WHERE dt.datetimestamp >= dateadd(d,-7,getdate())

    --------
    [font="Tahoma"]I love deadlines. I like the whooshing sound they make as they fly by. -Douglas Adams[/font]

  • Hi Rob,

    Thanks for your quick response. I'm new to SSIS and I don't think I explained enough earlier. I want every record to be included in the destination import, with a rolling calculation looking back 7 days, for every row, then moving on to the next and applying the same 7 day rule.

    This is what I would like to achieve:

    Ref No-----------Date-----------Unique

    123456--------01/04/2008---------- 1

    123457--------01/04/2008---------- 1

    123456--------05/04/2008---------- 0

    123457--------10/04/2008---------- 1

    I'd like to do this after the lookup step has completed, then import the results into the SQL datatable.

    many thanks

  • Ah, in that case, although a cursor or table variable would probably be the obvious answer, try the following and see if it hits the spot. It adds the column marking them all as unique, and then updates those rows where the the ref has cropped up in the 7 days prior to the record, and the counterpart. Is that what you're after?

    ALTER TABLE dataTable ADD is_unique bit NOT NULL DEFAULT 1

    UPDATE dataTable

    SET is_unique = 0

    FROM dataTable dt join (

    SELECT dt1.reference

    , dt1.datetimestamp

    , count(*) as ref_count

    FROM dataTable dt1 left outer join dataTable dt2 on

    dt1.reference = dt2.reference

    WHERE dt2.datetimestamp between dateadd(d,-7,dt1.datetimestamp) and dt1.datetimestamp

    GROUP BY dt1.reference, dt1.datetimestamp

    HAVING count(*) > 1) as rc on

    dt.reference = rc.reference

    and dt.datetimestamp between dateadd(d,-7,rc.datetimestamp) and rc.datetimestamp

    There are far more talented SQL Ninjas on this site, so hopefully someone else will refine this / point out the obvious flaws!

    --------
    [font="Tahoma"]I love deadlines. I like the whooshing sound they make as they fly by. -Douglas Adams[/font]

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply