SSIS Custom dedupe/counter at data import - Need inspiration

Question

SSIS Custom dedupe/counter at data import - Need inspiration

piet_dj

SSC-Addicted

Points: 428
More actions
April 24, 2008 at 7:58 am

#123461

Hi All,
I have created an SSIS package to import data from an Excel file, then lookup additional info from an SQL datatable (based on multiple criteria), then export to a datatable destination.
I would like to add another step to this process but i'm not sure what I would need to do this, possibly a custom script/SQL task.
After the lookup task has completed I would like to dedupe records based on whether the reference has appeared already in the last 7 days (each row has a reference number and datetimestamp). The data file has records in date order, so I think what would work is a rolling equation that looks at those records in the last 7 days only, then either assigns "1" for unique or "0" for duplicates, in a seperate column (the earliest recorded reference would be the unique value).
I'm not sure how I achieve this though, so I am looking for inspiration, or open to any other suggestions.
Thanks in advance

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply

Rob Goddard SSCrazy Points: 2622 More actions · Answer 1

piet_dj (4/24/2008)
Hi All,
I have created an SSIS package to import data from an Excel file, then lookup additional info from an SQL datatable (based on multiple criteria), then export to a datatable destination.
I would like to add another step to this process but i'm not sure what I would need to do this, possibly a custom script/SQL task.
After the lookup task has completed I would like to dedupe records based on whether the reference has appeared already in the last 7 days (each row has a reference number and datetimestamp). The data file has records in date order, so I think what would work is a rolling equation that looks at those records in the last 7 days only, then either assigns "1" for unique or "0" for duplicates, in a seperate column (the earliest recorded reference would be the unique value).
I'm not sure how I achieve this though, so I am looking for inspiration, or open to any other suggestions.
Thanks in advance

From what I can gather from your request, it looks like the easiest way is to join the data table to a subquery that groups all records in the last 7 days by reference and filters out the unique records using the having clause. Maybe you could change the following to fit your needs?

ALTER TABLE dataTable ADD is_unique bit DEFAULT 1

UPDATEdataTable

SET is_unique = 0

FROMdataTable dt join (

SELECTreference

, count(*) as ref_count

FROMdataTable

WHEREdatetimestamp >= dateadd(d,-7,getdate())

GROUP BY reference

HAVING count(*) > 1) as rc on

dt.reference = rc.reference

WHERE dt.datetimestamp >= dateadd(d,-7,getdate())

--------
[font="Tahoma"]I love deadlines. I like the whooshing sound they make as they fly by. -Douglas Adams[/font]

piet_dj SSC-Addicted Points: 428 More actions · Answer 2

Hi Rob,

Thanks for your quick response. I'm new to SSIS and I don't think I explained enough earlier. I want every record to be included in the destination import, with a rolling calculation looking back 7 days, for every row, then moving on to the next and applying the same 7 day rule.

This is what I would like to achieve:

Ref No-----------Date-----------Unique

123456--------01/04/2008---------- 1

123457--------01/04/2008---------- 1

123456--------05/04/2008---------- 0

123457--------10/04/2008---------- 1

I'd like to do this after the lookup step has completed, then import the results into the SQL datatable.

many thanks

Rob Goddard SSCrazy Points: 2622 More actions · Answer 3

Ah, in that case, although a cursor or table variable would probably be the obvious answer, try the following and see if it hits the spot. It adds the column marking them all as unique, and then updates those rows where the the ref has cropped up in the 7 days prior to the record, and the counterpart. Is that what you're after?

ALTER TABLE dataTable ADD is_unique bit NOT NULL DEFAULT 1

UPDATE dataTable

SET is_unique = 0

FROM dataTable dt join (

SELECT dt1.reference

, dt1.datetimestamp

, count(*) as ref_count

FROM dataTable dt1 left outer join dataTable dt2 on

dt1.reference = dt2.reference

WHERE dt2.datetimestamp between dateadd(d,-7,dt1.datetimestamp) and dt1.datetimestamp

GROUP BY dt1.reference, dt1.datetimestamp

HAVING count(*) > 1) as rc on

dt.reference = rc.reference

and dt.datetimestamp between dateadd(d,-7,rc.datetimestamp) and rc.datetimestamp

There are far more talented SQL Ninjas on this site, so hopefully someone else will refine this / point out the obvious flaws!

--------
[font="Tahoma"]I love deadlines. I like the whooshing sound they make as they fly by. -Douglas Adams[/font]