January 18, 2011 at 8:39 am
Hi,
Can you help me with this SSIS 2008 package pls.
I have a Library Data warehouse with a cutomer table – DimCustomer in it. A customer flat file extract comes from each of the branches of the library on a weekly basis. A customer could register in multiple branches – hence come as part of multiple files - so need to make sure only one instance of that customer ends up in DimCustomer table.
Here are the 2 things i need to implement whist loading this customer table:
1. 1st I need to dedupe within each of the files themselves – in case a branch sends the same customer (LibID, name & address) multiple times in the same file.
2. All already generated existing DimCustomerKEYs and the corresponding LibIDs are stored in a customer lookup table. So, the next step is to compare the extract against that lookup table for a match to determine if they are existing customer or new. If they fail, they will also be fuzzy matched against the DimCustomer table on their name & address. Depending on the outcome, I will either insert a new row in DimCustomer or update their existing record and the lookup table with the new LibID they’ve come with.
I am pretty confident with using SCD part of the package. Just need help with the de-duping especially (step 1 above) and fuzzy lookup.
Infact a shell of what the package should look like would be much appreciated.
Thank you.
January 19, 2011 at 11:10 am
Have you looked at MergeJoin task with Conditional Split?
You supply a flow from two sources and check for the links. Then use a conditional split task to flow the data where you need it to go
January 19, 2011 at 11:41 am
Could be too simple an approach (ie your problem could be more than I read into it) but why not simply aggregate? A group by of the key fields (ID and Address) would eliminate duplicates implicitly.
The main issue I would see using this approach is if a single ID was supplied with >1 address (ie steve_id, 123 West St and steve_id, 456 East St), however, this issue is more than a technical one, it requires business input because without that, your choice of which one is 'correct' is arbitrary.
Steve.
Viewing 3 posts - 1 through 2 (of 2 total)
You must be logged in to reply to this topic. Login to reply