Measuring data quality using SSIS

  • Hi,

I've just started on a data warehouse project, and the first (unsurprising) snag we've hit is the appalling quality of the data we're getting. Not only is there bad data entry from users, but also missing and incomplete data where referential integrity is lacking in the underlying system.

To try to bring some attention to this, I want to produce a report on the data quality against approximately 30 business-specific rules that I've identified.

I had imagined doing this via SSIS and its logging facilities. I'm imagining that for every rule violation (or failed lookup), I can log a message indicating system/table/column and rule, insert it directly into the database for that run, and use this to generate reports and scores for the data.
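The logging scheme described above could be sketched roughly as follows. This is a minimal illustration in Python, not SSIS: the rule names, table names, and scoring formula are all hypothetical, and in practice each failure record would be written to a SQL Server log table from within the package.

```python
# Sketch of the rule-logging idea: each failed check produces one log
# record (system/table/column/rule/key), which can later be aggregated
# into quality scores. All names and rules here are invented examples.

from collections import namedtuple

LogEntry = namedtuple("LogEntry", "system table column rule key")

# Each rule: (rule name, table, column, predicate a value must satisfy)
RULES = [
    ("not_null",      "Customer", "Region", lambda v: v is not None),
    ("no_na_literal", "Customer", "Region", lambda v: v != "NA"),
]

def check_rows(system, table, rows):
    """Apply every matching rule to every row; one log entry per failure."""
    failures = []
    for key, row in rows.items():
        for rule, rule_table, column, predicate in RULES:
            if rule_table == table and not predicate(row.get(column)):
                failures.append(LogEntry(system, table, column, rule, key))
    return failures

def quality_score(failures, total_checks):
    """Score = proportion of checks that passed."""
    return 1.0 - len(failures) / total_checks

rows = {
    1: {"Region": "North"},
    2: {"Region": "NA"},   # fails no_na_literal
    3: {"Region": None},   # fails not_null
}

failures = check_rows("CRM", "Customer", rows)
total_checks = len(rows) * len(RULES)
print(len(failures), round(quality_score(failures, total_checks), 2))
```

Keeping the rules in a table-driven list like this (rather than hard-coding each check) is one way to stop the rule set becoming a maintenance burden as it grows.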

I haven't started doing it yet. My question is: is this easy to do in SSIS? I expect the number of rules and lookups to grow quite a lot, so I don't want to end up with a maintenance nightmare (SSIS does seem to make you repeat yourself quite a bit!). In particular, how do I filter the relevant messages out of the general runtime messages?
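On the filtering question, one common pattern (an assumption here, not something the thread confirms) is to tag every custom message with a distinctive prefix so rule messages can be separated from the general runtime noise afterwards. A rough sketch, with an arbitrary "DQ|" prefix and invented log lines:

```python
# Sketch: separate rule messages from general runtime noise by tagging
# custom messages with a fixed prefix ("DQ|" is an arbitrary choice)
# and filtering on it when reporting. The log lines are invented.

PREFIX = "DQ|"

def parse_dq_messages(messages):
    """Keep only prefixed data-quality messages; split them into fields."""
    out = []
    for msg in messages:
        if msg.startswith(PREFIX):
            system, table, column, rule = msg[len(PREFIX):].split("|")
            out.append({"system": system, "table": table,
                        "column": column, "rule": rule})
    return out

log = [
    "Package started",
    "DQ|CRM|Customer|Region|not_null",
    "Buffer manager allocated 10MB",
    "DQ|CRM|Customer|Region|no_na_literal",
]
dq = parse_dq_messages(log)
print(len(dq))  # 2
```

The same filter could just as easily be a WHERE clause over the SSIS log table, matching messages that start with the chosen prefix.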

    Has anyone done anything like this? How did you approach it?

    Thanks,

It's obviously not set up specifically for your particular rules, but if you have SQL Server 2008, there's a new Data Profiling task - have you tried running it against your data? The result is XML and requires a viewer, but the viewer is dropped to disk when you install, so you have everything you need already.

And the next release of SQL Server (post-R2, I guess) will likely add data quality services to go along with the newly released master data services.

    HTH,

    Steve.

Thanks - I looked at the Data Profiling task, but as far as I can see, it isn't possible to add your own rules.

For example, I want to log each time I change "NA" to null, each time I remove duplicate keys, and so on.
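As a concrete illustration of that kind of rule, here is a hedged sketch (Python, with a made-up column name) of a cleansing step that converts the literal "NA" to null while recording each change, so the corrections themselves become the raw material for the quality report:

```python
# Sketch: cleanse a column ("NA" -> None) while logging every change,
# so the change log can feed a data-quality report. The "Phone" column
# and key scheme are hypothetical examples.

def cleanse_na(rows, column, log):
    """Replace the literal 'NA' with None in place, logging each change."""
    for key, row in rows.items():
        if row.get(column) == "NA":
            row[column] = None
            log.append(("na_to_null", column, key))
    return rows

log = []
rows = {1: {"Phone": "555-0100"}, 2: {"Phone": "NA"}, 3: {"Phone": "NA"}}
cleanse_na(rows, "Phone", log)
print(len(log))  # two 'NA' values were converted
```

In SSIS terms, the equivalent would be a Derived Column or Script transformation doing the substitution, with each correction also written to the log destination.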
