A single, dynamic package/flow to handle change data capture for several tables

  • Hi all!

    Being pretty new to SSIS and new to this forum I hope my post does not represent something that has been asked over and over again. If so, please accept my apologies.

    Here is my situation:

    I am pretty new to BI and data warehouse development. As part of a school project I have recently been involved in a small data warehouse solution in which I have constructed a "dynamic" SQL script that runs in the staging area and that calculates if incoming data is New, Updated, Deleted or Same, compared to the existing data in the data warehouse. As I have learned, this is standard functionality for a Data Warehouse solution. We chose to build our own functionality for this eventhough we know there are already known methods to achieve it since those methods seem to require a certain setup and configuration in the Source system. Since we early on decided not to change anything in the source system at all the need for a home-made change data capture handling was obvious.

    The script is "dynamic" in the sense that no tables or columns are hardcoded, instead it uses some set conditions to find out which tables and key columns it needs for the comparison with existing data. A dynamic SQL command is built up and executed via EXEC as the last step (full outer join between IncomingDataTable and ExistingDataTable) and from which conclusions can be drawn whether the incoming data is New, Updated, Deleted or Same.

    The script works fine and does what it is supposed to (eventhough probably a bit slow since it includes the use of cursors :blush:).

    Anyway, now when the project is over (it went fine!), I am looking into the solution again and trying to see what functionality in the SQL-scripts that constitutes our DW-solution, we could perhaps achieve using SSIS instead. And that's why I have now ended up here writing this thread and can finally state my question, for which I would be very thankful to receive any input:

    I want to achieve the same functionality that the SQL-script described above does, in SSIS, in a single, "dynamic" package. Is this possible?

    I can successfully do this for one known, single table in which the primary/join key columns are known in beforehand and can be specified in for example a merge join SSIS item, but I am looking for the same dynamic way of handling it in SSIS, as done in the SQL script previously created. I don't want to create one specific package/flow for each table (could be hundreds of tables!) and then execute them all in sequence.

    I have already started trying to achieve this in SSIS, and I have come so far that I can successfully build up a dynamic SQL string that SELECTs the data needed (primary key columns and checksum column) for each table. "Execute SQL" tasks and "For each ADO item loops" tasks are used to achieve this. But here I am stuck. My intention was now to perform a merge join between IncomingDataTable and ExistingDataTable. But because of the dynamic SQL stored in a variable that is used as data source in the two OLE DB Sources (existing and new), I am unable to specify the Join Keys that the Merge Join component needs, since the keys are not know until the package executes. Of course I could always script the rest as SQL in an Execute SQL task or something similar, but going that path is more or less copying what I have already done once and that is meaningless. If there is a way forward using other methods and components in SSIS I would really like to use that.

    What surprises me about all this is that I have found many examples and tips on the net about how to achieve these things (like change data capture and slowly changing dimension handling), but everything seems to be exemplified and set up based on that the user is working with a single table, and if you have many tables, you have to create one package or flow for each table, and in each of those specify the columns and other things in the different SSIS components used.

    But creating one package for each table really can't be the best solution here right? There just have to be another, smarter way? I really must be missing something. In this project it is only a matter of 19 source tables from a single database, but what in much bigger projects, with let's say 1000 source tables?

    I am sorry if I am missing something really basic about SSIS here, like some component or concept that can be used to do this in just a second.

    I am very thankful for any input on this. Just ask if something is unclear and I will do my best to explain better! And sorry for the looong post 🙂

    /Martin

  • You would have to create a single package for each table which you wish to operate upon. Because SSIS is metadata driven (meaning that your columns/datatypes cannot change at run time), if you write the dynamic SQL in an Execute SQL task, and then expect to pass that to your Source component at run time, it will not work.

    It's a limitation of SSIS, unfortunately.

  • Ok I understand. I am a bit surprised that one package per table is the only solution to this using SSIS, but I guess I have to just live with it 🙂

    Thanks for the reply Justin!

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply