Iterate Through Tables, Using a Control Table w/ a List of Tables to Process?

  • I'm trying to create an SSIS package that will query a master control table [TableList], which contains 2 columns (SourceTable and TargetTable), then load those tables from a source server to a target server.

    It "runs", but although the variables with the table names seem to update for each iteration, it looks like the table metadata doesn't update for each new iteration for each new pair of tables.  So the package errors out on the second step through that indicates it's still looking for columns from the first table:

    Error: 0xC0202005 at Data Flow Task, OLE DB Source [105]: Column "facility_id" cannot be found at the datasource.

    Error: 0xC004701A at Data Flow Task, SSIS.Pipeline: OLE DB Destination failed the pre-execute phase and returned error code 0xC0202025.

    Information: 0x4004300B at Data Flow Task, SSIS.Pipeline: "OLE DB Destination" wrote 0 rows.

    (where facility_id is a column in the first table, but not in subsequent tables).

    Here's what I've done:

    1. Created a control table with 2 columns, SourceTable and TargetTable.  Added just 2 rows for testing (the strings inserted include schema names also), each row corresponding to 1 table from the source to load to the target.
    2. Created empty versions of the tables on target server, using DDL from the source server.
    3. In SSIS, created static OLEDB data connectors to the source and target servers.  (Server names and DB names don't change for this, only the table names).
4. Added 3 variables: (1) TableMappings, type = Object, to hold the results of the SQL query of the control table [TableList]; (2) SourceTable (string); (3) TargetTable (string).  Do I need to give the two string variables values?  If I manually assign correct values to the source/target string variables, the first table does load data, but it then fails on the second table.  If I leave them blank, I get errors like "The number of input columns for OLE DB Destination.Inputs[OLE DB Destination Input] cannot be zero."
5. Dragged an Execute SQL Task in, pointed it to the target server where table TableList is, and added a query to it [SELECT SourceTable, TargetTable FROM DBA.TableList ORDER BY SourceTable;].  Changed ResultSet to Full result set.  Under the Result Set tab, added a result named 0, with Variable Name = User::TableMappings.
    6. Dragged a Foreach Loop Container over, added a Data Flow task.
7. In the Data Flow, added both an OLE DB source and an OLE DB destination, each pointed to the correct connection, and connected them.  Changed Data access mode to "Table name or view name variable" and pointed the variable to User::SourceTable and User::TargetTable, respectively.
    8. Ran it.  Throws errors and fails.

    What I've tried to fix this:

1. Added a breakpoint on the Foreach Loop to pause before each iteration, and added watches on the 2 table-name variables.  These names DO update correctly.
    2. Changed both OLEDB source and destination to ValidateExternalMetadata = False
    3. Changed Foreach Loop DelayValidation=True

Based on the watches, the correct table names look like they're getting retrieved.  But it seems the column metadata in the data flow source/destination is stale and isn't refreshed for each new pair of tables.

    Thanks for any assistance you can offer!

    SQL 2016.  VS2019, with package set to 2016 mode.

     

    Rich

The metadata isn't going to change; that's a design-time thing.

    What you're trying to do isn't going to work.

Once you create a data flow, that's it; that's the only metadata it will have.

You'll have better luck using stored procedures if you need to do this dynamically. Then you can use your metadata to build the statements and execute them. Of course, depending on the source and destination servers/platforms, this may not be an option.

Another option might be to create the .dtsx files on the fly. Maybe look at Biml (Business Intelligence Markup Language). I've never really used it, so I can't offer much advice. Here's a resource: https://www.bimlscript.com/
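
To make the "use your metadata to build statements and execute them" idea concrete, here is a rough, untested C# sketch (the same logic could live in a T-SQL stored procedure using dynamic SQL instead). It assumes both source and target tables are reachable from a single connection, i.e. the same server or the source exposed through a linked server, which is exactly the caveat above; connection strings and server names are placeholders.

using System.Collections.Generic;
using System.Data.SqlClient;

class BuildAndRunStatements
{
    static void Main()
    {
        // Placeholder connection string for the target server, where DBA.TableList lives.
        const string connStr = "Data Source=TargetServer;Initial Catalog=TargetDb;Integrated Security=SSPI;";

        using (var conn = new SqlConnection(connStr))
        {
            conn.Open();

            // Build one INSERT ... SELECT per row in the control table.
            var statements = new List<string>();
            using (var cmd = new SqlCommand("SELECT SourceTable, TargetTable FROM DBA.TableList;", conn))
            using (var rdr = cmd.ExecuteReader())
                while (rdr.Read())
                    // If the source is a different server, the SourceTable value would need to
                    // include a linked-server prefix, e.g. SRCSRV.SourceDb.dbo.SomeTable (hypothetical name).
                    statements.Add($"INSERT INTO {rdr.GetString(1)} SELECT * FROM {rdr.GetString(0)};");

            // Execute each generated statement.
            foreach (var sql in statements)
                using (var exec = new SqlCommand(sql, conn) { CommandTimeout = 0 })
                    exec.ExecuteNonQuery();
        }
    }
}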

Isn't this a job for "Replication"?  IIRC, it can even update the metadata in the target to match that in the source.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • If you can purchase third-party software products, the following may be of interest:

    https://www.cozyroc.com/ssis/data-flow-task

    But, as stated above, the structure of SSIS packages is fixed at design time. If you create separate dataflows for each of the tables in your control table, you can make this work, though it makes your package ungainly. Better to have separate packages (one per table) and have a master package which executes them, once again depending on what is in the control table.

    Despite the additional complexity, each of these solutions offers a potential performance improvement over your current intended architecture, because the multiple dataflows / packages may be executed in parallel.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

This is one of those cases where you do NOT want to use the SSIS native data flow to do the transfer. Some alternatives:

      <li class="text--left" style="text-align: left;">SSIS Script - using .net sql datareader to read and sqlbulkcopy to insert

    • plain c# app - same as above
    • <li class="text--left" style="text-align: left;">powershell - same as above or using Dbatools

      <li class="text--left" style="text-align: left;">bcp (out and in)

In all of the above, you would supply the source/destination servers and a list of table-name pairs (source/destination name).

If the columns between source and destination are exactly the same, it is easier; if they differ, you will need to do a bit more code (SqlBulkCopy ColumnMappings), but nothing significant.
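
For illustration, here is a rough, untested sketch of the SqlDataReader + SqlBulkCopy option in C#. The connection strings and server names are placeholders, the DBA.TableList query comes from the original post, and identical column layouts between source and target are assumed (add ColumnMappings if they differ).

using System;
using System.Collections.Generic;
using System.Data.SqlClient;

class TableCopy
{
    static void Main()
    {
        // Placeholder connection strings -- point these at the real source and target servers.
        const string sourceConn = "Data Source=SourceServer;Initial Catalog=SourceDb;Integrated Security=SSPI;";
        const string targetConn = "Data Source=TargetServer;Initial Catalog=TargetDb;Integrated Security=SSPI;";

        // Read the (SourceTable, TargetTable) pairs from the control table on the target server.
        var pairs = new List<(string Source, string Target)>();
        using (var conn = new SqlConnection(targetConn))
        using (var cmd = new SqlCommand("SELECT SourceTable, TargetTable FROM DBA.TableList ORDER BY SourceTable;", conn))
        {
            conn.Open();
            using (var rdr = cmd.ExecuteReader())
                while (rdr.Read())
                    pairs.Add((rdr.GetString(0), rdr.GetString(1)));
        }

        foreach (var (source, target) in pairs)
        {
            using (var src = new SqlConnection(sourceConn))
            using (var dst = new SqlConnection(targetConn))
            {
                src.Open();
                dst.Open();

                // Stream rows from the source table (names in the control table already include the schema)...
                using (var cmd = new SqlCommand($"SELECT * FROM {source};", src))
                using (var reader = cmd.ExecuteReader())
                // ...and bulk insert them into the target table.
                using (var bulk = new SqlBulkCopy(dst) { DestinationTableName = target, BulkCopyTimeout = 0 })
                {
                    // If source and destination columns differ, add explicit mappings here, e.g.:
                    // bulk.ColumnMappings.Add("source_col", "dest_col");
                    bulk.WriteToServer(reader);
                }
            }

            Console.WriteLine($"Copied {source} -> {target}");
        }
    }
}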

  • Thanks everyone for your responses.  Good to have confirmation that I wasn't simply inept in not getting my test to work with dynamic iteration over a list of tables.  I'm surprised to learn that SSIS doesn't natively support dynamic refresh of table metadata (I'm not an SSIS guru by any means).

    Phil: up until now, I've been running this package with manually-created pairs of OLEDB source/destination, one pair per table.  I put all of these source-destination pairs (there are maybe 20 pairs, for 20 tables) into one data flow, in one package.  You suggested splitting these off into one-table-per-package (or per-dataflow).  Is there any reason for your recommendation?  At least in debug mode, the tables are processed in parallel when I put them all into one dataflow.

    Jeff: I have restricted access to the source server, just SELECT permissions on a single schema, so setting up replication is unlikely to be possible without a lot of conversation and negotiation.

    Thank you all again!

  • There are reasons for separating logic into multiple packages, even though it may seem less tidy and requires more config up front.

    Number 1 is around error handling and the debugging of errors. If you have 20 source-target pairs in a single dataflow and any one of them has an error, you might find it difficult to track down which pair(s) caused the error. You may also find it painful trying to reproduce the error, because all of the other successful pairs are running within the same container. If you have separate packages, this whole process is easier, from identification to resolution.

    Number 2 is control. You can control the order of execution and which packages run in parallel or series. I use an SSIS execution framework which allows me to define all of this in metadata.

    Number 3 is maintenance and adherence to software engineering best practices (each package should do one thing, and keep it simple).

    SSIS packages tend to grow over time. Before you know it, you could be dealing with a monster, unless you stick to best practices.

    Having said all of this, it does feel like overkill in your situation. One intermediate step you should consider is to put each source-target pair into its own dataflow. This would immediately give you control over execution order and parallelism and would make debugging easier, because each dataflow can be executed independently of the others (via right-click / Execute Task).


  • Thanks for the detailed reply, Phil.  I guess I meant to include in my last post "is there a reason to split this up other than for debugging/error-trapping?".

    For my purposes, this is easier to maintain in one data flow, and I've been running this package with an Agent job for over a year without any errors (with about 20 tables in the one data flow).  I need to add some more tables, and I started to think this was getting a bit overloaded to maintain with all these source/destination pairs; I take your points.  Still, replacing those with 20 data flows feels to me like making this more work, not less, for me to maintain.  Sounds like you are doing some more advanced SSIS work than my simple ETL (as I've implemented most of my value-added work using post-load T-SQL stored procs, called in SSIS as Execute SQL tasks).

    I don't like being *that guy* who creates expedient rather than well-ordered code... I'll give this some more thought.

    Still a bit surprised this isn't provided as native functionality in SSIS, without resorting to third-party tools or C# coding.  (Then again, MS shipped the last version of Visual Studio without the ability to work with SSIS, so...)

    Thanks again!  I haven't been on here in quite some time, but there was a long period when I was on here almost daily, and I've learned a lot from your posts over the years!

  • No problem.

    Just picking up on one point in your last comment, I 100% agree with using stored procs to implement business logic on the SQL Server side. I mostly use SSIS to move data from A to B, where A and B are not the same database (or technology), and to orchestrate that movement. The simpler, the better.


A couple of things I would like to add to Phil's reply about logic separation.

Way back when we started with SSIS in 2005 and didn't have a clue, we built some monster packages. They wouldn't load properly, took forever to open, and would even crash VS. One thing SSIS likes to do is validate everything upon opening; separation helps alleviate this.

One of the projects we created back in 2005 as one monster package is still running today, but it now has about 30 packages. Our rule is generally one data flow per package. As Phil said, it makes testing and debugging much easier.

Another thing that doesn't work well with SSIS is code merging. With separation, you can have people work on individual packages and still be able to merge their changes, since the code is in separate files. Even if you're not merging, it still allows multiple devs to work on the same project more easily.
