Scalable SSIS Package - Single SSIS package across 50 different projects

  • Hi All,

    I have a successfully running SSIS package which was created for a single project. It reads a pipe-delimited file, pulls the data from the source, and loads it directly into the destination table. For each project I use the same job, driven by a job execution table. The SSIS package checks the job execution table to see whether any job is due to run for the day and, if so, runs it for that project. Once the execution for the project completes successfully, an Execute SQL task in the package creates an entry in the job execution table for the next day. The metadata doesn't change across projects, hence I am using a single SSIS package to run across different projects.

    The volume being loaded is low for now and the number of projects is small, so I don't see any issues with the job at the moment.

    However, thinking about the future, if the volume grows, say more than 100,000 records in each file, with ten files per project, a run every 5 hours for each project, and 50 projects in total, then the current approach might become a problem. Hence I want to know how to make this scalable.

    • Is there any way to run multiple threads in a single package, with each thread processing its own project or file based on the job execution table?

    I am looking for a better architecture for my project so that it's more scalable and sophisticated. Thanks in advance for the ideas.

  • One possibility would be to add a 'WorkerNumber' (INT) column to your JobExecution table. Let's say you have 20 jobs, which would equate to 4 groups, each containing 5 jobs.

    Now create 4 SQL Agent jobs which call your package, passing the WorkerNumber as a parameter.

    Each Agent job then executes the next available job matching its WorkerNumber.

    If you want to implement parallelism purely from within SSIS, some more extensive changes would be required ... but it can certainly be done.
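
    To make that concrete, here is a rough C# sketch of the 'claim the next job for my worker' step, the sort of thing the package could run in a script task before its data flow. The JobExecution column names (JobId, Status, RunDate) and the connection string are assumptions, not your actual schema.

    using System;
    using System.Data;
    using System.Data.SqlClient;

    class NextJobSketch
    {
        // Claims the next pending job for this worker; returns null when nothing is due.
        static int? ClaimNextJob(string connectionString, int workerNumber)
        {
            const string sql = @"
                UPDATE TOP (1) dbo.JobExecution WITH (READPAST, UPDLOCK, ROWLOCK)
                SET    Status = 'Running'
                OUTPUT inserted.JobId
                WHERE  WorkerNumber = @WorkerNumber
                  AND  Status      = 'Pending'
                  AND  RunDate     = CAST(GETDATE() AS date);";

            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(sql, conn))
            {
                cmd.Parameters.Add("@WorkerNumber", SqlDbType.Int).Value = workerNumber;
                conn.Open();
                object jobId = cmd.ExecuteScalar();          // null if no pending job for this worker
                return jobId == null ? (int?)null : (int)jobId;
            }
        }

        static void Main(string[] args)
        {
            // The WorkerNumber would arrive as a package parameter set by each SQL Agent job step.
            int worker = int.Parse(args[0]);
            int? jobId = ClaimNextJob("Server=.;Database=ETL;Integrated Security=true", worker);
            Console.WriteLine(jobId.HasValue ? "Claimed job " + jobId : "Nothing to run");
        }
    }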

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • Thanks Phil,

    Currently I use separate jobs for each customer. However, when I change any config parameters, each job needs to be reconfigured.

    I also need to use dynamic column mapping instead of static mapping, because I foresee that a few customers' metadata will differ from what it is currently. Having separate packages for each customer would hurt maintainability.

    What I have done currently is to create a table in which the source columns and the destination columns are mapped. Using a C# script, I read the first row of the source file and compare it with the source columns retrieved from the mapping table. I keep the columns that match between the file and the table and load the data into a DataTable in C#. Then I build a SQL INSERT statement in C# to load the values into the destination table, because the column names change from one customer to another.
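
    In outline, the matching step looks something like the sketch below; the ColumnMapping table and its column names are simplified here for illustration.

    using System;
    using System.Collections.Generic;
    using System.Data;
    using System.Data.SqlClient;
    using System.IO;
    using System.Linq;

    class DynamicMappingSketch
    {
        static DataTable LoadMatchingColumns(string filePath, string connectionString, int customerId)
        {
            // 1. Pull SourceColumn -> DestinationColumn pairs for this customer from the mapping table.
            var mapping = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(
                "SELECT SourceColumn, DestinationColumn FROM dbo.ColumnMapping WHERE CustomerId = @CustomerId", conn))
            {
                cmd.Parameters.Add("@CustomerId", SqlDbType.Int).Value = customerId;
                conn.Open();
                using (var rdr = cmd.ExecuteReader())
                    while (rdr.Read())
                        mapping[rdr.GetString(0)] = rdr.GetString(1);
            }

            // 2. Read the header row of the pipe-delimited file and keep only the columns found in the mapping.
            string[] lines = File.ReadAllLines(filePath);
            string[] header = lines[0].Split('|');
            int[] keep = Enumerable.Range(0, header.Length)
                                   .Where(i => mapping.ContainsKey(header[i].Trim()))
                                   .ToArray();

            // 3. Load the matching columns into a DataTable, using the destination column names.
            var table = new DataTable();
            foreach (int i in keep)
                table.Columns.Add(mapping[header[i].Trim()]);

            foreach (string line in lines.Skip(1))
            {
                string[] fields = line.Split('|');
                table.Rows.Add(keep.Select(i => (object)fields[i]).ToArray());
            }
            return table;
        }
    }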

    Inserting rows from the DataTable is slow because it inserts the data row by row. Is there a better way to handle this? Hence, I want to know what methods I can use for parallelism.

  • Unfortunately, your flexible architecture has resulted in a performance trade-off which is not easily remedied.

    If you can buy and install third party products, you might like to think about Cozyroc's data flow task, which should, in theory, give you the performance of a standard data flow with dynamic column mappings. Disclaimer: I have never used this task myself, though I have used other Cozyroc SSIS components and they worked well.

    Another possible approach is to create very simple packages, one per customer, to do the initial import into 'landing' tables which have the same format as the incoming files. From there, create a stored proc which outputs the customer's data in your standard, conformed format. Use this proc as the input to your ETL process. As long as the procs all provide data in the same format, you can use a single package to process them all.
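
    Purely to illustrate the 'same shape in, one loader' idea, here is a rough C# sketch of the consuming side. The proc and table names are invented, and the SqlBulkCopy call simply stands in for whatever source and destination your single package would actually use.

    using System.Data.SqlClient;

    class ConformedLoadSketch
    {
        static void LoadCustomer(string sourceConn, string destConn, int customerId)
        {
            using (var src = new SqlConnection(sourceConn))
            using (var cmd = new SqlCommand("EXEC dbo.usp_GetConformedData @CustomerId = @CustomerId", src))
            {
                cmd.Parameters.AddWithValue("@CustomerId", customerId);
                src.Open();

                using (var reader = cmd.ExecuteReader())        // every customer's proc returns the same columns
                using (var bulk = new SqlBulkCopy(destConn))
                {
                    bulk.DestinationTableName = "dbo.DestinationTable";
                    bulk.BatchSize = 10000;                      // stream in batches rather than row by row
                    bulk.WriteToServer(reader);
                }
            }
        }
    }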

    There's more maintenance required, but it should scale and it should be fast.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin
