Data Flow Suggestions? Anyone?...Ferris?

  • I am tasked with populating a new relational database from a non-relational database. I'm using SSIS 2005 to create a separate package for each of the 15-20 tables to load into the new schema. I'm typically having to perform a few field data type conversions, but not a lot; overall there are few data transforms required. Typically, I grab all the contents of the source table and have to load it into one or more destination tables based on field names. You know, some fields go here, while others go there type thing. I've created a number of packages thus far and noticed that it takes a very long time to run them. Some of them 3 to 5 hours! I was hoping some of you may have some suggestions as to how to speed things up. Perhaps my approach is whacked and I should try something else.

    First some background info:

    - Source and Destination databases are SQL Server 2005 (Same Server too)

    - The source database schema is basically 15-20 non-relational tables, each containing anywhere from 250k to 5M rows. Some tables have about 25 fields, while others have only 3 or 4.

    - The destination database has far fewer tables in a relational schema with constraints, etc.

    - The goal of the SSIS effort is to load each of the source tables into the new destination schema

    - Packages executed using 64-bit DTEXEC.exe

    Server Specs:

    OS Name: Microsoft(R) Windows(R) Server 2003 Enterprise x64 Edition

    Version: 5.2.3790 Service Pack 2 Build 3790

    Other OS Description: R2

    System Model: Virtual Machine

    System Type: x64-based PC

    Processor: Quad EM64T Family 6 Model 23 Stepping 10 GenuineIntel ~3159 MHz

    Total Physical Memory: 8,191.25 MB

    A typical package includes a single Data Flow Task. Within the Data Flow Task I have an OLE DB Source to grab all the source table info. Next I typically have to perform a Data Conversion or two on fields within the pipeline, followed by a Derived Column operation or a simple Lookup. From there it gets interesting, and this is more or less the crux of my post. Since the data typically will be distributed over many tables, I need to split the data flow up based on whatever tables need whichever fields. To do this I first used a Multicast transform. Each of the outputs is then fed into a Slowly Changing Dimension transform for each of the destination tables to basically UPSERT the data. (These SSIS packages will be used for future refreshes too.)

    For the sake of argument, I went ahead and implemented a similar package using a Script Component (asynchronous, as I perform an XXXBuffer.AddRow() for each input row) instead of the Multicast, and it took almost 2x as long! Is the Script Component used in this manner typically a hog and not recommended for this type of use? One observation: when I used the Multicast transform, only 1 of the processors was being used. When running the package with the Script Component, I see it leverage all 4 processors, yet it takes 2x as long.
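
    (Just to make the field-splitting concrete, here's a rough set-based equivalent of what the Multicast branches end up doing. The database, table and column names below are made up purely for illustration, not the real schema.)

    -- Hypothetical example: one wide source table feeding two destination
    -- tables, each taking a different subset of the columns.
    INSERT INTO NewDB.dbo.Customer (CustomerID, CustomerName, CreatedDate)
    SELECT s.cust_id,
           CAST(s.cust_name AS nvarchar(100)),   -- data type conversion
           CAST(s.created_dt AS datetime)
    FROM   OldDB.dbo.SourceCustomer AS s;

    INSERT INTO NewDB.dbo.CustomerAddress (CustomerID, Street, City, PostalCode)
    SELECT s.cust_id, s.street, s.city, s.postal_cd
    FROM   OldDB.dbo.SourceCustomer AS s;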

    Is there a better way to approach this sort of thing? I'm open for suggestions or questions.

    I've attached a couple screen captures of the two different data flows to help illustrate.

    Thanks in advance!

  • Personally, I don't think I've ever used SSIS to move data from one SQL database to another, on the same server. I use T-SQL scripts/procs for that. It's usually a lot easier, and much faster too.
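
    For example, a minimal sketch of that kind of script (hypothetical table and column names; SQL Server 2005 has no MERGE, so it's an UPDATE of existing rows followed by an INSERT of new ones):

    -- Hypothetical sketch only: database/table/column names are placeholders.
    -- Update rows that already exist in the destination...
    UPDATE d
    SET    d.CustomerName = CAST(s.cust_name AS nvarchar(100))
    FROM   NewDB.dbo.Customer AS d
           INNER JOIN OldDB.dbo.SourceCustomer AS s
               ON s.cust_id = d.CustomerID;

    -- ...then insert the rows that don't exist yet.
    INSERT INTO NewDB.dbo.Customer (CustomerID, CustomerName)
    SELECT s.cust_id, CAST(s.cust_name AS nvarchar(100))
    FROM   OldDB.dbo.SourceCustomer AS s
    WHERE  NOT EXISTS (SELECT 1
                       FROM   NewDB.dbo.Customer AS d
                       WHERE  d.CustomerID = s.cust_id);

    Wrap something like that in a proc per destination table and you can call it whenever a source table gets refreshed.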

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Hi,

    I've never done a task like this, but here are my thoughts about it 🙂

    Since you load data from a source database into a new schema, maybe it's an idea to create views in the old database with the column definitions of the new database. You could have a view for each destination table, do your data conversion there (using CONVERT or CAST, if you're sure the conversion will work) and select from these views in your SSIS package. Then you only need the UPDATE/INSERT part of your data flow tasks (maybe you don't even need an SSIS package - a simple SQL script could do that job).
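
    Something like this (made-up names, just to show the idea):

    -- Hypothetical view in the old database, shaped like one destination table.
    CREATE VIEW dbo.vw_Customer
    AS
    SELECT s.cust_id                          AS CustomerID,
           CAST(s.cust_name AS nvarchar(100)) AS CustomerName,
           CAST(s.created_dt AS datetime)     AS CreatedDate
    FROM   dbo.SourceCustomer AS s;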

    Why do you have one package for each of the source tables? Maybe it helps to create one package with one data flow task for each of the source tables (or views) and let some of them run in parallel.

    Regards.

    Just out of interest, are you running this package (or packages) in Visual Studio (i.e. debug) or pushing them to the SSIS service and executing them via say SSMS or dtsrun/dtsrunui?

    If you're in VS, this is basically debug, and (unless you've turned something off somewhere) it will be writing temp files of every row passing through your data flows, in addition to writing it to the destination. It wouldn't be your entire perf issue, but could definitely make an impact.

    Steve.

  • Jetro (4/28/2009)


    Hi,

    Why do you have one package for each of the source tables? Maybe it helps to create one package with one data flow task for each of the source tables (or views) and let some of them run in parallel.

    Thanks for your input. I sure do appreciate it! The reason I have a separate package for each source table is that the customer has a requirement to update the destination database when one or more of the source tables gets refreshed. The requirement is to have this functionality through a secured web client. Not sure how to hook up the plumbing for that yet. One thing at a time.

    Thanks again,

    Bob

  • stevefromOZ (4/28/2009)


    Just out of interest, are you running this package (or packages) in Visual Studio (i.e. debug) or pushing them to the SSIS service and executing them via say SSMS or dtsrun/dtsrunui?

    Thanks for your input. I'm running them with the 64-bit dtexec.exe from the command line.

  • GSquared (4/28/2009)


    Personally, I don't think I've ever used SSIS to move data from one SQL database to another, on the same server. I use T-SQL scripts/procs for that. It's usually a lot easier, and much faster too.

    Interesting point. I'm going to stew on that a bit.

    Thanks!
