SSIS for Staging VERY LARGE Table Loads

  • Hello,

    I would like input from the community on very large table load strategies in SSIS. In our Oracle billing system we have some wide tables with over a billion rows and many more with over 500 million rows. Currently we stage these tables into our ODS using an SSIS ETL solution that loads the data in segments according to the partition scheme of the target table. This allows us to achieve maximum parallelism when loading the data. (See Article 1 & Article 2 #10.)

    Article 1:

    We Loaded 1TB in 30 Minutes with SSIS, and So Can You

    https://technet.microsoft.com/en-us/library/dd537533(v=sql.100).aspx

    Article 2:

    Top 10 SQL Server Integration Services Best Practices

    http://blogs.msdn.com/b/sqlcat/archive/2013/09/16/top-10-sql-server-integration-services-best-practices.aspx

    Truncate and Load tables

    We load large truncate-and-load tables by breaking the data up into segments that are aligned to the partitions of the target table. This allows us to load several segments of a table simultaneously, which gives us a higher throughput rate. This method is working in our production environment, but there is concern from upper management that "there MUST be a simpler way." I explain that this method is a recommendation from the SQL CAT team (see Article 2 #10 above), but that doesn't seem to resonate. Therefore I'm doing some research to see if there are any methods/options that could be explored.
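
    To make the segmentation concrete, here is a rough sketch of what one segment's source query looks like; the schema, table, and column names are made up for illustration, not our actual objects.

    -- Illustrative only: one segment of a partition-aligned extract from the Oracle source.
    -- Each parallel SSIS data flow runs one of these with its own boundary values and
    -- fast-loads the rows into the matching partition of the target table.
    SELECT  account_no, invoice_no, charge_amt, bill_cycle_dt   -- explicit column list, no SELECT *
    FROM    billing.invoice_detail
    WHERE   bill_cycle_dt >= DATE '2015-12-01'
    AND     bill_cycle_dt <  DATE '2016-01-01';  -- boundaries line up with one target partition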

    Incremental tables

    For tables that have no valid date column to use for detecting changed data, we load changed data by deleting the last 4 days from the target and reloading the last 4 days from the source. This is working, but it causes unnecessary reloading of previously committed data. You may ask, why not just load the previous day? This is due to our deployment process: in order to get something deployed to production, or even just have a parameter changed in production, it takes little short of an act of God. Therefore we chose 4 days to cover weekends, in case a connection failure meant the ETL didn't run, or the ETL failed partway through and left partial or no data loaded to the target.
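
    For reference, the current pattern is roughly the following on the target side (the table and column names are placeholders; the 4-day window is driven by a package parameter).

    -- Rough sketch of the current "reload the last N days" pattern (names are illustrative).
    DECLARE @DaysToLoad int = 4;
    DECLARE @CutoffDate date = DATEADD(DAY, -@DaysToLoad, CAST(GETDATE() AS date));

    -- Clear the window in the target first...
    DELETE FROM ods.dbo.CallDetail
    WHERE  LoadDate >= @CutoffDate;

    -- ...then the SSIS data flow pulls the same window from the source, e.g.
    --   SELECT ... FROM source_table WHERE load_date >= :cutoff_date
    -- and bulk inserts it into ods.dbo.CallDetail.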

    I'm looking to see if anyone else has had the same challenges with loading large datasets and what you did to overcome them. Below are some of our restrictions:

    1. We are not allowed to install anything on the source servers, such as third-party CDC agents. Access to the source production environment is heavily protected.

    2. Some of the large tables don't have valid date columns that we can use to pull changed data.

    3. We have a 6-hour load window in which all tables need to be loaded for the business by 7 AM EST.

    4. Our source data is available only via the database, which means we currently have no way on the source side to drop the data to files; we would have to build an SSIS ETL to do that. I know that it is generally faster to bulk load large data from flat files, but the question is whether the extra time spent getting the data to file would be made up for in the load to the target tables.

    Any insight, articles, working samples, I'm W-I-D-E open.

    I'm currently experimenting with the Microsoft BDD (Balanced Data Distributor), but so far it's not what I expected. I'm still testing.

    Letron

    SQL Server 2012

    Separate SSIS and Database Server

  • Nice, detailed post, well done. You've clearly done your homework on this one.

    Regarding the incremental load, have you considered using a pattern like this?

    - Query target table for MaxDate = max(datecreated,datemodified)

    - Select from source table where date >= MaxDate

    This would obviate the need for 'four-day loads' all the time.
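
    In rough T-SQL terms, something like this (table and column names are placeholders):

    -- Get the high-water mark from the target...
    DECLARE @MaxDate datetime;

    SELECT @MaxDate = MAX(CASE WHEN DateModified > DateCreated
                               THEN DateModified ELSE DateCreated END)
    FROM   ods.dbo.TargetTable;

    -- ...then the source extract becomes
    --   SELECT cols FROM source_table WHERE date_col >= @MaxDate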

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • Is the load time spread evenly across all of the processes, or is there one particular bottleneck?

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • Phil Parkin (1/11/2016)


    Regarding the incremental load, have you considered using a pattern like this?

    - Query target table for MaxDate = max(datecreated,datemodified)

    - Select from source table where date >= MaxDate

    This would obviate the need for 'four-day loads' all the time.

    Thanks for the reply Phil,

    We have considered just pulling the max date from the source. That would be the most efficient, but the only problem is that if a load failed one day, manual intervention would be needed to re-run the ETL for that day or to modify the "daysToLoad" parameter so that it pulls the extra day. However, as I'm typing this, I'm thinking that is probably the way to go. When we first launched our ODS a year ago, our prod ops team didn't have much experience with SQL Server and didn't want to perform any extra tasks for daily maintenance. This required us to try to build some "self-healing" into the loads. Now that they are a little more comfortable with the platform, we can have them monitor the daily loads and perform the actions as needed. I will definitely make this proposal.

    Letron

  • Phil Parkin (1/11/2016)


    Is the load time spread evenly across all of the processes, or is there one particular bottleneck?

    The load times vary depending on the table being loaded. As mentioned above, we use a segmentation method to break the tables up into segment loads based on the partition column. Some tables have very small record counts, so we don't bother segmenting those and bring them all over in one segment. Some tables have record counts in the hundreds of millions; we break those into smaller segments and load them in parallel via a queue system, as explained in Article 2 above. With that said, we may have segments with millions of rows that could take anywhere from 30 minutes to 2 hours to load.

    As far as bottlenecks are concerned, we have identified the source as a potential one. We can only load the data as fast as we can pull it from the source. This is why, in our segmentation method, we create an additional connection to the source for every segment load. This gives us the ability to pull multiple data streams from the source simultaneously.
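
    For anyone who has not read Article 2, the queue mechanism is, in simplified form, something like the sketch below. The schema, table, and column names are illustrative, not our actual objects.

    -- Simplified sketch of the segment queue that drives the parallel loads.
    CREATE TABLE etl.SegmentQueue
    (
        SegmentId   int IDENTITY(1,1) PRIMARY KEY,
        TableName   sysname      NOT NULL,
        RangeStart  datetime2(0) NOT NULL,   -- lower partition boundary
        RangeEnd    datetime2(0) NOT NULL,   -- upper partition boundary
        Status      tinyint      NOT NULL DEFAULT 0   -- 0 = queued, 1 = in progress, 2 = done
    );

    -- Each parallel package instance claims the next free segment like this,
    -- then builds its source query from the returned boundary values.
    UPDATE TOP (1) q
    SET    q.Status = 1
    OUTPUT inserted.SegmentId, inserted.RangeStart, inserted.RangeEnd
    FROM   etl.SegmentQueue q WITH (READPAST, UPDLOCK, ROWLOCK)
    WHERE  q.Status = 0;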

    Letron

    ...that would be the most efficient, but the only problem is that if a load failed one day, manual intervention would be needed to re-run the ETL for that day or to modify the "daysToLoad" parameter so that it pulls the extra day

    That's one of the attractions of my suggestion; you would not have to manually intervene in the event of a failure.

    Eg, if load succeeds on 9 Jan and fails on 10 Jan, then when 11 Jan load runs:

    - Select max(date) from target returns 9 Jan

    - Query for source data becomes

    select cols from source where date >= (9 Jan)

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • Phil Parkin (1/11/2016)


    ...that would be the most efficient, but the only problem is that if a load failed one day, manual intervention would be needed to re-run the ETL for that day or to modify the "daysToLoad" parameter so that it pulls the extra day

    That's one of the attractions of my suggestion; you would not have to manually intervene in the event of a failure.

    Eg, if load succeeds on 9 Jan and fails on 10 Jan, then when 11 Jan load runs:

    - Select max(date) from target returns 9 Jan

    - Query for source data becomes

    select cols from source where date >= (9 Jan)

    Ahh duhh! Right... We currently don't check the target for max date. We use today's date in the ETL and pull based on that. That makes more sense though. It's so simple it eluded us.

    You probably already know this, and you didn't specifically mention SCD or whether this is a MERGE versus a straight INSERT into the Destination, but avoid using the SSIS Slowly Changing Dimension task for merging anything other than a few thousand rows between Source and Destination, because it loops and performs a Lookup for each row (otherwise known as "RBAR"). Other third-party implementations of the SCD task (e.g. Pragmatic Works) perform batch comparisons between Source and Destination and perform orders of magnitude better, especially when selecting > 10,000 rows.

    http://dimensionmergescd.codeplex.com/

    http://pragmaticworks.com/Products/Task-Factory

    Even with a SELECT/INSERT you want to minimize the number of rows returned from the Source. It has already been suggested above by Phil that you leverage something like Max(Date) from the Destination to limit the extract to only those Source rows that have been inserted or updated since the previous load. A similar method is to implement a column of type ROWVERSION on each sourced table, which is automatically and sequentially incremented whenever a source row is inserted or updated. The advantage of ROWVERSION is that you're not depending on the application to maintain the column, which reduces the likelihood that updates are missed.

    Using rowversion with SQL Server Integration Services to reduce load times

    https://www.mssqltips.com/sqlservertip/3295/using-rowversion-with-sql-server-integration-services-to-reduce-load-times

    Of course, regardless of which datestamp or rowversion implementation you choose, remember to create an index on it, which will speed up the retrieval of rows from the Source.
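
    In outline, against a SQL Server source it would look something like this (the control table and names are illustrative; the mssqltips article above walks through the full pattern):

    -- The source table carries a rowversion column, e.g.:
    --   ALTER TABLE dbo.SourceTable ADD RowVer rowversion;

    DECLARE @LastVersion    binary(8),
            @CurrentVersion binary(8);

    SELECT @LastVersion = LastRowVersion
    FROM   etl.LoadControl
    WHERE  TableName = 'SourceTable';

    SELECT @CurrentVersion = MIN_ACTIVE_ROWVERSION();

    SELECT cols
    FROM   dbo.SourceTable
    WHERE  RowVer >= @LastVersion
    AND    RowVer <  @CurrentVersion;   -- excludes rows from in-flight transactions

    -- After a successful load, persist @CurrentVersion back to etl.LoadControl.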

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell (1/11/2016)


    You probably already know this, and you didn't specifically mention SCD or whether this is a MERGE versus a straight INSERT into the Destination, but avoid using the SSIS Slowly Changing Dimension task for merging anything other than a few thousand rows between Source and Destination, because it loops and performs a Lookup for each row (otherwise known as "RBAR"). Other third-party implementations of the SCD task (e.g. Pragmatic Works) perform batch comparisons between Source and Destination and perform orders of magnitude better, especially when selecting > 10,000 rows.

    http://dimensionmergescd.codeplex.com/

    http://pragmaticworks.com/Products/Task-Factory

    Even with a SELECT/INSERT you want to minimize the number of rows returned from the Source. It has already been suggested above by Phil that you leverage something like Max(Date) from the Destination to limit the extract to only those Source rows that have been inserted or updated since the previous load. A similar method is to implement a column of type ROWVERSION on each sourced table, which is automatically and sequentially incremented whenever a source row is inserted or updated. The advantage of ROWVERSION is that you're not depending on the application to maintain the column, which reduces the likelihood that updates are missed.

    Using rowversion with SQL Server Integration Services to reduce load times

    https://www.mssqltips.com/sqlservertip/3295/using-rowversion-with-sql-server-integration-services-to-reduce-load-times

    Of course, regardless of which datestamp or rowversion implementation you choose, remember to create an index on it, which will speed up the retrieval of rows from the Source.

    I do not think that Oracle implements ROWVERSION.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • Phil Parkin (1/11/2016)


    I do not think that Oracle implements ROWVERSION.

    Right, Phil... I think Eric meant to actually implement a physical column on the source tables. Oracle does have a pseudo-column called "rownum", but it only gets assigned a value at SELECT time. Unfortunately we would not be allowed to implement a rowversion column on the source, since it's so heavily controlled. It would be nice if we could.

    Letron

  • Eric M Russell (1/11/2016)


    You probably already know this, and you didn't specifically mention SCD or whether this is a MERGE versus a straight INSERT into the Destination, but avoid using the SSIS Slowly Changing Dimension task for merging anything other than a few thousand rows between Source and Destination, because it loops and performs a Lookup for each row (otherwise known as "RBAR"). Other third-party implementations of the SCD task (e.g. Pragmatic Works) perform batch comparisons between Source and Destination and perform orders of magnitude better, especially when selecting > 10,000 rows.

    http://dimensionmergescd.codeplex.com/

    http://pragmaticworks.com/Products/Task-Factory

    Thanks, Eric. That's good information. Our incremental tables so far are all append-only, so there will be no updates. However, on other updatable tables we are experimenting with replacing actual UPDATE statements with DELETE/INSERT statements based on the natural key. We are finding that the set-based DELETE/INSERT performs better than the set-based UPDATE. We are still testing, however.
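
    To show what I mean, here is roughly the shape of what we are testing (the table and key names are placeholders, not our actual schema).

    -- Sketch of the set-based DELETE/INSERT-by-natural-key pattern.
    BEGIN TRANSACTION;

    DELETE t
    FROM   ods.dbo.TargetTable t
    WHERE  EXISTS (SELECT 1
                   FROM   stage.dbo.IncomingRows s
                   WHERE  s.AccountNo = t.AccountNo
                   AND    s.InvoiceNo = t.InvoiceNo);   -- natural key

    INSERT INTO ods.dbo.TargetTable (AccountNo, InvoiceNo, Amount, LoadDate)
    SELECT AccountNo, InvoiceNo, Amount, LoadDate
    FROM   stage.dbo.IncomingRows;

    COMMIT TRANSACTION;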

    Letron

  • Phil Parkin (1/11/2016)


    ...

    I do not think that Oracle implements ROWVERSION.

    ROWVERSION (aka TIMESTAMP) is not ANSI standard, so it's implemented differently on each RDBMS, but each has a similar concept. For Oracle there is a pseudo-column called ORA_ROWSCN, which is not part of the table definition but is exposed by default and can be returned in a SELECT statement. I'm not sure, but I think the values assigned are sequential and not subject to change for a specific row unless there is a DML operation. If that's true, then it suits the purpose.

    ... For each row, ORA_ROWSCN returns the conservative upper bound system change number (SCN) of the most recent change to the row. This pseudocolumn is useful for determining approximately when a row was last updated...

    http://docs.oracle.com/cd/B19306_01/server.102/b14200/pseudocolumns007.htm
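
    In principle the extract would look something like this (untested sketch; the table name and bind variable are placeholders, and note that unless the table was created with ROWDEPENDENCIES the SCN is tracked per block rather than per row, so it can over-select):

    -- Pull rows changed since the SCN recorded after the previous load.
    SELECT cols, ORA_ROWSCN
    FROM   billing.source_table
    WHERE  ORA_ROWSCN > :last_scn_loaded;

    -- After a successful load, persist MAX(ORA_ROWSCN) for the next run.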

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • To address the situation of gap fill or other re-run situations I use a parameter table.

    One package to populate the parameters.

    One package to pick up the parameters and perform the source data extract.

    It seems like extra work to separate those out but stick with me.

    I also have a stored procedure that lets me manually set a parameter value. So for a gap fill I can set the parameters I want and let 'er rip.

    "But what about that 'get parameters' package? Won't it overwrite the ones you set manually?"

    Business logic my friend. The stored procedure updates a 'manual' flag field and the package has constraints with expressions so if there's a manual parameter it skips that part. And the extract will set the flag back to 0 after it completes.

    A gap fill is easy enough that way. Run the stored procedure to set your dates, then run the job again.
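
    In rough terms the moving parts look like this (simplified, and the names are made up, not my actual objects):

    -- Parameter table read by the extract package.
    CREATE TABLE etl.LoadParameter
    (
        TableName sysname NOT NULL PRIMARY KEY,
        LoadFrom  date    NOT NULL,
        LoadTo    date    NOT NULL,
        IsManual  bit     NOT NULL DEFAULT 0   -- 1 = set by hand; "get parameters" package skips it
    );
    GO

    -- Stored procedure for manual overrides (gap fills, re-runs).
    CREATE PROCEDURE etl.SetManualParameter
        @TableName sysname, @LoadFrom date, @LoadTo date
    AS
    BEGIN
        UPDATE etl.LoadParameter
        SET    LoadFrom = @LoadFrom, LoadTo = @LoadTo, IsManual = 1
        WHERE  TableName = @TableName;
    END;
    GO

    -- The extract package resets IsManual to 0 after it completes successfully.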

  • JustMarie (1/11/2016)


    To address the situation of gap fill or other re-run situations I use a parameter table.

    One package to populate the parameters.

    One package to pick up the parameters and perform the source data extract.

    It seems like extra work to separate those out but stick with me.

    I also have a stored procedure that lets me manually set a parameter value. So for a gap fill I can set the parameters I want and let 'er rip.

    "But what about that 'get parameters' package? Won't it overwrite the ones you set manually?"

    Business logic my friend. The stored procedure updates a 'manual' flag field and the package has constraints with expressions so if there's a manual parameter it skips that part. And the extract will set the flag back to 0 after it completes.

    A gap fill is easy enough that way. Run the stored procedure to set your dates, then run the job again.

    Thanks! Yeah, we are actually working on a similar utility that we can schedule on a daily basis to fill the gaps if they happen. Hopefully we can avoid them if we do what Phil suggests and use the max date from the target, which is actually a common DW practice. Sometimes common sense can escape me :o)

    Letron

  • Letron Brantley (1/12/2016)


    JustMarie (1/11/2016)


    To address the situation of gap fill or other re-run situations I use a parameter table.

    One package to populate the parameters.

    One package to pick up the parameters and perform the source data extract.

    It seems like extra work to separate those out but stick with me.

    I also have a stored procedure that lets me manually set a parameter value. So for a gap fill I can set the parameters I want and let 'er rip.

    "But what about that 'get parameters' package? Won't it overwrite the ones you set manually?"

    Business logic my friend. The stored procedure updates a 'manual' flag field and the package has constraints with expressions so if there's a manual parameter it skips that part. And the extract will set the flag back to 0 after it completes.

    A gap fill is easy enough that way. Run the stored procedures to get your date then run the job again.

    Thanks! Yeah, we are actually working on a similar utility that we can schedule on a daily basis to fill the gaps if they happen. Hopefully we can avoid them if we do what Phil suggests and use the max date from the target, which is actually a common DW practice. Sometimes common sense can escape me :o)

    Letron

    The base package to get parameters does go into the source and find the date of the most recent records. So it runs to keep things going. But when they need something special this is a way to get it done with a minimum of fuss for us and the DBAs.
