August 17, 2011 at 10:49 am
I have been tasked with improving the speed of a data pull from iSeries DB2 V5R4 into a stage table on SQL Server 2008 R2. I've been working with SQL Server and SSIS for about 6 months. I have very limited knowledge of iSeries DB2 and we don't have a DB2 DBA in house. I have, however, worked with relational databases and data warehousing for several years, so I'm applying tuning techniques from my past experience to this problem. I'm not making much headway, so I'm hoping some of you might have suggestions of other things to try before I throw my hands up and ask for some outside DB2 help.
Currently the data pull is accomplished via an OLE DB Source that uses an OLE DB connection defined to use the "Native OLE DB\IBM DB2 UDB for iSeries IBMDA400 OLE DB Provider". The OLE DB Source uses a SQL command to pull a subset of columns from the source table: 53 columns with an estimated row size of 220 bytes. The source table contains over 74 million rows and is growing. We have to pull all rows from this table, as there is currently no way to detect changes on the source system. Right now the execution time is a little over 10 hours.
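For a rough sense of scale, if the 220-byte row estimate is right, that works out to:

74,000,000 rows × 220 bytes/row ≈ 16.3 GB
74,000,000 rows / 10 hours ≈ 2,055 rows per second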
I've set up a test SSIS package data flow that pulls 601,179 rows from the iSeries source using an OLE DB source (IBMDA400 provider) and feeds it to a Multicast component. The baseline execution time for this package is 3 minutes 49 seconds. Using this package I've tested the following:
1) Added WITH NC to the query. This should have a similar "read without locking" effect to the T-SQL table hint WITH (NOLOCK). There was no change to the execution time. I left WITH NC in the query for all subsequent tests (see the example query after this list).
2) Changed the provider to IBMDASQL and IBMDARLA. IBMDASQL was 4 seconds faster than the baseline; IBMDARLA was slower than the other two providers. I left the OLE DB connection provider as IBMDASQL in subsequent tests.
3) Changed the Block Size property of the OLE DB connection from 0. I tried block sizes of 256, 512, and 8192. All executions were slower than leaving the Block Size at 0.
4) Used an ADO.NET connection with ODBC to pull the data. This was about 3 times slower than using the IBMDASQL OLE DB provider.
5) Used an ADO.NET connection with the .NET provider for OLE DB (IBMDASQL) to pull the data. This was about 30 seconds slower than the baseline.
6) Split the data flow into 4 different data flows, each pulling a subset of the 601,179 rows (see the range-split sketch after this list). The source column used in the WHERE clause is numeric and is the leading column of an index defined on the source table. I used an OLE DB connection with the IBMDASQL provider. There was no difference in execution time versus a single data flow using the same connection (#2 above).
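To make #1 and #6 concrete, here is roughly what the source queries look like. The library, table, column names, and key boundaries below are invented for illustration; the WITH NC isolation clause and the range-split pattern are the only real points. The single-flow query:

SELECT COL1, COL2, COL3
FROM MYLIB.BIGTABLE
WITH NC

And the four range-split variants, each bounded on the leading index column:

SELECT COL1, COL2, COL3 FROM MYLIB.BIGTABLE WHERE KEYCOL >= 0 AND KEYCOL < 150000 WITH NC
SELECT COL1, COL2, COL3 FROM MYLIB.BIGTABLE WHERE KEYCOL >= 150000 AND KEYCOL < 300000 WITH NC
SELECT COL1, COL2, COL3 FROM MYLIB.BIGTABLE WHERE KEYCOL >= 300000 AND KEYCOL < 450000 WITH NC
SELECT COL1, COL2, COL3 FROM MYLIB.BIGTABLE WHERE KEYCOL >= 450000 WITH NC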
One of the things I noticed while watching the package execute is that with the OLE DB connection the displayed row count changed by 9,972 rows every 4 seconds. With the ADO.NET connection using the .NET OLE DB provider, the count changed by 9,995 rows every 4 seconds. And with the ADO.NET connection over ODBC, the display changed by 9,995 rows every 10 seconds. So I'm wondering if there is something else to configure on the OLE DB and ADO.NET connections that would change how many rows are pulled at a time and how often?
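I did notice that the per-refresh counts are suspiciously close to 10,000, which happens to be the default of the data flow's DefaultBufferMaxRows property, so maybe that (together with DefaultBufferSize) is the knob I'm looking for. Both are still at their defaults in my package:

DefaultBufferMaxRows = 10000      (rows per buffer; SSIS default)
DefaultBufferSize    = 10485760   (bytes per buffer; SSIS default of 10 MB)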
I would appreciate hearing any ideas on ways to pull 74+ million rows from iSeries into SQL Server in less than 10 hours. This is my first posting, so if I have left anything out or should format this differently, please let me know.
August 18, 2011 at 8:04 am
I would do one thing first: take your source and route it to a Row Count transform and nothing else, then see how long that takes under each of your source configurations. That way you remove the influence of any other transform on how long the entire package takes.
I had one package with a very similar issue. Weeks of troubleshooting to improve load times yielded very little substantive improvement. Then, out of curiosity, I tried dropping one of the nonclustered indexes on my destination, and the speed improvement was overwhelming: I went from roughly a 10-hour load to about a 10-minute load. The lesson I learned is that I can't always blame the source for bad performance.
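If your load pattern allows it, the change can be as simple as this (the index, table, and column names below are invented for the example):

-- drop the nonclustered index before the load...
DROP INDEX IX_StageTable_KeyCol ON dbo.StageTable;

-- ...run the SSIS load here, then put the index back
CREATE NONCLUSTERED INDEX IX_StageTable_KeyCol ON dbo.StageTable (KeyCol);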
Good luck.
August 18, 2011 at 8:20 am
We are doing the exact same thing here daily. We found that for larger tables, which in our case means 46 million rows, we had to go to SSIS packages. They load more efficiently than the OPENROWSET approach we use for most other tables.
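For what it's worth, the OPENROWSET pull for the smaller tables looks roughly like this. The server name, credentials, library, and table below are placeholders, it assumes the Microsoft DB2OLEDB provider, and it requires the 'Ad Hoc Distributed Queries' option to be enabled on the SQL Server side:

INSERT INTO dbo.StageTable
SELECT src.*
FROM OPENROWSET('DB2OLEDB',
    'Data Source=MYSYSTEM;User ID=xxxx;Password=xxxx;Network Address=MYSYSTEM;Initial Catalog=MYCAT;',
    'SELECT * FROM MYLIB.MYTABLE') AS src;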
Steve Jimmo
Sr DBA
“If we ever forget that we are One Nation Under God, then we will be a Nation gone under." - Ronald Reagan
August 18, 2011 at 8:42 am
After some additional research I tried changing the OLE DB provider from IBM's to the Microsoft OLE DB Provider for DB2. Using the Microsoft driver, the test package ran in 1 minute 20 seconds instead of the 3 minutes 49 seconds with the IBMDA400 provider. Almost a 3 times improvement! I had to use the advanced properties of the OLE DB connection to get the parameters set up correctly, but it seems to be blazing fast now. The production load is scheduled to execute later today, so we are making the driver change now and keeping our fingers crossed.
Here's the connection string from the properties of the Microsoft provider OLE DB connection.
Data Source=QS1042528;User ID=xxxx;Initial Catalog=SAM400;Provider=DB2OLEDB;Network Address=QS1042528;Package Collection=PRODLIB;Default Schema=PRODLIB;
August 18, 2011 at 8:55 am
The only other suggestion I can add, over and above what was already said, is to do as little formatting as possible during the load.
Make all your destination columns varchar; you'd rather do all formatting and validation on the SQL side.
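For example, a staging table along these lines (column names made up here), with the CAST/CONVERT and validation happening in a later T-SQL step:

CREATE TABLE dbo.StageOrders (
    OrderNumber varchar(20),
    OrderDate   varchar(26),
    Amount      varchar(30)
);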
As Daniel said, try to eliminate all indexes on your destination tables. I have found that even an identity column in a staging table can impact performance significantly.
August 18, 2011 at 3:02 pm
We do have both an identity column and indexes on the target table. I'll look at those next.
By using the Microsoft OLE DB for DB2 provider the production process execution time went from 10+ hours to 5 hours and 40 minutes to load 74,900,000 rows. Yay!!
Thanks for your suggestions.
November 21, 2011 at 1:14 pm
I'm going to be converting a large amount of data from an AS400 DB2 iSeries to SQL Server.
I will be moving part of the data based on when the contract is renewed.
I want to make sure that I minimize the probability of locking the DB2 tables and interfering with any other processes. I'm relatively new here and things are a bit scattered, so I don't know everything that is going on.
It would also be nice to identify what the indexes are on the AS400 tables, but that is not possible because the database is proprietary, resides at the vendor's site, and we have minimal permissions.
Any suggestions in addition to what was already suggested would be greatly appreciated.
For better, quicker answers on T-SQL questions, click on the following...
http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better answers on performance questions, click on the following...
http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
May 9, 2014 at 2:03 pm
nmccormick (8/18/2011)
We do have both an identity column and indexes on the target table. I'll look at those next. By using the Microsoft OLE DB for DB2 provider the production process execution time went from 10+ hours to 5 hours and 40 minutes to load 74,900,000 rows. Yay!!
Thanks for your suggestions.
We used the IBM iSeries driver and were able to load a 79,000,000-row GL table in about 4.5 hours. So I would say you are getting pretty good performance, given the fact that you are moving close to half a terabyte of data.