July 23, 2018 at 7:52 am
I have created a package for our company that takes all the data from a table in our ERP production environment (~18 million rows, 168 fields) and copies it into our staging environment, so that when we apply all of our transformations on the data for our warehouse we are not competing with our ERP environment. The package currently takes 45-50 minutes to move the data, and I am trying to cut that time down as much as possible.
The server hosting the source database is running with:
CPU: Intel(R) Xeon(R) CPU (2.93GHz - 4 processors)
Windows Server 2012 R2 Datacenter
16GB of RAM
The destination:
CPU: AMD Opteron(TM) Processor 6234 (2.40GHz - 4 processors)
Windows Server 2016 Datacenter
16GB of RAM
I have already tried playing with the Rows per batch, the Maximum insert commit size (our destination DB is in simple recovery, so I don't think this should change much since we are not relying on the transaction log), and the AutoAdjustBufferSize property, and nothing has really changed. I am thinking that the issue lies with our destination server's capabilities, since most of the time is coming from the write transactions.
Any tips would be much appreciated, thanks in advance!
July 23, 2018 at 7:59 am
Just how much data is in 18 million rows of 168 columns? What speed is the network path between the servers? How much RAM and CPU is on the source server?
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)
July 23, 2018 at 8:10 am
Does your target table contain any indexes or keys (primary or foreign)?
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
July 23, 2018 at 8:12 am
sgmunson - Monday, July 23, 2018 7:59 AM
Just how much data is in 18 million rows of 168 columns? What speed is the network path between the servers? How much RAM and CPU is on the source server?
The properties for the table:
Data space: 27GB
Index space: 6.8GB
And I apologize, the above server configuration was for the source server (using MS SQL Server Standard (64-bit)).
The destination server has the same RAM and number of processors but is running Microsoft SQL Server Developer (64-bit) on Windows Server 2016 Datacenter (10.0).
The processor for the destination is an AMD Opteron(TM) Processor 6234 (2.40GHz).
July 23, 2018 at 8:19 am
Phil Parkin - Monday, July 23, 2018 8:10 AM
Does your target table contain any indexes or keys (primary or foreign)?
The destination table does not have any keys; our staging environment uses a truncate/reload method (a rough sketch of that pattern is below).
The source table contains a non-clustered primary key on the ROWID.
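For readers following along, here is a minimal T-SQL sketch of that truncate/reload pattern. The object names (stg.ErpTable, SourceServer, ErpDb) are made up for illustration; in the actual package the insert is performed by the data flow (OLE DB Source feeding an OLE DB Destination), not a single INSERT statement.

-- Sketch only: object names are assumptions, not the OP's actual schema.
-- Step 1: empty the staging table. TRUNCATE is minimally logged and cheap.
TRUNCATE TABLE stg.ErpTable;

-- Step 2: full reload. The SSIS data flow does this row-by-buffer; the set-based
-- equivalent on the destination server, if a linked server existed, would be roughly:
INSERT INTO stg.ErpTable WITH (TABLOCK)   -- TABLOCK + heap + simple recovery => minimal logging
SELECT *
FROM [SourceServer].[ErpDb].[dbo].[ErpTable];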
July 23, 2018 at 8:28 am
So is your package as simple as just a
truncate target
followed by a simple source/target dataflow?
Or does it contain any other transformations?
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
July 23, 2018 at 8:36 am
smantia2292 - Monday, July 23, 2018 8:12 AM
sgmunson - Monday, July 23, 2018 7:59 AM
Just how much data is in 18 million rows of 168 columns? What speed is the network path between the servers? How much RAM and CPU is on the source server?
The properties for the table:
Data space: 27GB
Index space: 6.8GB
And I apologize, the above server configuration was for the source server (using MS SQL Server Standard (64-bit)).
The destination server has the same RAM and number of processors but is running Microsoft SQL Server Developer (64-bit) on Windows Server 2016 Datacenter (10.0).
The processor for the destination is an AMD Opteron(TM) Processor 6234 (2.40GHz).
Useful info, but what about the network path? What is the slowest network speed at any point in between the two servers?
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)
July 23, 2018 at 8:40 am
Phil Parkin - Monday, July 23, 2018 8:28 AM
So is your package as simple as just a
truncate target
followed by a simple source/target dataflow?
Or does it contain any other transformations?
Yes, that is pretty much it!
July 23, 2018 at 10:15 am
Rows per batch should be set to the total number of rows on the table.
Set the table lock checkbox on the staging table
Set the maximum insert commit size to 2147483647
If you have any indexes on the staging table, it would be best to drop them before performing this import operation and then recreate them at the end.
For the Access mode use OpenRowset Using FastLoad.
Once you have this initial full load in place, you should then only import data that has changed on the table. This is possible if you store a datestamp column on the table that indicates when the row was last changed (a rough sketch of that pattern follows at the end of this post).
All that said it might not be possible to increase the speed a lot more if you are already at the limits of your hardware.
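A rough T-SQL sketch of that datestamp-based incremental load, purely to illustrate the idea. The object and column names (stg.ErpTable, dbo.ErpTable, ROWID, LastChangedDate) are assumptions; in the package the watermark would typically be read into an SSIS variable, the filtered SELECT would become the OLE DB Source query, and the DELETE would be an Execute SQL Task on the destination.

-- Sketch only: names are assumptions. ROWID is the source table's primary key
-- (per the earlier post); LastChangedDate is the hypothetical "last changed" datestamp.

-- 1. Find the high-water mark already present in staging.
DECLARE @LastLoaded datetime2 =
    (SELECT ISNULL(MAX(LastChangedDate), '19000101') FROM stg.ErpTable);

-- 2. Remove the current staging copies of any rows that have changed since then,
--    so the reload below does not create duplicates (runs on the destination).
DELETE s
FROM stg.ErpTable AS s
WHERE EXISTS (SELECT 1
              FROM dbo.ErpTable AS e      -- source table; reached via linked server or the data flow in practice
              WHERE e.ROWID = s.ROWID
                AND e.LastChangedDate > @LastLoaded);

-- 3. Pull only the new/changed rows. In SSIS this WHERE clause goes into the
--    OLE DB Source query, parameterised with the watermark.
INSERT INTO stg.ErpTable WITH (TABLOCK)
SELECT *
FROM dbo.ErpTable
WHERE LastChangedDate > @LastLoaded;

-- If the staging table ever gains nonclustered indexes, disabling each one before a
-- big load and rebuilding it afterwards is usually cheaper than maintaining it row by row:
--   ALTER INDEX IX_ErpTable_Something ON stg.ErpTable DISABLE;
--   ... load ...
--   ALTER INDEX IX_ErpTable_Something ON stg.ErpTable REBUILD;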
July 23, 2018 at 10:33 am
Jonathan AC Roberts - Monday, July 23, 2018 10:15 AM
Rows per batch should be set to the total number of rows on the table.
Regardless of the number of rows to be committed?
Can you justify this, please?
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
July 24, 2018 at 4:18 am
Phil Parkin - Monday, July 23, 2018 10:33 AM
Jonathan AC Roberts - Monday, July 23, 2018 10:15 AM
Rows per batch should be set to the total number of rows on the table.
Regardless of the number of rows to be committed?
Can you justify this, please?
The entire table is being imported.
July 24, 2018 at 4:59 am
Jonathan AC Roberts - Tuesday, July 24, 2018 4:18 AM
Phil Parkin - Monday, July 23, 2018 10:33 AM
Jonathan AC Roberts - Monday, July 23, 2018 10:15 AM
Rows per batch should be set to the total number of rows on the table.
Regardless of the number of rows to be committed?
Can you justify this, please?
The entire table is being imported.
So do you think that all of those 18M rows will fit in 16GB of (shared) RAM? Unlikely, I would say. So they're going to spill to tempdb (I think that's what happens, anyway). Having a lower rows per batch (and corresponding max insert commit size) allows things to flow into the target table without the need for spills. But testing is required, as there are several other areas where bottlenecks can occur.
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
July 24, 2018 at 5:22 am
Phil Parkin - Tuesday, July 24, 2018 4:59 AM
So do you think that all of those 18M rows will fit in 16GB of (shared) RAM? Unlikely, I would say. So they're going to spill to tempdb (I think that's what happens, anyway). Having a lower rows per batch (and corresponding max insert commit size) allows things to flow into the target table without the need for spills. But testing is required, as there are several other areas where bottlenecks can occur.
I think the rows are written using a data pump behind the scenes. If Rows per batch is set to the number of rows on the table and the load fails midway through, the entire transaction would be rolled back. That can actually be an advantage: if the load fails when the batch size is set to less than the number of rows on the table, then when the job is restarted it would import the same rows again and you would get duplicates, unless you do a truncate first, because SSIS checkpoint files will make it restart from the same control flow task. It would use more memory, but as you say, testing is required and maybe a lower value is needed.