January 30, 2017 at 9:24 pm
Comments posted to this topic are about the item Loading Non Duplicated Data Using SSIS
January 30, 2017 at 10:34 pm
The problem is scalability. Aggregations are blocking transforms: they block the data flow in the pipeline and steadily consume memory, because they hold the entire result set in memory to perform the aggregation.
When tasked with millions or trillions of records, things start to get ugly: your servers run out of memory, paging to disk sets in, and ultimately the package slows to a crawl.
The simplest approach in these cases is to dump everything into a staging/temporary SQL Server table and then let T-SQL handle it. SQL Server is built for set-based data manipulation and management; the .NET objects that an SSIS package ultimately translates to are not.
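At its simplest, the T-SQL side could look something like this rough sketch (the staging and target tables and the key columns here are invented purely for illustration, not taken from any real package):

-- Sketch only: keep one row per business key from a hypothetical staging table
-- and insert the survivors into the target table. All names are made up.
;WITH Ranked AS
(
    SELECT  FirstName,
            LastName,
            Email,
            ROW_NUMBER() OVER (PARTITION BY FirstName, LastName, Email
                               ORDER BY (SELECT NULL)) AS rn
    FROM    dbo.StagingCustomer
)
INSERT INTO dbo.Customer (FirstName, LastName, Email)
SELECT  FirstName, LastName, Email
FROM    Ranked
WHERE   rn = 1;

Whether you rank with ROW_NUMBER(), use SELECT DISTINCT, or GROUP BY depends on whether non-key columns also need to be reconciled.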
Thanks & Regards,
Nakul Vachhrajani.
http://nakulvachhrajani.com
Follow me on
Twitter: @sqltwins
January 30, 2017 at 11:18 pm
It is easier to use the Sort transformation. Check the box 'Remove rows with duplicate sort values'. It is a blocking transformation, but it works fast even with millions of rows! Writing data to a temporary table is also blocking, because the work must be done before it is possible to continue. And writing and reading data to and from TempDB takes time!
January 30, 2017 at 11:38 pm
csj - Monday, January 30, 2017 11:18 PM:
It is easier to use the Sort transformation. Check the box 'Remove rows with duplicate sort values'. It is a blocking transformation, but it works fast even with millions of rows! Writing data to a temporary table is also blocking, because the work must be done before it is possible to continue. And writing and reading data to and from TempDB takes time!
Yes, for smaller datasets (a couple of hundred rows), Sort is fine.
For larger datasets, we run into the same problem - Sort is a blocking transformation and therefore can become a performance bottleneck.
Thanks & Regards,
Nakul Vachhrajani.
http://nakulvachhrajani.com
Follow me on
Twitter: @sqltwins
January 30, 2017 at 11:51 pm
Nakul Vachhrajani - Monday, January 30, 2017 10:34 PM:
The problem is scalability. Aggregations are blocking transforms: they block the data flow in the pipeline and steadily consume memory, because they hold the entire result set in memory to perform the aggregation. When tasked with millions or trillions of records, things start to get ugly: your servers run out of memory, paging to disk sets in, and ultimately the package slows to a crawl.
The simplest approach in these cases is to dump everything into a staging/temporary SQL Server table and then let T-SQL handle it. SQL Server is built for set-based data manipulation and management; the .NET objects that an SSIS package ultimately translates to are not.
I have to ask: when was the last time you had "trillions of records" in a staging table to load into another table?
--Jeff Moden
Change is inevitable... Change for the better is not.
January 30, 2017 at 11:54 pm
Nakul Vachhrajani - Monday, January 30, 2017 11:38 PM:
csj - Monday, January 30, 2017 11:18 PM:
It is easier to use the Sort transformation. Check the box 'Remove rows with duplicate sort values'. It is a blocking transformation, but it works fast even with millions of rows! Writing data to a temporary table is also blocking, because the work must be done before it is possible to continue. And writing and reading data to and from TempDB takes time!
Yes, for smaller datasets (a couple of hundred rows), Sort is fine.
For larger datasets, we run into the same problem - Sort is a blocking transformation and therefore can become a performance bottleneck.
So post the T-SQL alternative. 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
January 30, 2017 at 11:58 pm
Jeff Moden - Monday, January 30, 2017 11:51 PM:
Nakul Vachhrajani - Monday, January 30, 2017 10:34 PM:
The problem is scalability. Aggregations are blocking transforms: they block the data flow in the pipeline and steadily consume memory, because they hold the entire result set in memory to perform the aggregation. When tasked with millions or trillions of records, things start to get ugly: your servers run out of memory, paging to disk sets in, and ultimately the package slows to a crawl.
The simplest approach in these cases is to dump everything into a staging/temporary SQL Server table and then let T-SQL handle it. SQL Server is built for set-based data manipulation and management; the .NET objects that an SSIS package ultimately translates to are not.
I have to ask: when was the last time you had "trillions of records" in a staging table to load into another table?
I encountered this as part of the product we are currently working on, where we need to migrate enterprise data from a legacy on-premises application to the cloud. Sorts and Aggregations in SSIS became performance killers, and we had to resort to doing all the data computation in staging (an on-premises SQL Server database) using T-SQL (joins and the rest).
At a simplistic level, we:
- Extract: all data extracted to staging as-is
- Transform: all cleansing and transformations done using T-SQL (see the sketch below)
- Load: SSIS used only to blindly take data from staging and pump it to the cloud platform
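For illustration only, the de-duplication part of such a transform step might look roughly like this; the staging table, key columns and tie-breaking column are invented for the example, not the actual product code:

-- Sketch only: delete duplicate rows in place in a hypothetical staging table,
-- keeping the most recently loaded row per business key. Names are illustrative.
;WITH Dupes AS
(
    SELECT  ROW_NUMBER() OVER (PARTITION BY CustomerCode, SourceSystem
                               ORDER BY LoadedAtUtc DESC) AS rn
    FROM    stg.CustomerExtract
)
DELETE FROM Dupes
WHERE   rn > 1;

After that, the SSIS load step can treat the staging table as already clean.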
Thanks & Regards,
Nakul Vachhrajani.
http://nakulvachhrajani.com
Follow me on
Twitter: @sqltwins
January 31, 2017 at 3:05 am
Nakul Vachhrajani - Monday, January 30, 2017 11:38 PM:
csj - Monday, January 30, 2017 11:18 PM:
It is easier to use the Sort transformation. Check the box 'Remove rows with duplicate sort values'. It is a blocking transformation, but it works fast even with millions of rows! Writing data to a temporary table is also blocking, because the work must be done before it is possible to continue. And writing and reading data to and from TempDB takes time!
Yes, for smaller datasets (a couple of hundred rows), Sort is fine.
For larger datasets, we run into the same problem - Sort is a blocking transformation and therefore can become a performance bottleneck.
I do not know what computer you are using, but on my small Surface it takes less than half a minute for more than 6 million rows. On a small server, less than 10 seconds!
Carsten
January 31, 2017 at 5:19 am
csj - Tuesday, January 31, 2017 3:05 AM:
Nakul Vachhrajani - Monday, January 30, 2017 11:38 PM:
csj - Monday, January 30, 2017 11:18 PM:
It is easier to use the Sort transformation. Check the box 'Remove rows with duplicate sort values'. It is a blocking transformation, but it works fast even with millions of rows! Writing data to a temporary table is also blocking, because the work must be done before it is possible to continue. And writing and reading data to and from TempDB takes time!
Yes, for smaller datasets (a couple of hundred rows), Sort is fine.
For larger datasets, we run into the same problem - Sort is a blocking transformation and therefore can become a performance bottleneck.
I do not know what computer you are using, but on my small Surface it takes less than half a minute for more than 6 million rows. On a small server, less than 10 seconds!
Carsten
Keep in mind, this is not only about the number of rows, but also about the number of columns and the type of data to be processed.
I have never encountered data to be imported where it was simple and easy to identify the duplicates and remove them.
And even though my datasets have been under a few thousand rows, it took more than a few minutes to process them on average.
January 31, 2017 at 6:48 am
My 6 million rows have 11 columns - Firstname, Lastname, Street, Zipcode, .... - with the right datatypes, and it takes 8 seconds from a table and 6.5 seconds from a flat file with no datatypes changed, so all columns are strings of length 50.
If I take the columns from the table twice, so that I am sorting on 22 columns, it takes 14 seconds. My computer is a desktop with 32 GB RAM and 2 quad-core CPUs.
January 31, 2017 at 7:16 am
carsten.saastamoinen - Tuesday, January 31, 2017 6:48 AM:
My 6 million rows have 11 columns - Firstname, Lastname, Street, Zipcode, .... - with the right datatypes, and it takes 8 seconds from a table and 6.5 seconds from a flat file with no datatypes changed, so all columns are strings of length 50. If I take the columns from the table twice, so that I am sorting on 22 columns, it takes 14 seconds. My computer is a desktop with 32 GB RAM and 2 quad-core CPUs.
So let me get this straight: you are running a simple 11-column query, with no datatype matching/conversion or transformation, on a 32 GB, dual-CPU, 8-core monster (which, if you count hyper-threading, could present at least double that vCPU count),
and you feel that your processing time is a typical scenario?
Well, I did a test run of a simple SELECT on a 5-column table, with all columns varchar(50) except the last, which is varchar(8000), and a record count of 6.7 million rows,
and it takes 36 seconds just to select the data on a SQL VM with 4 CPUs @ 2.40 GHz and 8 GB RAM.
January 31, 2017 at 7:46 am
vl1969-734655 - Tuesday, January 31, 2017 7:16 AM:
carsten.saastamoinen - Tuesday, January 31, 2017 6:48 AM:
My 6 million rows have 11 columns - Firstname, Lastname, Street, Zipcode, .... - with the right datatypes, and it takes 8 seconds from a table and 6.5 seconds from a flat file with no datatypes changed, so all columns are strings of length 50. If I take the columns from the table twice, so that I am sorting on 22 columns, it takes 14 seconds. My computer is a desktop with 32 GB RAM and 2 quad-core CPUs.
So let me get this straight: you are running a simple 11-column query, with no datatype matching/conversion or transformation, on a 32 GB, dual-CPU, 8-core monster (which, if you count hyper-threading, could present at least double that vCPU count),
and you feel that your processing time is a typical scenario?
Well, I did a test run of a simple SELECT on a 5-column table, with all columns varchar(50) except the last, which is varchar(8000), and a record count of 6.7 million rows,
and it takes 36 seconds just to select the data on a SQL VM with 4 CPUs @ 2.40 GHz and 8 GB RAM.
We are talking about a staging DB - so typically a BI solution. 32 GB and 8 cores is not a monster but a small server, especially when we are discussing loading millions of rows. And it is not a normal scenario to find duplicates on VARCHAR(8000) columns; the more normal scenario is to find duplicates on 22 columns with different datatypes - a mix of VARCHAR(20), VARCHAR(40), INT, SMALLINT, DATE, ..... So 14 seconds for finding duplicates on 22 columns is the more realistic scenario.
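For what it's worth, identifying those duplicate groups in T-SQL is essentially a GROUP BY over the key columns; the table and column names below are invented for the example:

-- Sketch: list duplicate groups across a wide, mixed-datatype business key.
-- Table and column names are illustrative assumptions only.
SELECT  FirstName, LastName, Street, ZipCode, BirthDate, CustomerType,
        COUNT(*) AS DuplicateCount
FROM    dbo.CustomerStaging
GROUP BY FirstName, LastName, Street, ZipCode, BirthDate, CustomerType
HAVING  COUNT(*) > 1;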
January 31, 2017 at 7:51 am
I appreciate learning about new widgets in SSIS as I'm still ramping up with the tool. This and the Sort technique may come in handy someday and I thank the posters for these tips.
Performance is something you can argue about until the cows come home. It's good to keep in mind, but it will be different for almost every situation.
January 31, 2017 at 3:44 pm
Nakul Vachhrajani - Monday, January 30, 2017 11:58 PM:
Jeff Moden - Monday, January 30, 2017 11:51 PM:
Nakul Vachhrajani - Monday, January 30, 2017 10:34 PM:
The problem is scalability. Aggregations are blocking transforms: they block the data flow in the pipeline and steadily consume memory, because they hold the entire result set in memory to perform the aggregation. When tasked with millions or trillions of records, things start to get ugly: your servers run out of memory, paging to disk sets in, and ultimately the package slows to a crawl.
The simplest approach in these cases is to dump everything into a staging/temporary SQL Server table and then let T-SQL handle it. SQL Server is built for set-based data manipulation and management; the .NET objects that an SSIS package ultimately translates to are not.
I have to ask: when was the last time you had "trillions of records" in a staging table to load into another table?
I encountered this as part of the product we are currently working on, where we need to migrate enterprise data from a legacy on-premises application to the cloud. Sorts and Aggregations in SSIS became performance killers, and we had to resort to doing all the data computation in staging (an on-premises SQL Server database) using T-SQL (joins and the rest).
At a simplistic level, we:
- Extract: all data extracted to staging as-is
- Transform: all cleansing and transformations done using T-SQL
- Load: SSIS used only to blindly take data from staging and pump it to the cloud platform
Understood and agreed, but "trillions" of rows in a staging table?
--Jeff Moden
Change is inevitable... Change for the better is not.
January 31, 2017 at 9:56 pm
Shifting gears a bit, I wouldn't use SSIS for any of this other than maybe as a process controller. Between things like the IGNORE_DUP_KEY option as part of an index definition, some very fast methods for identifying duplicates in T-SQL, and possibly taking advantage of some high-speed, minimally logged load techniques, I'd write a stored procedure for this.
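As a rough sketch of that idea (IGNORE_DUP_KEY and the TABLOCK hint are real SQL Server features, but the tables and columns here are invented, and minimal logging only applies under the right recovery model and conditions):

-- Sketch only: a hypothetical target table whose unique clustered index silently
-- discards duplicate keys on insert, loaded from an assumed staging table.
CREATE TABLE dbo.CustomerClean
(
    CustomerCode varchar(20) NOT NULL,
    FirstName    varchar(50) NULL,
    LastName     varchar(50) NULL
);

CREATE UNIQUE CLUSTERED INDEX UX_CustomerClean_CustomerCode
    ON dbo.CustomerClean (CustomerCode)
    WITH (IGNORE_DUP_KEY = ON);   -- duplicate keys are dropped with a warning, not an error

-- dbo.CustomerStaging is assumed to already exist and hold the raw rows.
-- TABLOCK can enable minimal logging when the other prerequisites are met.
INSERT INTO dbo.CustomerClean WITH (TABLOCK) (CustomerCode, FirstName, LastName)
SELECT  CustomerCode, FirstName, LastName
FROM    dbo.CustomerStaging;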
--Jeff Moden
Change is inevitable... Change for the better is not.