January 30, 2017 at 9:24 pm
Comments posted to this topic are about the item Loading Non Duplicated Data Using SSIS
January 30, 2017 at 10:34 pm
The problem is scalability. Aggregations are blocking transforms: they block the data flow in the pipeline and steadily consume memory, because they hold the entire result set in memory to perform the aggregation.
When tasked with millions or trillions of records, things start to get ugly: your servers run out of memory, paging to disk sets in, and ultimately the package slows to a crawl.
The simplest approach in these cases is to dump everything into a staging/temporary SQL Server table and then let T-SQL handle it. SQL Server is built for set-based data manipulation and management; the .NET objects that an SSIS package ultimately translates to are not.
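At its simplest, the T-SQL side could look something like this rough sketch (the staging and target tables and the key columns here are invented purely for illustration, not taken from any real package):

-- Sketch only: keep one row per business key from a hypothetical staging table
-- and insert the survivors into the target table. All names are made up.
;WITH Ranked AS
(
    SELECT  FirstName,
            LastName,
            Email,
            ROW_NUMBER() OVER (PARTITION BY FirstName, LastName, Email
                               ORDER BY (SELECT NULL)) AS rn
    FROM    dbo.StagingCustomer
)
INSERT INTO dbo.Customer (FirstName, LastName, Email)
SELECT  FirstName, LastName, Email
FROM    Ranked
WHERE   rn = 1;

Whether you rank with ROW_NUMBER(), use SELECT DISTINCT, or GROUP BY depends on whether non-key columns also need to be reconciled.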
Thanks & Regards,
Nakul Vachhrajani.
http://nakulvachhrajani.com
Follow me on
Twitter: @sqltwins
January 30, 2017 at 11:18 pm
It is easier to use the Sort transformation. Check the box 'Remove rows with duplicate sort values'. It is a blocking transformation, but it works fast even with millions of rows! Writing data to a temporary table is also blocking, because the work must be done before it is possible to continue. And writing and reading data to and from TempDB takes time!
January 30, 2017 at 11:38 pm
csj - Monday, January 30, 2017 11:18 PM:
It is easier to use the Sort transformation. Check the box 'Remove rows with duplicate sort values'. It is a blocking transformation, but it works fast even with millions of rows! Writing data to a temporary table is also blocking, because the work must be done before it is possible to continue. And writing and reading data to and from TempDB takes time!
Yes, for smaller datasets (a couple of hundred rows), Sort is fine.
For larger datasets, we run into the same problem - Sort is a blocking transformation and therefore can become a performance bottleneck.
Thanks & Regards,
Nakul Vachhrajani.
http://nakulvachhrajani.com
Follow me on
Twitter: @sqltwins
January 30, 2017 at 11:51 pm
Nakul Vachhrajani - Monday, January 30, 2017 10:34 PM:
The problem is scalability. Aggregations are blocking transforms: they block the data flow in the pipeline and steadily consume memory, because they hold the entire result set in memory to perform the aggregation. When tasked with millions or trillions of records, things start to get ugly: your servers run out of memory, paging to disk sets in, and ultimately the package slows to a crawl.
The simplest approach in these cases is to dump everything into a staging/temporary SQL Server table and then let T-SQL handle it. SQL Server is built for set-based data manipulation and management; the .NET objects that an SSIS package ultimately translates to are not.
I have to ask: when was the last time you had "trillions of records" in a staging table to load into another table?
--Jeff Moden
Change is inevitable... Change for the better is not.
January 30, 2017 at 11:54 pm
Nakul Vachhrajani - Monday, January 30, 2017 11:38 PM:
csj - Monday, January 30, 2017 11:18 PM:
It is easier to use the Sort transformation. Check the box 'Remove rows with duplicate sort values'. It is a blocking transformation, but it works fast even with millions of rows! Writing data to a temporary table is also blocking, because the work must be done before it is possible to continue. And writing and reading data to and from TempDB takes time!
Yes, for smaller datasets (a couple of hundred rows), Sort is fine.
For larger datasets, we run into the same problem - Sort is a blocking transformation and therefore can become a performance bottleneck.
So post the T-SQL alternative. 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
January 30, 2017 at 11:58 pm
Jeff Moden - Monday, January 30, 2017 11:51 PM:
Nakul Vachhrajani - Monday, January 30, 2017 10:34 PM:
The problem is scalability. Aggregations are blocking transforms: they block the data flow in the pipeline and steadily consume memory, because they hold the entire result set in memory to perform the aggregation. When tasked with millions or trillions of records, things start to get ugly: your servers run out of memory, paging to disk sets in, and ultimately the package slows to a crawl.
The simplest approach in these cases is to dump everything into a staging/temporary SQL Server table and then let T-SQL handle it. SQL Server is built for set-based data manipulation and management; the .NET objects that an SSIS package ultimately translates to are not.
I have to ask: when was the last time you had "trillions of records" in a staging table to load into another table?
I encountered this as part of the product we are currently working on, where we need to migrate enterprise data from a legacy on-premises application to the cloud. Sorts and Aggregations in SSIS became performance killers, and we had to resort to doing all the data computation in staging (an on-premises SQL Server database) using T-SQL (joins and the rest).
At a simplistic level, we:
- Extract: all data extracted to staging as-is
- Transform: all cleansing and transformations done using T-SQL (see the sketch below)
- Load: SSIS used only to blindly take data from staging and pump it to the cloud platform
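For illustration only, the de-duplication part of such a transform step might look roughly like this; the staging table, key columns and tie-breaking column are invented for the example, not the actual product code:

-- Sketch only: delete duplicate rows in place in a hypothetical staging table,
-- keeping the most recently loaded row per business key. Names are illustrative.
;WITH Dupes AS
(
    SELECT  ROW_NUMBER() OVER (PARTITION BY CustomerCode, SourceSystem
                               ORDER BY LoadedAtUtc DESC) AS rn
    FROM    stg.CustomerExtract
)
DELETE FROM Dupes
WHERE   rn > 1;

After that, the SSIS load step can treat the staging table as already clean.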
Thanks & Regards,
Nakul Vachhrajani.
http://nakulvachhrajani.com
Follow me on
Twitter: @sqltwins
January 31, 2017 at 3:05 am
Nakul Vachhrajani - Monday, January 30, 2017 11:38 PM:
csj - Monday, January 30, 2017 11:18 PM:
It is easier to use the Sort transformation. Check the box 'Remove rows with duplicate sort values'. It is a blocking transformation, but it works fast even with millions of rows! Writing data to a temporary table is also blocking, because the work must be done before it is possible to continue. And writing and reading data to and from TempDB takes time!
Yes, for smaller datasets (a couple of hundred rows), Sort is fine.
For larger datasets, we run into the same problem - Sort is a blocking transformation and therefore can become a performance bottleneck.
I do not know what computer you are using, but on my small Surface it takes less than half a minute for more than 6 million rows. On a small server, less than 10 seconds!
Carsten
January 31, 2017 at 5:19 am
csj - Tuesday, January 31, 2017 3:05 AM:
Nakul Vachhrajani - Monday, January 30, 2017 11:38 PM:
csj - Monday, January 30, 2017 11:18 PM:
It is easier to use the Sort transformation. Check the box 'Remove rows with duplicate sort values'. It is a blocking transformation, but it works fast even with millions of rows! Writing data to a temporary table is also blocking, because the work must be done before it is possible to continue. And writing and reading data to and from TempDB takes time!
Yes, for smaller datasets (a couple of hundred rows), Sort is fine.
For larger datasets, we run into the same problem - Sort is a blocking transformation and therefore can become a performance bottleneck.
I do not know what computer you are using, but on my small Surface it takes less than half a minute for more than 6 million rows. On a small server, less than 10 seconds!
Carsten
Keep in mind, this is not only about the number of rows, but also about the number of columns and the type of data to be processed.
I have never encountered data to be imported where it was simple and easy to identify the duplicates and remove them.
And even though my datasets have been under a few thousand rows, it took more than a few minutes to process them on average.
January 31, 2017 at 6:48 am
My 6 million rows have 11 columns - Firstname, Lastname, Street, Zipcode, .... - with the right datatypes, and it takes 8 seconds from a table and 6.5 seconds from a flat file with no datatypes changed, so all columns are strings of length 50.
If I take the columns from the table twice, so that I am sorting on 22 columns, it takes 14 seconds. My computer is a desktop with 32 GB RAM and 2 quad-core CPUs.
January 31, 2017 at 7:16 am
carsten.saastamoinen - Tuesday, January 31, 2017 6:48 AM:
My 6 million rows have 11 columns - Firstname, Lastname, Street, Zipcode, .... - with the right datatypes, and it takes 8 seconds from a table and 6.5 seconds from a flat file with no datatypes changed, so all columns are strings of length 50. If I take the columns from the table twice, so that I am sorting on 22 columns, it takes 14 seconds. My computer is a desktop with 32 GB RAM and 2 quad-core CPUs.
So let me get this straight: you are running a simple 11-column query, with no datatype matching/conversion or transformation, on a 32 GB, dual-CPU, 8-core monster (which, if you count hyper-threading, could present at least double that vCPU count),
and you feel that your processing time is a typical scenario?
Well, I did a test run of a simple SELECT on a 5-column table, with all columns varchar(50) except the last, which is varchar(8000), and a record count of 6.7 million rows,
and it takes 36 seconds just to select the data on a SQL VM with 4 CPUs @ 2.40 GHz and 8 GB RAM.
January 31, 2017 at 7:46 am
vl1969-734655 - Tuesday, January 31, 2017 7:16 AM:
carsten.saastamoinen - Tuesday, January 31, 2017 6:48 AM:
My 6 million rows have 11 columns - Firstname, Lastname, Street, Zipcode, .... - with the right datatypes, and it takes 8 seconds from a table and 6.5 seconds from a flat file with no datatypes changed, so all columns are strings of length 50. If I take the columns from the table twice, so that I am sorting on 22 columns, it takes 14 seconds. My computer is a desktop with 32 GB RAM and 2 quad-core CPUs.
So let me get this straight: you are running a simple 11-column query, with no datatype matching/conversion or transformation, on a 32 GB, dual-CPU, 8-core monster (which, if you count hyper-threading, could present at least double that vCPU count),
and you feel that your processing time is a typical scenario?
Well, I did a test run of a simple SELECT on a 5-column table, with all columns varchar(50) except the last, which is varchar(8000), and a record count of 6.7 million rows,
and it takes 36 seconds just to select the data on a SQL VM with 4 CPUs @ 2.40 GHz and 8 GB RAM.
We are talking about a staging DB - so typically a BI solution. 32 GB and 8 cores is not a monster but a small server, especially when we are discussing loading millions of rows. And it is not a normal scenario to find duplicates on VARCHAR(8000) columns; the more normal scenario is to find duplicates on 22 columns with different datatypes - a mix of VARCHAR(20), VARCHAR(40), INT, SMALLINT, DATE, ..... So 14 seconds for finding duplicates on 22 columns is the more realistic scenario.
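For what it's worth, identifying those duplicate groups in T-SQL is essentially a GROUP BY over the key columns; the table and column names below are invented for the example:

-- Sketch: list duplicate groups across a wide, mixed-datatype business key.
-- Table and column names are illustrative assumptions only.
SELECT  FirstName, LastName, Street, ZipCode, BirthDate, CustomerType,
        COUNT(*) AS DuplicateCount
FROM    dbo.CustomerStaging
GROUP BY FirstName, LastName, Street, ZipCode, BirthDate, CustomerType
HAVING  COUNT(*) > 1;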
January 31, 2017 at 7:51 am
I appreciate learning about new widgets in SSIS as I'm still ramping up with the tool. This and the Sort technique may come in handy someday and I thank the posters for these tips.
Performance is something you can argue about until the cows come home. It's good to keep in mind, but it will be different for almost every situation.
January 31, 2017 at 3:44 pm
Nakul Vachhrajani - Monday, January 30, 2017 11:58 PM:
Jeff Moden - Monday, January 30, 2017 11:51 PM:
Nakul Vachhrajani - Monday, January 30, 2017 10:34 PM:
The problem is scalability. Aggregations are blocking transforms: they block the data flow in the pipeline and steadily consume memory, because they hold the entire result set in memory to perform the aggregation. When tasked with millions or trillions of records, things start to get ugly: your servers run out of memory, paging to disk sets in, and ultimately the package slows to a crawl.
The simplest approach in these cases is to dump everything into a staging/temporary SQL Server table and then let T-SQL handle it. SQL Server is built for set-based data manipulation and management; the .NET objects that an SSIS package ultimately translates to are not.
I have to ask: when was the last time you had "trillions of records" in a staging table to load into another table?
I encountered this as part of the product we are currently working on, where we need to migrate enterprise data from a legacy on-premises application to the cloud. Sorts and Aggregations in SSIS became performance killers, and we had to resort to doing all the data computation in staging (an on-premises SQL Server database) using T-SQL (joins and the rest).
At a simplistic level, we:
- Extract: all data extracted to staging as-is
- Transform: all cleansing and transformations done using T-SQL
- Load: SSIS used only to blindly take data from staging and pump it to the cloud platform
Understood and agreed, but "trillions" of rows in a staging table?
--Jeff Moden
Change is inevitable... Change for the better is not.
January 31, 2017 at 9:56 pm
Shifting gears a bit, I wouldn't use SSIS for any of this other than maybe as a process controller. Between things like the IGNORE_DUP_KEY option as part of an index definition, some very fast methods for identifying duplicates in T-SQL, and possibly taking advantage of some high-speed, minimally logged load techniques, I'd write a stored procedure for this.
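As a rough sketch of that idea (IGNORE_DUP_KEY and the TABLOCK hint are real SQL Server features, but the tables and columns here are invented, and minimal logging only applies under the right recovery model and conditions):

-- Sketch only: a hypothetical target table whose unique clustered index silently
-- discards duplicate keys on insert, loaded from an assumed staging table.
CREATE TABLE dbo.CustomerClean
(
    CustomerCode varchar(20) NOT NULL,
    FirstName    varchar(50) NULL,
    LastName     varchar(50) NULL
);

CREATE UNIQUE CLUSTERED INDEX UX_CustomerClean_CustomerCode
    ON dbo.CustomerClean (CustomerCode)
    WITH (IGNORE_DUP_KEY = ON);   -- duplicate keys are dropped with a warning, not an error

-- dbo.CustomerStaging is assumed to already exist and hold the raw rows.
-- TABLOCK can enable minimal logging when the other prerequisites are met.
INSERT INTO dbo.CustomerClean WITH (TABLOCK) (CustomerCode, FirstName, LastName)
SELECT  CustomerCode, FirstName, LastName
FROM    dbo.CustomerStaging;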
--Jeff Moden
Change is inevitable... Change for the better is not.