Best way to handle grouping in SSIS?

  • I have a set of flat files that I am loading iteratively using the Foreach Loop container. The load is pretty straightforward: 10 columns from the source (CSV) are dumped into 10 destination columns (a SQL Server 2008 table).

    I have another requirement: to group by PersonId and load MAX(TransactionDate) into another table.

    1. Can I perform this as an in-memory transformation using an existing task, like a grouping/aggregate task, or is there a better solution?

    a) How do I compute MAX(TransactionDate) in SSIS?

    2. Do I have to create two different source components, or can I use the same source with a Multicast to select only the two columns I need?

    Structure of the (grouped data) destination table and definition of columns:

    PersonId (int) | TransactionDate (datetime2(7))

    Any ideas appreciated. Thank you.

  • 1) Yes you can, using the Aggregate task. This lets you do effectively the same thing as GROUP BY in SQL. It's not really recommended though, unless you have no other options. This is because the Aggregate task is a blocking operation, meaning rows can't flow out of the task until *all* rows from the source have flowed into it. Depending on the number of records coming in, this could crash the machine running the package, as blocking transforms don't release their memory until they have completed.

    A better option would be to introduce your grouping operations into your source instead, and handle the grouping at the SQL level.

    2) Same as in 1): you would use the Aggregate task, which supports MAX as one of its aggregation operations.

    3) You can use multicast. Your CSV source will feed the rows into the multicast, then one branch will write the full values into your destination table, while the other will feed the rows into the Aggregate transform, which will then output to your aggregate table.

    What I would recommend you do, though, is split the work into two tasks. In the first task, you take all the rows from your CSV file and write them to your destination table.

    Then, in your second task, you would have as your source a SQL query against the destination table from the first step, grouping the records by PersonId and taking MAX(TransactionDate), and write those records to the aggregate table. Another option would be to just use an Execute SQL Task and run an INSERT INTO ... SELECT statement, or a stored proc that does the same (see the sketch below).
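
    As a minimal sketch of that second step, assuming the rows were staged into a table called dbo.Transactions and the grouped rows go into dbo.PersonMaxTransaction (both names are placeholders for whatever your tables are actually called):

    ```sql
    -- Grouped insert for the Execute SQL Task option; table names are assumptions.
    INSERT INTO dbo.PersonMaxTransaction (PersonId, TransactionDate)
    SELECT   PersonId,
             MAX(TransactionDate) AS TransactionDate
    FROM     dbo.Transactions   -- table loaded from the CSVs in the first task
    GROUP BY PersonId;
    ```

    The SELECT on its own is also what you would paste as the SQL command of the source in the second data flow, if you prefer that route over the Execute SQL Task.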

  • The data that I am dealing with is >500 million rows for one table (the source is split across multiple files), times 20 tables, which is why I was trying to avoid a re-scan of the data and wanted to perform this grouping as a parallel transform.

    Will compare the aggregate and the SQL option to see which works better. Thanks for the suggestion.

  • Unless you have a fairly powerful computer, I can almost guarantee that using the Aggregate task *will* cause it to crash.

    For reference, the biggest working set I've dealt with is an ETL that involved three lookup tasks, covering a total of about 20 million records, and a non-blocking import-export task taking CSV files that could contain up to 5 million records.

    That ETL consumed roughly 10 GB of memory while running.

    A blocking transform with 500 million records ... that will take quite a lot of memory to run.

    SQL Server, on the other hand, is quite good at dealing with that kind of workload.

  • Wow. That is some good insight. Thanks.
