SSIS Merge Join Must Use SSIS Sort Component!

  • Many online articles and blog posts suggest sorting in the database and then setting the IsSorted property on the corresponding SSIS data source in order to satisfy certain SSIS components' requirement that the inputs must be sorted. I've found that in the case of the Merge Join component (and maybe other components as well), this doesn't always work as intended.

    Here's an example you can build yourself...

    - Locate (or create) a SQL Server database with the SQL_Latin1_General_CP1_CI_AS (case insensitive) collation

    - Create a data flow with two OLEDB sources, pointing to these two queries. Note that query #2 is identical to query #1, with one extra row.

    -- Query #1

    SELECT 'ABb' AS Col1,

    1 AS Query1RowNumber

    UNION ALL

    SELECT 'AbB',

    2

    ORDER BY Col1

    -- Query #2

    SELECT 'ABb' AS Col1,

    1 AS Query2RowNumber

    UNION ALL

    SELECT 'AbB',

    2

    UNION ALL

    SELECT 'Ab',

    3

    ORDER BY Col1

    - Using the advanced editor, set the "IsSorted" property on each data source, and set the "SortOrder" to 1 for the Col1 column

    - Add a merge join component, join type "full outer join". Make "Col1" the join key, and allow Query1RowNumber and Query2RowNumber to pass through.

    - Add a destination of some kind, and put a data viewer on the output of the merge join

    The expected result is this, right?

    Query1RowNumber Query2RowNumber

    --------------- ---------------

    1 1

    2 2

    NULL 3

    What you actually get is:

    Query1RowNumber Query2RowNumber

    --------------- ---------------

    NULL 3

    NULL 2

    1 1

    2 NULL

    Row number 2 (Col1 = 'AbB') isn't matched in the full outer join, and it is returned as two "unique" rows!

    My conclusion from this is that SSIS components which require sorted inputs actually require "SSIS-sorted" inputs, or they can return unexpected results!

    One last thing to mention... This issue only manifests itself when sorting character data that sorts differently in SQL versus SSIS. If your key columns are non-alpha, or have no punctuation, mixed-case, accented, or special characters, you should be safe from this issue.

  • Is the datatype for Query1RowNumber & Query2RowNumber exactly the same. I'm not asking if the query is the same but the underlying datatype of that column. I know that SSIS tends to be very sensitive to such things such as '2' != '2 '..

    CEWII

  • Yes, the datatypes are exactly the same. According to the Metadata viewer, both "Col1" columns are DT_STR, length 3, Code Page 1252. Both RowNumber columns are DT_I4.

  • This can be solved by using collations:

    http://microsoft-ssis.blogspot.com/2011/03/sorting-in-sql-vs-sorting-in-ssis.html

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply