Help with Query Results and Collation

  • I should know the answer to these questions, but I'm having trouble with a couple of my queries.

    First, my goal here is to compare data from a table in a source database with data from a table in another database to see whether there are any differences. Here's a simple example:

    DECLARE @TableA TABLE
    (ID int,
    label VARCHAR(50)
    )

    DECLARE @TableB TABLE
    (ID int,
    label VARCHAR(50)
    )

    INSERT INTO @TableA VALUES (1, NULL)
    INSERT INTO @TableA VALUES (2, 'test')
    INSERT INTO @TableA VALUES (3, 'works')
    INSERT INTO @TableA VALUES (5, 'inonlytablea')


    INSERT INTO @TableB VALUES (1, NULL)
    INSERT INTO @TableB VALUES (2, 'test')
    INSERT INTO @TableB VALUES (3, 'works')
    INSERT INTO @TableB VALUES (4, 'inonlytableb')

    SELECT *
    FROM (
    SELECT * from @TableA
    EXCEPT
    SELECT * from @TableB
    UNION ALL
    SELECT * from @TableB
    EXCEPT
    SELECT * from @TableA
    ) t1

    Obviously there's a difference in each direction. I expect:

    ID, label

    5, 'inonlytablea'

    4, 'inonlytableb'

    However, what I get is:

    ID, label

    4, 'inonlytableb'

    That's problem #1. Again, I'm sure it's something dumb, but I just came across it in trying to come up with an example to demonstrate the second issue I'm having, which is that the source table's columns are all of the Latin1_General_BIN collation whereas the target table's are SQL_Latin1_General_CP1_CI_AS.

    How would you address this? I mean, I know I could do something like:

    DECLARE @TableA TABLE
    (ID int,
    label VARCHAR(50) COLLATE Latin1_General_BIN
    )

    DECLARE @TableB TABLE
    (ID int,
    label VARCHAR(50)
    )

    INSERT INTO @TableA VALUES (1, NULL)
    INSERT INTO @TableA VALUES (2, 'test')
    INSERT INTO @TableA VALUES (3, 'works')
    INSERT INTO @TableA VALUES (5, 'inonlytablea')


    INSERT INTO @TableB VALUES (1, NULL)
    INSERT INTO @TableB VALUES (2, 'test')
    INSERT INTO @TableB VALUES (3, 'works')
    INSERT INTO @TableB VALUES (4, 'inonlytableb')

    SELECT *
    FROM (
    SELECT ID, label from @TableA
    EXCEPT
    SELECT ID, label COLLATE Latin1_General_BIN from @TableB
    UNION ALL
    SELECT ID, label COLLATE Latin1_General_BIN from @TableB
    EXCEPT
    SELECT ID, label from @TableA
    ) t1

    but that doesn't seem to be great for performance. And in reality, we're talking about thousands of columns across 100s of tables.

    And my third problem...let's say there's a good solution to the collation conflict. Some of these source fields are of "text" data type, which cannot be used in an EXCEPT. So what is a good way I can compare these fields and return the delta so I can synchronize the target to the source? Here's an example of the issue--similar to the one above:

    DECLARE @TableA TABLE
    (ID int,
    label text
    )

    DECLARE @TableB TABLE
    (ID int,
    label text
    )

    INSERT INTO @TableA VALUES (1, NULL)
    INSERT INTO @TableA VALUES (2, 'test')
    INSERT INTO @TableA VALUES (3, 'works')
    INSERT INTO @TableA VALUES (5, 'inonlytablea')


    INSERT INTO @TableB VALUES (1, NULL)
    INSERT INTO @TableB VALUES (2, 'test')
    INSERT INTO @TableB VALUES (3, 'works')
    INSERT INTO @TableB VALUES (4, 'inonlytableb')

    SELECT *
    FROM (
    SELECT * from @TableA
    EXCEPT
    SELECT * from @TableB
    UNION ALL
    SELECT * from @TableB
    EXCEPT
    SELECT * from @TableA
    ) t1

    Any help on any of these three things would be greatly appreciated.

    Thanks in advance,

    Mike

    Mike Scalise, PMP
    https://www.michaelscalise.com

  • Thanks for posting your issue and hopefully someone will answer soon.

    This is an automated bump to increase visibility of your question.

  • Mike Scalise wrote:

    I should know the answer to these questions, but I'm having trouble with a couple of my queries.

    First, my goal here is to compare data from a table in a source database with data from a table in another database to see whether there are any differences. Here's a simple example:

    DECLARE @TableA TABLE
    (ID int,
    label VARCHAR(50)
    )

    DECLARE @TableB TABLE
    (ID int,
    label VARCHAR(50)
    )

    INSERT INTO @TableA VALUES (1, NULL)
    INSERT INTO @TableA VALUES (2, 'test')
    INSERT INTO @TableA VALUES (3, 'works')
    INSERT INTO @TableA VALUES (5, 'inonlytablea')


    INSERT INTO @TableB VALUES (1, NULL)
    INSERT INTO @TableB VALUES (2, 'test')
    INSERT INTO @TableB VALUES (3, 'works')
    INSERT INTO @TableB VALUES (4, 'inonlytableb')

    SELECT *
    FROM (
    SELECT * from @TableA
    EXCEPT
    SELECT * from @TableB
    UNION ALL
    SELECT * from @TableB
    EXCEPT
    SELECT * from @TableA
    ) t1

    Obviously there's a difference in each direction. I expect:

    ID, label

    5, 'inonlytablea'

    4, 'inonlytableb'

    However, what I get is:

    ID, label

    4, 'inonlytableb'

    That's problem #1. Again, I'm sure it's something dumb, but I just came across it in trying to come up with an example to demonstrate the second issue I'm having, which is that the source table's columns are all of the Latin1_General_BIN collation whereas the target table's are SQL_Latin1_General_CP1_CI_AS.

    How would you address this? I mean, I know I could do something like:

    DECLARE @TableA TABLE
    (ID int,
    label VARCHAR(50) COLLATE Latin1_General_BIN
    )

    DECLARE @TableB TABLE
    (ID int,
    label VARCHAR(50)
    )

    INSERT INTO @TableA VALUES (1, NULL)
    INSERT INTO @TableA VALUES (2, 'test')
    INSERT INTO @TableA VALUES (3, 'works')
    INSERT INTO @TableA VALUES (5, 'inonlytablea')


    INSERT INTO @TableB VALUES (1, NULL)
    INSERT INTO @TableB VALUES (2, 'test')
    INSERT INTO @TableB VALUES (3, 'works')
    INSERT INTO @TableB VALUES (4, 'inonlytableb')

    SELECT *
    FROM (
    SELECT ID, label from @TableA
    EXCEPT
    SELECT ID, label COLLATE Latin1_General_BIN from @TableB
    UNION ALL
    SELECT ID, label COLLATE Latin1_General_BIN from @TableB
    EXCEPT
    SELECT ID, label from @TableA
    ) t1

    but that doesn't seem to be great for performance. And in reality, we're talking about thousands of columns across 100s of tables.

    And my third problem...let's say there's a good solution to the collation conflict. Some of these source fields are of "text" data type, which cannot be used in an EXCEPT. So what is a good way I can compare these fields and return the delta so I can synchronize the target to the source? Here's an example of the issue--similar to the one above:

    DECLARE @TableA TABLE
    (ID int,
    label text
    )

    DECLARE @TableB TABLE
    (ID int,
    label text
    )

    INSERT INTO @TableA VALUES (1, NULL)
    INSERT INTO @TableA VALUES (2, 'test')
    INSERT INTO @TableA VALUES (3, 'works')
    INSERT INTO @TableA VALUES (5, 'inonlytablea')


    INSERT INTO @TableB VALUES (1, NULL)
    INSERT INTO @TableB VALUES (2, 'test')
    INSERT INTO @TableB VALUES (3, 'works')
    INSERT INTO @TableB VALUES (4, 'inonlytableb')

    SELECT *
    FROM (
    SELECT * from @TableA
    EXCEPT
    SELECT * from @TableB
    UNION ALL
    SELECT * from @TableB
    EXCEPT
    SELECT * from @TableA
    ) t1

    Any help on any of these three things would be greatly appreciated.

    Thanks in advance,

    Mike

    I haven't worked with collations much, so I don't really have an answer for that.  The first problem is easy.

    What I think you want is (A - B) + (B - A).  What you wrote was A - B + B - A which is ((A - B) + B) - A and is equivalent to B - A.  You just need to add the appropriate parens to get it to evaluate properly.

    DECLARE @TableA TABLE
    (ID int,
    label VARCHAR(MAX)
    )

    DECLARE @TableB TABLE
    (ID int,
    label VARCHAR(MAX)
    )

    INSERT INTO @TableA VALUES (1, NULL)
    INSERT INTO @TableA VALUES (2, 'test')
    INSERT INTO @TableA VALUES (3, 'works')
    INSERT INTO @TableA VALUES (5, 'inonlytablea')


    INSERT INTO @TableB VALUES (1, NULL)
    INSERT INTO @TableB VALUES (2, 'test')
    INSERT INTO @TableB VALUES (3, 'works')
    INSERT INTO @TableB VALUES (4, 'inonlytableb')

    SELECT *
    FROM (
    (
    SELECT * from @TableA
    EXCEPT
    SELECT * from @TableB
    )
    UNION ALL
    (
    SELECT * from @TableB
    EXCEPT
    SELECT * from @TableA
    )
    ) t1

    The second issue is also easy.  text and ntext have both been deprecated for ages.  You should have converted these field to VARCHAR(MAX) or NVARCHAR(MAX), both of which can be used with EXCEPT.

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • Drew,

    I feel like such an idiot. You're absolutely right about the parentheses. That solved problem #1.

    I knew the text data type was deprecated but never thought to CONVERT or CAST it o a varchar to then use EXCEPT. I can do that to address that issue, so thank you! However, what this problem (#2) and the collation question (#3) have in common is that if I continue to do my compares (using EXCEPT) with COLLATEs for each column and CONVERTs for some of the columns, there's so much additional processing, not to mention indexes wouldn't be able to be leveraged.

    I think this'll get me over the hump but if you or anyone else has any thoughts for how to address these en masse, I'd love to hear them. Again, I know you don't work much with collations, but in general terms, it's a non-SARG-able operation (like the CONVERT in this case) that I would really like to address in some way.

    In any case, I really appreciate the response and help!!

    Mike

    Mike Scalise, PMP
    https://www.michaelscalise.com

  • Mike Scalise wrote:

    Drew,

    I feel like such an idiot. You're absolutely right about the parentheses. That solved problem #1.

    I knew the text data type was deprecated but never thought to CONVERT or CAST it o a varchar to then use EXCEPT. I can do that to address that issue, so thank you! However, what this problem (#2) and the collation question (#3) have in common is that if I continue to do my compares (using EXCEPT) with COLLATEs for each column and CONVERTs for some of the columns, there's so much additional processing, not to mention indexes wouldn't be able to be leveraged.

    I think this'll get me over the hump but if you or anyone else has any thoughts for how to address these en masse, I'd love to hear them. Again, I know you don't work much with collations, but in general terms, it's a non-SARG-able operation (like the CONVERT in this case) that I would really like to address in some way.

    In any case, I really appreciate the response and help!!

    Mike

    As far as I know, the only way you're going to be able to get the performance improvements you are looking for is to standardize collations.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • I may be wrong, but if you're doing SELECT * FROM A EXCEPT SELECT * FROM B, I wouldn't expect indexes to be used as all rows in both tables need to be read in order to check whether there is a match.

    When I need to do similar comparisons, I find it useful to add a hard-coded value to each column list so that I know which table the row came from in the final result.

    SELECT 'Live', tableA.*
    EXCEPT
    SELECT 'Live', tableB.*
    UNION
    SELECT 'Staging', tableB.*
    EXCEPT
    SELECT 'Staging', tableA.*
  • Set operators work best when the tables are identical.  In this case there are different collations between tables.  Since "we're talking about thousands of columns across 100s of tables" it would be a monstrous and tedious job to manually compare all these tables/columns.  Once more than a handful of tables are involved you pretty much have to use db tools like Redgate Compare.  If you have no budget for tools you could try Visual Studio Community Edition, it's free, which has schema and data comparisons.

    Just curious, why switch collations?  If the source db operates satisfactorily why not keep the existing collation.  "bin" (binary) and "ci" (case insensitive) collations are generally similar.

    Imo and fwiw, to get good results from comparison tools each table should have an integer identity primary key.  Maybe others have had good results with alternate keys... but not in my experience.

    Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply