Help with Query Results and Collation

Question

Post reply

Help with Query Results and Collation

Mike Scalise

SSCrazy Eights

Points: 8145

August 23, 2019 at 8:37 pm

#3674093

I should know the answer to these questions, but I'm having trouble with a couple of my queries.

First, my goal here is to compare data from a table in a source database with data from a table in another database to see whether there are any differences. Here's a simple example:

DECLARE @TableA TABLE
(ID int,
label VARCHAR(50)
)

DECLARE @TableB TABLE
(ID int,
label VARCHAR(50)
)

INSERT INTO @TableA VALUES (1, NULL)
INSERT INTO @TableA VALUES (2, 'test')
INSERT INTO @TableA VALUES (3, 'works')
INSERT INTO @TableA VALUES (5, 'inonlytablea')


INSERT INTO @TableB VALUES (1, NULL)
INSERT INTO @TableB VALUES (2, 'test')
INSERT INTO @TableB VALUES (3, 'works')
INSERT INTO @TableB VALUES (4, 'inonlytableb')

SELECT *
FROM (
SELECT * from @TableA
EXCEPT
SELECT * from @TableB
UNION ALL
SELECT * from @TableB
EXCEPT
SELECT * from @TableA
) t1

Obviously there's a difference in each direction. I expect:

ID, label

5, 'inonlytablea'

4, 'inonlytableb'

However, what I get is:

ID, label

4, 'inonlytableb'

That's problem #1. Again, I'm sure it's something dumb, but I just came across it in trying to come up with an example to demonstrate the second issue I'm having, which is that the source table's columns are all of the Latin1_General_BIN collation whereas the target table's are SQL_Latin1_General_CP1_CI_AS.

How would you address this? I mean, I know I could do something like:

DECLARE @TableA TABLE
(ID int,
label VARCHAR(50) COLLATE Latin1_General_BIN
)

DECLARE @TableB TABLE
(ID int,
label VARCHAR(50)
)

INSERT INTO @TableA VALUES (1, NULL)
INSERT INTO @TableA VALUES (2, 'test')
INSERT INTO @TableA VALUES (3, 'works')
INSERT INTO @TableA VALUES (5, 'inonlytablea')


INSERT INTO @TableB VALUES (1, NULL)
INSERT INTO @TableB VALUES (2, 'test')
INSERT INTO @TableB VALUES (3, 'works')
INSERT INTO @TableB VALUES (4, 'inonlytableb')

SELECT *
FROM (
SELECT ID, label from @TableA
EXCEPT
SELECT ID, label COLLATE Latin1_General_BIN from @TableB
UNION ALL
SELECT ID, label COLLATE Latin1_General_BIN from @TableB
EXCEPT
SELECT ID, label from @TableA
) t1

but that doesn't seem to be great for performance. And in reality, we're talking about thousands of columns across 100s of tables.

And my third problem...let's say there's a good solution to the collation conflict. Some of these source fields are of "text" data type, which cannot be used in an EXCEPT. So what is a good way I can compare these fields and return the delta so I can synchronize the target to the source? Here's an example of the issue--similar to the one above:

DECLARE @TableA TABLE
(ID int,
label text
)

DECLARE @TableB TABLE
(ID int,
label text
)

INSERT INTO @TableA VALUES (1, NULL)
INSERT INTO @TableA VALUES (2, 'test')
INSERT INTO @TableA VALUES (3, 'works')
INSERT INTO @TableA VALUES (5, 'inonlytablea')


INSERT INTO @TableB VALUES (1, NULL)
INSERT INTO @TableB VALUES (2, 'test')
INSERT INTO @TableB VALUES (3, 'works')
INSERT INTO @TableB VALUES (4, 'inonlytableb')

SELECT *
FROM (
SELECT * from @TableA
EXCEPT
SELECT * from @TableB
UNION ALL
SELECT * from @TableB
EXCEPT
SELECT * from @TableA
) t1

Any help on any of these three things would be greatly appreciated.

Thanks in advance,

Mike

Mike Scalise, PMP
https://www.michaelscalise.com

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply

Site Owners SSC Guru Points: 80330 More actions · Answer 1

Thanks for posting your issue and hopefully someone will answer soon.

This is an automated bump to increase visibility of your question.

drew.allen SSC Guru Points: 76985 More actions · Answer 2

Mike Scalise wrote:

I should know the answer to these questions, but I'm having trouble with a couple of my queries.
First, my goal here is to compare data from a table in a source database with data from a table in another database to see whether there are any differences. Here's a simple example:
DECLARE @TableA TABLE
(ID int,
label VARCHAR(50)
)

DECLARE @TableB TABLE
(ID int,
label VARCHAR(50)
)

INSERT INTO @TableA VALUES (1, NULL)
INSERT INTO @TableA VALUES (2, 'test')
INSERT INTO @TableA VALUES (3, 'works')
INSERT INTO @TableA VALUES (5, 'inonlytablea')


INSERT INTO @TableB VALUES (1, NULL)
INSERT INTO @TableB VALUES (2, 'test')
INSERT INTO @TableB VALUES (3, 'works')
INSERT INTO @TableB VALUES (4, 'inonlytableb')

SELECT *
FROM (
SELECT * from @TableA
EXCEPT
SELECT * from @TableB
UNION ALL
SELECT * from @TableB
EXCEPT
SELECT * from @TableA
) t1
Obviously there's a difference in each direction. I expect:
ID, label
5, 'inonlytablea'
4, 'inonlytableb'
However, what I get is:
ID, label
4, 'inonlytableb'
That's problem #1. Again, I'm sure it's something dumb, but I just came across it in trying to come up with an example to demonstrate the second issue I'm having, which is that the source table's columns are all of the Latin1_General_BIN collation whereas the target table's are SQL_Latin1_General_CP1_CI_AS.
How would you address this? I mean, I know I could do something like:
DECLARE @TableA TABLE
(ID int,
label VARCHAR(50) COLLATE Latin1_General_BIN
)

DECLARE @TableB TABLE
(ID int,
label VARCHAR(50)
)

INSERT INTO @TableA VALUES (1, NULL)
INSERT INTO @TableA VALUES (2, 'test')
INSERT INTO @TableA VALUES (3, 'works')
INSERT INTO @TableA VALUES (5, 'inonlytablea')


INSERT INTO @TableB VALUES (1, NULL)
INSERT INTO @TableB VALUES (2, 'test')
INSERT INTO @TableB VALUES (3, 'works')
INSERT INTO @TableB VALUES (4, 'inonlytableb')

SELECT *
FROM (
SELECT ID, label from @TableA
EXCEPT
SELECT ID, label COLLATE Latin1_General_BIN from @TableB
UNION ALL
SELECT ID, label COLLATE Latin1_General_BIN from @TableB
EXCEPT
SELECT ID, label from @TableA
) t1
but that doesn't seem to be great for performance. And in reality, we're talking about thousands of columns across 100s of tables.
And my third problem...let's say there's a good solution to the collation conflict. Some of these source fields are of "text" data type, which cannot be used in an EXCEPT. So what is a good way I can compare these fields and return the delta so I can synchronize the target to the source? Here's an example of the issue--similar to the one above:
DECLARE @TableA TABLE
(ID int,
label text
)

DECLARE @TableB TABLE
(ID int,
label text
)

INSERT INTO @TableA VALUES (1, NULL)
INSERT INTO @TableA VALUES (2, 'test')
INSERT INTO @TableA VALUES (3, 'works')
INSERT INTO @TableA VALUES (5, 'inonlytablea')


INSERT INTO @TableB VALUES (1, NULL)
INSERT INTO @TableB VALUES (2, 'test')
INSERT INTO @TableB VALUES (3, 'works')
INSERT INTO @TableB VALUES (4, 'inonlytableb')

SELECT *
FROM (
SELECT * from @TableA
EXCEPT
SELECT * from @TableB
UNION ALL
SELECT * from @TableB
EXCEPT
SELECT * from @TableA
) t1
Any help on any of these three things would be greatly appreciated.
Thanks in advance,
Mike

I haven't worked with collations much, so I don't really have an answer for that. The first problem is easy.

What I think you want is (A - B) + (B - A). What you wrote was A - B + B - A which is ((A - B) + B) - A and is equivalent to B - A. You just need to add the appropriate parens to get it to evaluate properly.

DECLARE @TableA TABLE
(ID int,
label VARCHAR(MAX)
)

DECLARE @TableB TABLE
(ID int,
label VARCHAR(MAX)
)

INSERT INTO @TableA VALUES (1, NULL)
INSERT INTO @TableA VALUES (2, 'test')
INSERT INTO @TableA VALUES (3, 'works')
INSERT INTO @TableA VALUES (5, 'inonlytablea')


INSERT INTO @TableB VALUES (1, NULL)
INSERT INTO @TableB VALUES (2, 'test')
INSERT INTO @TableB VALUES (3, 'works')
INSERT INTO @TableB VALUES (4, 'inonlytableb')

SELECT *
FROM (
(
SELECT * from @TableA
EXCEPT
SELECT * from @TableB
)
UNION ALL
(
SELECT * from @TableB
EXCEPT
SELECT * from @TableA
)
) t1

The second issue is also easy. text and ntext have both been deprecated for ages. You should have converted these field to VARCHAR(MAX) or NVARCHAR(MAX), both of which can be used with EXCEPT.

Drew

J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA

Mike Scalise SSCrazy Eights Points: 8145 More actions · Answer 3

Drew,

I feel like such an idiot. You're absolutely right about the parentheses. That solved problem #1.

I knew the text data type was deprecated but never thought to CONVERT or CAST it o a varchar to then use EXCEPT. I can do that to address that issue, so thank you! However, what this problem (#2) and the collation question (#3) have in common is that if I continue to do my compares (using EXCEPT) with COLLATEs for each column and CONVERTs for some of the columns, there's so much additional processing, not to mention indexes wouldn't be able to be leveraged.

I think this'll get me over the hump but if you or anyone else has any thoughts for how to address these en masse, I'd love to hear them. Again, I know you don't work much with collations, but in general terms, it's a non-SARG-able operation (like the CONVERT in this case) that I would really like to address in some way.

In any case, I really appreciate the response and help!!

Mike

Mike Scalise, PMP
https://www.michaelscalise.com

Phil Parkin SSC Guru Points: 246960 More actions · Answer 4

Mike Scalise wrote:

Drew,
I feel like such an idiot. You're absolutely right about the parentheses. That solved problem #1.
I knew the text data type was deprecated but never thought to CONVERT or CAST it o a varchar to then use EXCEPT. I can do that to address that issue, so thank you! However, what this problem (#2) and the collation question (#3) have in common is that if I continue to do my compares (using EXCEPT) with COLLATEs for each column and CONVERTs for some of the columns, there's so much additional processing, not to mention indexes wouldn't be able to be leveraged.
I think this'll get me over the hump but if you or anyone else has any thoughts for how to address these en masse, I'd love to hear them. Again, I know you don't work much with collations, but in general terms, it's a non-SARG-able operation (like the CONVERT in this case) that I would really like to address in some way.
In any case, I really appreciate the response and help!!
Mike

As far as I know, the only way you're going to be able to get the performance improvements you are looking for is to standardize collations.

The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin

Chris Wooding SSCarpal Tunnel Points: 4556 More actions · Answer 5

I may be wrong, but if you're doing SELECT * FROM A EXCEPT SELECT * FROM B, I wouldn't expect indexes to be used as all rows in both tables need to be read in order to check whether there is a match.

When I need to do similar comparisons, I find it useful to add a hard-coded value to each column list so that I know which table the row came from in the final result.

SELECT 'Live', tableA.*
EXCEPT
SELECT 'Live', tableB.*
UNION
SELECT 'Staging', tableB.*
EXCEPT
SELECT 'Staging', tableA.*

Steve Collins SSCrazy Points: 2157 More actions · Answer 6

Set operators work best when the tables are identical. In this case there are different collations between tables. Since "we're talking about thousands of columns across 100s of tables" it would be a monstrous and tedious job to manually compare all these tables/columns. Once more than a handful of tables are involved you pretty much have to use db tools like Redgate Compare. If you have no budget for tools you could try Visual Studio Community Edition, it's free, which has schema and data comparisons.

Just curious, why switch collations? If the source db operates satisfactorily why not keep the existing collation. "bin" (binary) and "ci" (case insensitive) collations are generally similar.

Imo and fwiw, to get good results from comparison tools each table should have an integer identity primary key. Maybe others have had good results with alternate keys... but not in my experience.

Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können