String value "looking" the same, but there is a difference (cannot use distinct)

Question

String value "looking" the same, but there is a difference (cannot use distinct)

donstarter

Old Hand

Points: 378
More actions
February 24, 2016 at 5:49 am

#308246

Hi all,
I am having an issue with using distinct on one single column in a table and getting multiple result rows for a string "looking" the same. When I hashbyte the values, I can see that there must be a difference.But I have no clue how to get rid of this behaviour... please help:
MS SQL Server 2014 (Server, DB and Field Collation are all the same - Latin1_General_CI_AS)
City column is defined as nvarchar(50)
Table with column "City", doing a "select distinct city from table1" I get below city multiple times:
Column City
Zweibrücken
Zweibrücken
Zweibrücken
When doing a Hashbyte, I can see that there seems to be a difference of these values:
Column CityColumn City_Hashbyte
Zweibrücken0x31244719AA18450C404BD27E1F0B07A36E7840A2
Zweibrücken0xBD0FF24BCB34A0A2C83AE9069E59F33A588110E9
Zweibrücken0xD7BEF38644E48185EBC60B207EFF33F7DE50D8BF
All data is imported from one single XML file onto one single table. So I really don't get where this difference is coming from and how this can be overcome.
Thanks for any input / suggestion.

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply

Phil Parkin SSC Guru Points: 246963 More actions · Answer 1

donstarter (2/24/2016)
Hi all,
I am having an issue with using distinct on one single column in a table and getting multiple result rows for a string "looking" the same. When I hashbyte the values, I can see that there must be a difference.But I have no clue how to get rid of this behaviour... please help:
MS SQL Server 2014 (Server, DB and Field Collation are all the same - Latin1_General_CI_AS)
City column is defined as nvarchar(50)
Table with column "City", doing a "select distinct city from table1" I get below city multiple times:
Column City
Zweibrücken
Zweibrücken
Zweibrücken
When doing a Hashbyte, I can see that there seems to be a difference of these values:
Column CityColumn City_Hashbyte
Zweibrücken0x31244719AA18450C404BD27E1F0B07A36E7840A2
Zweibrücken0xBD0FF24BCB34A0A2C83AE9069E59F33A588110E9
Zweibrücken0xD7BEF38644E48185EBC60B207EFF33F7DE50D8BF
All data is imported from one single XML file onto one single table. So I really don't get where this difference is coming from and how this can be overcome.
Thanks for any input / suggestion.

You could try pasting the data into Notepad++ and then selecting the View/Show Symbol/Show All Characters option. Maybe that will show something.

The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin

donstarter Old Hand Points: 378 More actions · Answer 2

Thanks for your input Phil,

Yes, I can see the difference of the strings when pasting into Notepad++ kind of same as I see the difference when doing a hashbyte. In two of the values are some is a "HOP" in front or after the "ü".

Any idea how I can overcome this?

Phil Parkin SSC Guru Points: 246963 More actions · Answer 3

donstarter (2/24/2016)
Thanks for your input Phil,
Yes, I can see the difference of the strings when pasting into Notepad++ kind of same as I see the difference when doing a hashbyte. In two of the values are some is a "HOP" in front or after the "ü".
Any idea how I can overcome this?

Use some sort of clean string function to remove the rubbish characters. If you can't find anything suitable (no cursors or while loops please :-)), post back and I'll see whether I can.

Or, if that is the only character causing you issues, find out its ASCII value and do a REPLACE() on the city name to remove it.

The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin

Ed Wagner SSC Guru Points: 287026 More actions · Answer 4

The way I see it, there are two directions you can go here.

1. If, like Phil said, you only have 1 or 2 characters that are causing you problems, you can REPLACE them.

2. If you have a bunch of problem characters AND you can define a range of ASCII values you want to keep and eliminate everything else, let us know. This is probably best solved with an ITVF to return only the "good" characters.

Whatever direction you go, please heed Phil's advice - no loops.

Tally Tables - Performance Personified
String Splitting with True Performance
Best practices on how to ask questions

donstarter Old Hand Points: 378 More actions · Answer 5

Hi all,

Thanks a lot for your input, that helped a lot! I will go for the REPLACE version for now, this should do the trick for the specific issue now. Thanks

Luis Cazares SSC Guru Points: 183706 More actions · Answer 6

donstarter (2/24/2016)
Hi all,
I am having an issue with using distinct on one single column in a table and getting multiple result rows for a string "looking" the same. When I hashbyte the values, I can see that there must be a difference.But I have no clue how to get rid of this behaviour... please help:
MS SQL Server 2014 (Server, DB and Field Collation are all the same - Latin1_General_CI_AS)
City column is defined as nvarchar(50)
Table with column "City", doing a "select distinct city from table1" I get below city multiple times:
Column City
Zweibrücken
Zweibrücken
Zweibrücken
When doing a Hashbyte, I can see that there seems to be a difference of these values:
Column CityColumn City_Hashbyte
Zweibrücken0x31244719AA18450C404BD27E1F0B07A36E7840A2
Zweibrücken0xBD0FF24BCB34A0A2C83AE9069E59F33A588110E9
Zweibrücken0xD7BEF38644E48185EBC60B207EFF33F7DE50D8BF
All data is imported from one single XML file onto one single table. So I really don't get where this difference is coming from and how this can be overcome.
Thanks for any input / suggestion.

Using the values you posted, it seems that there's an invisible value before or after the ü.

Check the following code:

WITH

Cities AS(

SELECT CityId, City

FROM (VALUES(1,N'Zweibrücken'),

(2,N'Zweibrücken'),

(3,N'Zweibrücken'))x(CityId, City)

),

E(n) AS(

SELECT n FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0))E(n)

),

E2(n) AS(

SELECT a.n FROM E a, E b

),

E4(n) AS(

SELECT a.n FROM E2 a, E2 b

),

cteTally(n) AS(

SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) n

FROM E4

)

SELECT c.*, SUBSTRING(City, n, 1), UNICODE(SUBSTRING(City, n, 1))

FROM cteTally t

JOIN Cities c ON t.n <= LEN(c.City)

ORDER BY CityId, n;

Luis C.
General Disclaimer:
Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

How to post data/code on a forum to get the best help: Option 1 / Option 2