August 29, 2014 at 7:32 am
Try this for the Tab and pipe
SELECT * FROM myTable WHERE MyField LIKE '%' + CHAR(9) + '%' OR MyField LIKE '%' + CHAR(124) + '%'
The number is the ASCII code for the character: 9 is the tab and 124 is the pipe.
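If you're ever unsure of a code, CHAR() and ASCII() are inverses of each other, so a quick sketch like this confirms the mapping (column aliases are just for illustration):

SELECT ASCII('|') AS PipeCode, -- returns 124
       CHAR(9)    AS TabChar;  -- returns the horizontal tab character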
September 3, 2014 at 9:45 pm
Katharine Mahon (8/29/2014)
How do you find an embedded tab or pipe in an address field? I ran into problems when exporting text files containing those special characters.
As begulley demonstrated above, it's pretty easy to find such rows of data. My question would be, what do you want to do about them? Delete them, replace them with spaces, use them as delimiters, or ???
--Jeff Moden
Change is inevitable... Change for the better is not.
September 4, 2014 at 6:18 am
We replace the characters with a blank or a space, depending on what they are. We remove them because they cause issues when exporting to Excel from Reporting Services.
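For anyone wanting to do the same, a minimal sketch of that cleanup, reusing the myTable / MyField names from the earlier example (test on a copy before running against live data):

UPDATE myTable
SET MyField = REPLACE(REPLACE(MyField, CHAR(9), ' '), -- tab becomes a space
                      CHAR(124), '')                  -- pipe is removed
WHERE MyField LIKE '%' + CHAR(9) + '%'
   OR MyField LIKE '%' + CHAR(124) + '%';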
April 8, 2016 at 2:30 am
Unicode defines potentially 1,114,112 code points, so two bytes won't be enough for all of them!
What actually goes on is that the data is encoded as UTF-16. For code points up to 65535 the data is stored in 2 bytes. Above this you get 4 bytes per character using what are called surrogate pairs (two 2-byte sequences).
Just to make life more entertaining, to correctly store characters above the base plane (code points 65536 and above) you need to choose a collation which supports supplementary characters (the collation name contains SC).
This means an nchar(10) column might only store 5 characters (in the event they are all in the supplementary character range).
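A quick sketch of the 4-byte case, building one supplementary character (U+1F142) from its surrogate pair; the LEN result here assumes a non-SC database collation:

DECLARE @Supp NVARCHAR(10) = NCHAR(0xD83C) + NCHAR(0xDD42); -- one supplementary character
SELECT DATALENGTH(@Supp) AS ByteCount, -- returns 4: a single character taking four bytes
       LEN(@Supp)        AS CharCount; -- returns 2 under a non-SC collation (counts the pair)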
April 8, 2016 at 7:14 am
Nice article. One thing to be careful of when using LEN and DATALENGTH is that it matters whether the column was declared as CHAR or VARCHAR. Because CHAR pads the value with trailing spaces out to the declared length, it will return different numbers for LEN and DATALENGTH.
For instance -
DECLARE @v VARCHAR(10), @c CHAR(10);
SET @v = 'Donna';
SET @c = 'Donna';
SELECT '@v', LEN(@v), DATALENGTH(@v); -- returns 5 and 5
SELECT '@c', LEN(@c), DATALENGTH(@c); -- returns 5 and 10
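A related point worth noting: LEN ignores trailing spaces regardless of the declared type, while DATALENGTH always counts bytes, so the same asymmetry shows up with a padded VARCHAR literal:

SELECT LEN('Donna   ')        AS TrimmedLen, -- returns 5: trailing blanks excluded
       DATALENGTH('Donna   ') AS ByteCount;  -- returns 8: every byte counted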
April 9, 2016 at 4:28 pm
david woakes (4/8/2016)
Unicode defines potentially 1,114,112 code points, so two bytes won't be enough for all of them! What actually goes on is that the data is encoded as UTF-16. For code points up to 65535 the data is stored in 2 bytes. Above this you get 4 bytes per character using what are called surrogate pairs (two 2-byte sequences).
Just to make life more entertaining, to correctly store characters above the base plane (code points 65536 and above) you need to choose a collation which supports supplementary characters (the collation name contains SC).
This means an nchar(10) column might only store 5 characters (in the event they are all in the supplementary character range).
David, thanks for mentioning Supplementary Characters / Surrogate Pairs. There is only one correction to make related to the statements quoted above: a Supplementary Character-aware Collation is not necessary to store and retrieve Code Points above U+FFFF (i.e. values above 65535). All UTF-16 Surrogate Pairs are made up of completely valid UCS-2 Code Points (i.e. Code Points U+0000 through U+FFFF == values 0 - 65535 == the Basic Multilingual Plane, a.k.a. BMP). Those Code Points just happen to be unmapped individually and were reserved before UTF-16 existed in order to serve this purpose. Storing and retrieving Unicode data across the full UTF-16 spectrum (i.e. all 1.1 million Code Points, only 200k or so of which are mapped at this point) is not a matter of Collation, and hence works as far back as at least SQL Server 2005, if not even SQL Server 2000 or prior, though I have no way of testing 2000 / 7.0 / 6.5.
What the Supplementary Character-aware Collations (those ending in "_SC", introduced in SQL Server 2012) get you is the ability for SQL Server (in terms of the built-in functions as well as sorting) to properly interpret a Surrogate Pair as a single Code Point instead of two double-byte sequences. For example:
/*
Supplementary Character = 🅂
Code Point = U+1F142
Decimal Value = 127298
Surrogate Pair = D83C + DD42
Little Endian representation = 3CD8 + 42DD
*/
SELECT NCHAR(0x01F142), -- 🅂 or NULL if DB default Collation does not end in _SC
NCHAR(127298), -- 🅂 or NULL if DB default Collation does not end in _SC
NCHAR(0xD83C) + NCHAR(0xDD42), -- 🅂
CONVERT(NVARCHAR(50), 0x3CD842DD) -- 🅂
If you run the code above in a database that has a default Collation ending in "_SC" (again, these were introduced in SQL Server 2012), then all four fields will properly show the "🅂" character / glyph. But in databases having a default Collation not ending in "_SC", the first two fields will return NULL because they were only coded to handle the initial 0 - 65535 range, which is the UCS-2 specification. However, you can see that the second two fields will always display the correct character / glyph due to injecting the correct UTF-16 Little Endian sequence.
The following example shows how the LEN() built-in function returns a different value for the same four bytes between the "_SC" and non-"_SC" collated columns. (I included a "_BIN2" column in case there was any question of whether those properly interpret Surrogate Pairs / Supplementary Characters: the "_BIN2" Collations are more recent and were designed to properly sort Unicode data, whereas the deprecated "_BIN" Collations do not.)
CREATE TABLE #SupplementaryCharacterTest
(
NonScCollation NVARCHAR(10) COLLATE Latin1_General_100_CI_AS NOT NULL,
ScCollation NVARCHAR(10) COLLATE Latin1_General_100_CI_AS_SC NOT NULL,
BIN2Collation NVARCHAR(10) COLLATE Latin1_General_100_BIN2 NOT NULL
);
DECLARE @Glyph NVARCHAR(10) = NCHAR(0xD83C) + NCHAR(0xDD42);
INSERT INTO #SupplementaryCharacterTest (NonScCollation, ScCollation, BIN2Collation)
VALUES (@Glyph, @Glyph, @Glyph);
SELECT tmp.NonScCollation,
LEN(tmp.NonScCollation) AS [LenNonSC],
DATALENGTH(tmp.NonScCollation) AS [DataLengthNonSC],
'---' AS [---],
tmp.ScCollation,
LEN(tmp.ScCollation) AS [LenSC],
DATALENGTH(tmp.ScCollation) AS [DataLengthSC],
'---' AS [---],
tmp.BIN2Collation,
LEN(tmp.BIN2Collation) AS [LenBIN2],
DATALENGTH(tmp.BIN2Collation) AS [DataLengthBIN2]
FROM #SupplementaryCharacterTest tmp;
NonScCollation LenNonSC DataLengthNonSC --- ScCollation LenSC DataLengthSC --- BIN2Collation LenBIN2 DataLengthBIN2
🅂              2        4               --- 🅂           1     4            --- 🅂             2       4
SQL# — https://SQLsharp.com/ (SQLCLR library of over 340 Functions and Procedures)
Sql Quantum Lift — https://SqlQuantumLift.com/ (company)
Sql Quantum Leap — https://SqlQuantumLeap.com/ (blog)
Info sites — Collations • Module Signing • SQLCLR
April 11, 2016 at 6:54 am
We have lots of different input sources, and bad data just gets in. We don't have the luxury of rejecting the data, so we have to deal with it afterward. Thanks for this.
April 12, 2016 at 11:13 am
Great ideas in the post!
What I would really like to see, and still don't understand why it is not part of SQL Server, is full regular expression support. Regular expressions seem to be a natural fit for relational data, yet SQL Server still supports only a hobbled approximation (the bracketed character classes in LIKE and PATINDEX). Yes, it is possible to expose real regex through .NET and the CLR, but sadly where I work CLR is rarely if ever allowed.
I can hope!
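To illustrate the gap, a sketch of how far the built-in pattern matching stretches: character classes work, but with no quantifiers or alternation, even a simple phone-number pattern has to be spelled out position by position.

SELECT PATINDEX('%[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]%',
                'Call 555-1234 today') AS MatchPosition; -- returns 6, where the match starts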