August 29, 2014 at 7:32 am
Try this for the Tab and pipe
SELECT * FROM myTable WHERE MyField LIKE '%' + CHAR(9) + '%' OR MyField LIKE '%' + CHAR(124) + '%'
The number is the ASCII code for the character: 9 is the tab and 124 is the pipe.
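If you're ever unsure of a code, CHAR() and ASCII() are inverses of each other, so a quick sketch like this confirms the mapping (column aliases are just for illustration):

SELECT ASCII('|') AS PipeCode, -- returns 124
       CHAR(9)    AS TabChar;  -- returns the horizontal tab character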
September 3, 2014 at 9:45 pm
Katharine Mahon (8/29/2014)
How do you find an embedded tab or pipe in an address field? I ran into problems when exporting text files containing those special characters.
As begulley demonstrated above, it's pretty easy to find such rows of data. My question would be, what do you want to do about them? Delete them, replace them with spaces, use them as delimiters, or ???
--Jeff Moden
Change is inevitable... Change for the better is not.
September 4, 2014 at 6:18 am
We replace the characters with a blank or a space, depending on what they are. We remove them because they cause issues when exporting to Excel from Reporting Services.
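For anyone wanting to do the same, a minimal sketch of that cleanup, reusing the myTable / MyField names from the earlier example (test on a copy before running against live data):

UPDATE myTable
SET MyField = REPLACE(REPLACE(MyField, CHAR(9), ' '), -- tab becomes a space
                      CHAR(124), '')                  -- pipe is removed
WHERE MyField LIKE '%' + CHAR(9) + '%'
   OR MyField LIKE '%' + CHAR(124) + '%';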
April 8, 2016 at 2:30 am
Unicode defines potentially 1,114,112 code points, so two bytes won't be enough for all of them!
What actually goes on is that the data is encoded as UTF-16. For code points up to 65535 the data is stored in 2 bytes. Above this you get 4 bytes per character using what are called surrogate pairs (two 2-byte sequences).
Just to make life more entertaining, to correctly store characters above the base plane (code points 65536 and above) you need to choose a collation which supports supplementary characters (the collation name contains SC).
This means an nchar(10) column might only store 5 characters (in the event they are all in the supplementary character range).
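A quick sketch of the 4-byte case, building one supplementary character (U+1F142) from its surrogate pair; the LEN result here assumes a non-SC database collation:

DECLARE @Supp NVARCHAR(10) = NCHAR(0xD83C) + NCHAR(0xDD42); -- one supplementary character
SELECT DATALENGTH(@Supp) AS ByteCount, -- returns 4: a single character taking four bytes
       LEN(@Supp)        AS CharCount; -- returns 2 under a non-SC collation (counts the pair)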
April 8, 2016 at 7:14 am
Nice article. One thing to be careful of when using LEN and DATALENGTH is that it matters whether the column was declared as CHAR or VARCHAR. Because CHAR pads the value with trailing spaces out to the declared length, it will return different numbers for LEN and DATALENGTH.
For instance -
DECLARE @v VARCHAR(10), @c CHAR(10);
SET @v = 'Donna';
SET @c = 'Donna';
SELECT '@v', LEN(@v), DATALENGTH(@v); -- returns 5 and 5
SELECT '@c', LEN(@c), DATALENGTH(@c); -- returns 5 and 10
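A related point worth noting: LEN ignores trailing spaces regardless of the declared type, while DATALENGTH always counts bytes, so the same asymmetry shows up with a padded VARCHAR literal:

SELECT LEN('Donna   ')        AS TrimmedLen, -- returns 5: trailing blanks excluded
       DATALENGTH('Donna   ') AS ByteCount;  -- returns 8: every byte counted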
April 9, 2016 at 4:28 pm
david woakes (4/8/2016)
Unicode defines potentially 1,114,112 code points, so two bytes won't be enough for all of them! What actually goes on is that the data is encoded as UTF-16. For code points up to 65535 the data is stored in 2 bytes. Above this you get 4 bytes per character using what are called surrogate pairs (two 2-byte sequences).
Just to make life more entertaining, to correctly store characters above the base plane (code points 65536 and above) you need to choose a collation which supports supplementary characters (the collation name contains SC).
This means an nchar(10) column might only store 5 characters (in the event they are all in the supplementary character range).
David, thanks for mentioning Supplementary Characters / Surrogate Pairs. There is only one correction to make related to the statements quoted above: a Supplementary Character-aware Collation is not necessary to store and retrieve Code Points above U+FFFF (i.e. values above 65535). All UTF-16 Surrogate Pairs are made up of completely valid UCS-2 Code Points (i.e. Code Points U+0000 through U+FFFF == values 0 - 65535 == the Basic Multilingual Plane, a.k.a. BMP). Those Code Points just happen to be unmapped individually and were reserved before UTF-16 existed in order to serve this purpose. Storing and retrieving Unicode data across the full UTF-16 spectrum (i.e. all 1.1 million Code Points, only 200k or so of which are mapped at this point) is not a matter of Collation, and hence works as far back as at least SQL Server 2005, if not even SQL Server 2000 or prior, though I have no way of testing 2000 / 7.0 / 6.5.
What the Supplementary Character-aware Collations (those ending in "_SC", introduced in SQL Server 2012) get you is the ability for SQL Server (in terms of the built-in functions as well as sorting) to properly interpret a Surrogate Pair as a single Code Point instead of two double-byte sequences. For example:
/*
Supplementary Character = 🅂
Code Point = U+1F142
Decimal Value = 127298
Surrogate Pair = D83C + DD42
Little Endian representation = 3CD8 + 42DD
*/
SELECT NCHAR(0x01F142), -- 🅂 or NULL if DB default Collation does not end in _SC
NCHAR(127298), -- 🅂 or NULL if DB default Collation does not end in _SC
NCHAR(0xD83C) + NCHAR(0xDD42), -- 🅂
CONVERT(NVARCHAR(50), 0x3CD842DD) -- 🅂
If you run the code above in a database that has a default Collation ending in "_SC" (again, these were introduced in SQL Server 2012), then all four fields will properly show the "🅂" character / glyph. But in databases having a default Collation not ending in "_SC", the first two fields will return NULL because they were only coded to handle the initial 0 - 65535 range, which is the UCS-2 specification. However, you can see that the second two fields will always display the correct character / glyph due to injecting the correct UTF-16 Little Endian sequence.
The following example shows how the LEN() built-in function returns a different value for the same four bytes between the "_SC" and non-"_SC" collated columns. (I included a "_BIN2" column in case there was any question of whether those properly interpret Surrogate Pairs / Supplementary Characters: the "_BIN2" Collations are more recent and were designed to properly sort Unicode data, whereas the deprecated "_BIN" Collations do not.)
CREATE TABLE #SupplementaryCharacterTest
(
NonScCollation NVARCHAR(10) COLLATE Latin1_General_100_CI_AS NOT NULL,
ScCollation NVARCHAR(10) COLLATE Latin1_General_100_CI_AS_SC NOT NULL,
BIN2Collation NVARCHAR(10) COLLATE Latin1_General_100_BIN2 NOT NULL
);
DECLARE @Glyph NVARCHAR(10) = NCHAR(0xD83C) + NCHAR(0xDD42);
INSERT INTO #SupplementaryCharacterTest (NonScCollation, ScCollation, BIN2Collation)
VALUES (@Glyph, @Glyph, @Glyph);
SELECT tmp.NonScCollation,
LEN(tmp.NonScCollation) AS [LenNonSC],
DATALENGTH(tmp.NonScCollation) AS [DataLengthNonSC],
'---' AS [---],
tmp.ScCollation,
LEN(tmp.ScCollation) AS [LenSC],
DATALENGTH(tmp.ScCollation) AS [DataLengthSC],
'---' AS [---],
tmp.BIN2Collation,
LEN(tmp.BIN2Collation) AS [LenBIN2],
DATALENGTH(tmp.BIN2Collation) AS [DataLengthBIN2]
FROM #SupplementaryCharacterTest tmp;
NonScCollation LenNonSC DataLengthNonSC --- ScCollation LenSC DataLengthSC --- BIN2Collation LenBIN2 DataLengthBIN2
🅂              2        4               --- 🅂           1     4            --- 🅂             2       4
SQL# — https://SQLsharp.com/ (SQLCLR library of over 340 Functions and Procedures)
Sql Quantum Lift — https://SqlQuantumLift.com/ (company)
Sql Quantum Leap — https://SqlQuantumLeap.com/ (blog)
Info sites — Collations • Module Signing • SQLCLR
April 11, 2016 at 6:54 am
We have lots of different input sources, and bad data just gets in. We don't have the luxury of rejecting the data, so we have to deal with it afterward. Thanks for this.
April 12, 2016 at 11:13 am
Great ideas in the post!
What I would really like to see, and still don't understand why it is not part of SQL Server, is full regular expression support. Regular expressions seem to be a natural fit for relational data, yet SQL Server still supports only a hobbled approximation (the bracketed character classes in LIKE and PATINDEX). Yes, it is possible to expose real regex through .NET and the CLR, but sadly where I work CLR is rarely if ever allowed.
I can hope!
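To illustrate the gap, a sketch of how far the built-in pattern matching stretches: character classes work, but with no quantifiers or alternation, even a simple phone-number pattern has to be spelled out position by position.

SELECT PATINDEX('%[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]%',
                'Call 555-1234 today') AS MatchPosition; -- returns 6, where the match starts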