August 18, 2014 at 11:23 pm
Comments posted to this topic are about the item Finding special characters within character strings
August 19, 2014 at 4:48 am
You say you run this every morning? Why not just set up a constraint on the column to prevent the problem characters from getting in, in the first place?
August 19, 2014 at 6:54 am
Interesting summary article, worth saving for later use.
August 19, 2014 at 8:24 am
wes_springob (8/19/2014)
You say you run this every morning? Why not just set up a constraint on the column to prevent the problem characters from getting in, in the first place?
Heh... because if someone fails to mail me a check because of a typo in a bulk load, there will be hell to pay! 😀 For example, if I live at 5678 Main Street and someone accidentally had the SHIFT key depressed when they typed the "5" so that the address came out as %678 Main Street (just as an example), I'd want someone working on the correction rather than just rejecting the row of bulk data. Why not do those types of checks in the front end before it makes it to a bulk file? Heh... yeah... people always do those kinds of checks in the front end, don't they? 😛
It just doesn't hurt to do these types of checks in the database, especially where any form of keyboard input was ever involved. It would be nice if they did it in the front end, but there are no guarantees there. These types of checks might also be good for capturing examples of what people have been typing, so you can design new "validation" code for the front end or whatever source the data came from.
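For what it's worth, here's a minimal sketch of that kind of check; the table and column names are hypothetical, and the allowed character set is just an example of what "expected" might look like for a street address:
-- List addresses containing anything other than letters, digits, spaces,
-- and a few punctuation marks commonly seen in street addresses.
SELECT CustomerID, Address
FROM dbo.Customers
WHERE Address LIKE '%[^A-Za-z0-9 .,#-]%';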
--Jeff Moden
Change is inevitable... Change for the better is not.
August 19, 2014 at 8:53 am
What gets me most of the time are carriage returns, line feeds, and tabs.
We receive many different file formats from different clients, and while we try to standardize feeds as much as possible, these characters sometimes end up in the data I have to process.
Usually Excel is involved in some way... Analysts love Excel, but for ETL it makes me cringe.
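In case it helps, a minimal sketch of a check for those three characters, using a hypothetical staging table and column:
-- Flag rows where the imported value contains a tab, line feed,
-- or carriage return (CHAR(9), CHAR(10), CHAR(13)).
SELECT *
FROM dbo.StagingImport
WHERE ImportedValue LIKE '%[' + CHAR(9) + CHAR(10) + CHAR(13) + ']%';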
August 19, 2014 at 9:06 am
Beware the Unicode character FULLWIDTH APOSTROPHE ' (U+FF07). We had a problem with it once, because it looks like SQL Server will treat it like any other apostrophe, i.e. as a string constant delimiter or apostrophe escape. So if you're building dynamic SQL (which of course you aren't, and of course neither were we) and you're manually looking for characters to escape or otherwise handle before executing your SQL string, you may need to handle it too.
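One way to spot it before it bites (a sketch only; the table and column names are hypothetical, and the column is assumed to be nvarchar):
-- Find rows containing the fullwidth apostrophe (U+FF07) so they can be
-- reviewed before the value is ever concatenated into dynamic SQL.
-- The binary collation forces an exact code-point match.
SELECT *
FROM dbo.SomeTable
WHERE SomeColumn COLLATE Latin1_General_BIN LIKE N'%' + NCHAR(0xFF07) + N'%';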
August 19, 2014 at 9:06 am
If you own, or are a member of, the project that developed the system that would be great. Sadly, many times you are administering a system that you cannot change, at least in the short term, or have very little control over at that level.
August 19, 2014 at 11:17 am
Thanks for the article! One clarification:
Also notice that “Ça va” is missing. If the column is case sensitive, should that not have been captured? Again this is not a course on regular expressions but note: the regular expression does not see the accented characters as special. Meaning (é and e) or (Ä and A) are evaluated as the same, as far as the expression is concerned.
This is incorrect; they are not evaluated the same. When checking ranges in regular expressions, the collation setting is used. When you specify [A-Z], this includes every character alphabetically sorted between 'A' and 'Z' according to the collation. 'Ç' is sorted between 'C' and 'D' alphabetically in this collation, and so is included in [A-Z].
SELECT 'IT WORKS'
WHERE 'C' COLLATE SQL_Latin1_General_CP1_CI_AS LIKE ('%[C-C]%');
SELECT 'IT WORKS'
WHERE 'C' COLLATE SQL_Latin1_General_CP1_CI_AS LIKE ('%[B-C]%');
SELECT 'IT WORKS'
WHERE 'C' COLLATE SQL_Latin1_General_CP1_CI_AS LIKE ('%[C-D]%');
SELECT 'IT WORKS'
WHERE 'Ç' COLLATE SQL_Latin1_General_CP1_CI_AS LIKE ('%[C-C]%');
SELECT 'IT WORKS'
WHERE 'Ç' COLLATE SQL_Latin1_General_CP1_CI_AS LIKE ('%[B-C]%');
SELECT 'IT WORKS'
WHERE 'Ç' COLLATE SQL_Latin1_General_CP1_CI_AS LIKE ('%[C-D]%');
In 1-3, you can see that a plain 'C' matches all 3 ranges. However, in 4-6, the accented 'Ç' character only matches the range [C-D].
August 19, 2014 at 12:20 pm
Thanks for the post! I set up a search on all the data in my modest-sized database and found that LEN also ignores CRLF characters. I found I had a field that had been loaded with trailing blanks up to the last two characters, 0D 0A. The value from LEN varied, but DATALENGTH was always 65. Fixed by trimming with LEFT using the LEN value.
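For reference, a quick illustration (with a made-up value) of the documented difference: LEN ignores trailing blanks, while DATALENGTH reports the full stored length.
DECLARE @v varchar(20) = 'abc   ';
SELECT LEN(@v)        AS LenResult,        -- 3: trailing blanks are not counted
       DATALENGTH(@v) AS DataLengthResult; -- 6: every stored character is counted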
August 19, 2014 at 3:38 pm
I'm not sure why we don't check our data as many ways as possible. I have an environment that assumes everything is installed as expected, and a help desk busy with inquiries for when it isn't.
The key is to develop testing that is efficient to run.
August 20, 2014 at 3:45 am
NBSteve (8/19/2014)
This is incorrect, they are not evaluated the same. When checking ranges in regular expressions, the collation setting is used. When you specify [A-Z], this includes every character alphabetically sorted between 'A' and 'Z' according to the collation. 'Ç' is sorted between 'C' and 'D' alphabetically in this collation, and so is included in [A-Z].
This is also the reason why [0-9] is not a good choice for finding "numbers". Using a UK installation of SQL Server and the same collation (Latin1_General_CS_AS) as the article, I get 16 matches in the first 255 characters - it includes characters for one-quarter, one-half, three-quarters, superscript 1, superscript 2 and superscript 3! There are 249 matches in 16-bit character space.
I think using [0123456789] is safe, but do please tell me if I'm wrong.
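If you want to see the difference for yourself, here's a small sketch that lists the character codes from 1 to 255 which the [0-9] range matches but the explicit [0123456789] set does not (it uses sys.all_objects only as a convenient source of rows):
WITH Tally(N) AS (
    SELECT TOP (255) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM sys.all_objects
)
SELECT N AS CharCode, CHAR(N) AS CharValue
FROM Tally
WHERE CHAR(N) COLLATE Latin1_General_CS_AS LIKE '%[0-9]%'
  AND CHAR(N) COLLATE Latin1_General_CS_AS NOT LIKE '%[0123456789]%'
ORDER BY N;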
August 25, 2014 at 1:27 pm
When looking for a range, is this according to the ASCII table? For example:
where FixSubject like '%[' + Char(1) + '-' + Char(11) + ']%'
Does this find everything in the ASCII table between (start of heading) and (vertical tab), or is it indeterminate what it will return?
Also, would the performance be better using individual LIKEs or a range?
August 25, 2014 at 1:56 pm
begulley (8/25/2014)
When looking for a range, is this according to the ASCII table? For example: where FixSubject like '%[' + Char(1) + '-' + Char(11) + ']%'
Does this find everything in the ASCII table between (start of heading) and (vertical tab), or is it indeterminate what it will return?
Also, would the performance be better using individual LIKEs or a range?
It's based on collation sorting, which is likely to be related to the ASCII table, but NOT the same. The best bet is to do a simple test:
DECLARE @startchar int = 1, @endchar int = 11;

WITH
-- ten rows
TALLY10(N) AS (
    SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL
    SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL
    SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9 UNION ALL SELECT 0
),
-- 100 rows (10 x 10)
TALLY100(N) AS (SELECT 1 FROM TALLY10 a, TALLY10 b),
-- 10,000 numbered rows
TALLY10K(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM TALLY100 a, TALLY100 b)
SELECT N AS ASCIICODE, CHAR(N) AS CHARVALUE
FROM TALLY10K
WHERE CHAR(N) LIKE '[' + CHAR(@startchar) + '-' + CHAR(@endchar) + ']'
ORDER BY N;
This just uses an inline tally table to return a list of all the ASCII codes and char values that match the range between @startchar and @endchar. On my system, it does appear to work as you expect for CHAR(1)-CHAR(11), but I don't know if this is true for all collations.
Note that 1-47 seems like it might be a one-to-one match, but 1-48 will suddenly return 132 "matching" rows, as many punctuation marks with higher ASCII values sort before zero (CHAR(48)) in my collation.
August 29, 2014 at 6:35 am
How do you find an embedded tab or pipe in an address field? I ran into some problems when exporting text files containing those special characters.
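One approach (a sketch only; the table and column names are made up) is a LIKE character class containing the tab character and the pipe:
-- Find addresses containing an embedded tab (CHAR(9)) or pipe (|).
SELECT *
FROM dbo.Addresses
WHERE AddressLine LIKE '%[' + CHAR(9) + '|]%';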