Remove weird character from one of my SQL tables

  • Hello,

I have an SSIS package that loads data into a table with all VARCHAR(MAX) datatypes. Let's call this table RAW. Once the RAW table is loaded, I transform/convert the data and push it into a production table. After several grueling days of the package failing, I dug through the data in the RAW table and found this (�) character. Once I manually removed that row, everything worked perfectly.

I tried using the REPLACE() function to remove this character, but it did not work. From some research on Google, this is apparently an escape character used in PDF files.

Does anyone know how to remove characters from a SQL table using a script, based on the character's code value? My plan is to load the RAW table and then use an Execute SQL Task to remove all instances of this (�). I am just not sure how to write that in the SQL script.

    Any help would be greatly appreciated.

    STG

There are no problems, only solutions. --John Lennon

You should be able to REPLACE it as NCHAR(65533), the Unicode replacement character (U+FFFD).

    Something like this:

DECLARE @UnicodeString NCHAR(100);

SET @UnicodeString = NCHAR(65533) + N'Hey, what''s that at the beginning?';

SELECT @UnicodeString;
SELECT REPLACE(@UnicodeString, NCHAR(65533), N'');

    Cheers!

  • Awesome! Thank you very much for making my Friday :-D:-D

There are no problems, only solutions. --John Lennon

Ok, so here is what I am trying to do, but it is not working. Any ideas?

UPDATE TABLE
SET DS = REPLACE(DS, NCHAR(65533), '');

There are no problems, only solutions. --John Lennon

SQLTougherGuy (9/11/2015)

Any ideas?

    Perhaps NCHAR(65533) is not actually the offending character?

    At the least I would suggest doing

UPDATE TABLE
SET DS = REPLACE(DS, NCHAR(65533), '')
WHERE DS LIKE '%' + NCHAR(65533) + '%';

    to reduce the number of rows updated / time / log space used, and perhaps add

PRINT CONVERT(varchar(20), @@ROWCOUNT) + ' NCHAR(65533) rows fixed';

When we import via staging tables like this, we add two columns: HasError (bit) and ErrorMessage (varchar(MAX)).

We run updates to "flag" any rows that fail our validation rules, adding new rules over time as new data-validation problems are found:

UPDATE TABLE
SET HasError = 1, ErrorMessage = ISNULL(ErrorMessage + '; ', '') + 'Invalid character'
WHERE DS LIKE '%[^' + CHAR(32) + '-' + CHAR(126) + ']%' -- any character outside Space to Tilde

PRINT CONVERT(varchar(20), @@ROWCOUNT) + ' rows have invalid characters';

We would likely have other updates that check that text dates are valid, customer IDs can be found, all those sorts of things.

Then we either import only the rows that don't have HasError = 1, or we abort the whole import if ANY row has HasError set.

We report ErrorMessage (and the PKeys) to the end user for any rows with HasError = 1. A minimal sketch of the whole pattern is below.
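Something along these lines; treat it as a sketch, since #Staging and Production are placeholder names, not the OP's schema:

CREATE TABLE #Staging
(
    Id           int IDENTITY(1,1) PRIMARY KEY,
    DS           nvarchar(MAX) NULL,
    HasError     bit NOT NULL DEFAULT 0,
    ErrorMessage varchar(MAX) NULL
);

-- Rule 1: flag rows containing anything outside Space..Tilde
UPDATE #Staging
SET HasError = 1,
    ErrorMessage = ISNULL(ErrorMessage + '; ', '') + 'Invalid character'
WHERE DS LIKE '%[^' + CHAR(32) + '-' + CHAR(126) + ']%' COLLATE Latin1_General_BIN;

PRINT CONVERT(varchar(20), @@ROWCOUNT) + ' rows have invalid characters';

-- Import only the clean rows (or abort the whole load if any row was flagged)
INSERT INTO Production (DS) -- "Production" is a placeholder name
SELECT DS
FROM #Staging
WHERE HasError = 0;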

  • Just to confirm, it's running without errors but not actually replacing any characters?

    Might be something like this: https://connect.microsoft.com/SQLServer/feedback/details/385082/replace-function-does-not-work-with-valid-unicode-characters-above-65500

    Try using the explicit binary collation as in that workaround, i.e.:

UPDATE TABLE
SET DS = REPLACE(DS COLLATE Latin1_General_BIN, NCHAR(65533), '');
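For what it's worth, here is a quick self-contained way to see the difference; nothing in it depends on your table, and the first SELECT may or may not strip the character depending on your database's default collation:

DECLARE @s nvarchar(100) = N'Bad' + NCHAR(65533) + N'data';

-- May not match under some non-binary collations (see the Connect item above)
SELECT REPLACE(@s, NCHAR(65533), N'');

-- A binary collation compares by code point, so the match is exact
SELECT REPLACE(@s COLLATE Latin1_General_BIN, NCHAR(65533), N'');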

    If that doesn't work, then offhand I'm not too sure about other possibilities. I'll have to see if I can get some time to tinker around and reproduce your situation.

    Cheers!

  • You might have characters that are not NCHAR(65533).

You need to identify the characters that don't belong in your dataset.

CREATE TABLE #CharsTest(
    nstring nvarchar(100));

INSERT INTO #CharsTest
VALUES
    (N'This has a unicode char ' + NCHAR(65533)), -- odd character lands at position 25
    (N'This has a unicode char ' + NCHAR(65533) + N' as well'),
    (N'This has ascii-only chars');

-- Casting to varchar mangles any non-ASCII character, so rows that change
-- under the cast are the rows holding unicode characters.
SELECT nstring, CAST(nstring AS varchar(100)), UNICODE(SUBSTRING(nstring, 25, 1))
FROM #CharsTest
WHERE nstring <> CAST(nstring AS varchar(100));
GO

DROP TABLE #CharsTest
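With the sample rows above, this returns the two unicode rows, with 65533 reported as the code point of the character at position 25.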

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
This can help you find unicode characters that you might not want. It returns the string and the offending characters; just change the test table to your table and the appropriate columns.

CREATE TABLE #CharsTest(
    nstring nvarchar(100));

INSERT INTO #CharsTest
VALUES
    (N'This has a unicode char ' + NCHAR(65533)),
    (N'This also has a unicode char ' + NCHAR(65533) + N' '),
    (N'This has ascii-only chars');

--Query starts here:
WITH
E(n) AS(
    SELECT n FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0))E(n) -- 10 rows
),
E2(n) AS(
    SELECT a.n FROM E a, E b -- 100 rows
),
E4(n) AS(
    SELECT a.n FROM E2 a, E2 b -- 10,000 rows
),
cteTally(n) AS(
    SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) n
    FROM E4
)
SELECT nstring, SUBSTRING(nstring, n, 1), UNICODE(SUBSTRING(nstring, n, 1))
FROM #CharsTest
JOIN cteTally ON LEN(nstring) >= n -- one row per character position
WHERE SUBSTRING(nstring, n, 1) <> CAST(SUBSTRING(nstring, n, 1) AS varchar(100));
GO

DROP TABLE #CharsTest
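Once you know the actual code point(s), you can plug them into the earlier REPLACE, along these lines (an untested sketch; RAW and DS are the staging table and column named up-thread):

UPDATE RAW
SET DS = REPLACE(DS COLLATE Latin1_General_BIN, NCHAR(65533), '')
WHERE DS LIKE '%' + NCHAR(65533) + '%' COLLATE Latin1_General_BIN;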

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
Yup!!! I was missing the COLLATE.

UPDATE TABLE
SET DS = REPLACE(DS, NCHAR(65533) COLLATE Latin1_General_BIN2, '!');

    Thank you all very much! I love this site :-):-)

There are no problems, only solutions. --John Lennon

SQLTougherGuy (9/10/2015)

loads data into a table with all VARCHAR(MAX)

That may not sit very well with NCHAR(65533) ... I'm not sure, but it might be better to change your staging table to NVARCHAR(MAX). Perhaps the binary collation is fixing the comparison, but the lack of NVARCHAR(MAX) may be hiding some other problem. A quick illustration of what VARCHAR does to that character is below.
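A quick sketch (the exact result depends on the database's default code page):

-- NCHAR(65533) cannot survive in a varchar column or variable: the implicit
-- Unicode-to-ANSI conversion typically turns it into a plain '?'
DECLARE @v varchar(10) = NCHAR(65533);
SELECT @v, UNICODE(CAST(@v AS nvarchar(10))); -- usually '?' and 63, not 65533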
