Find foreign alphabet characters in a column

  • I need to find all foreign alphabet characters in a column. It might Russian, Chinese, ect.

  • TJT (1/21/2016)


    I need to find all foreign alphabet characters in a column. It might Russian, Chinese, ect.

    This isn't very clear what you want here. Are you wanting to write a query that will return only certain characters in a string or just trying to determine if the string has certain characters? What would be the rule for determining if it is a "foreign" language? The ASCII value perhaps? If so, what is the threshold where it would be considered one of the characters you are looking for?

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • TJT (1/21/2016)


    I need to find all foreign alphabet characters in a column. It might Russian, Chinese, ect.

    What's your question?

    As we do not know where you are from, we do not know what 'foreign' means to you.

    Please provide some sample data along with your desired results.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Do you want each character? Or the rows that contain the foreign characters?

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • I want to find any non latin chartacters: cyriliic, chineese, Japanese, ect

  • TJT (1/21/2016)


    I want to find any non latin chartacters: cyriliic, chineese, Japanese, ect

    Identify them or simply remove them?

    😎

  • Just identify them

  • TJT (1/21/2016)


    Just identify them

    Meaning you want them returned? Do you need any other information? This would be a LOT simpler if you would tell us what you want instead of us trying to extract it from you.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • When you say "foreign", I assume you mean identifying any characters not belonging to the Latin General language. Maybe I'm way off base, but as a first stab, experiment to see if comparing the column against a version of itself explicitly casted to SQL_Latin1_General_CP1_CI_AS will at least result in identifying those values containing non Latin General characters.

    select LastName

    from Employee

    where LastName != (cast (LastName as nvarchar(100))

    collate SQL_Latin1_General_CP1_CI_AS);

    https://msdn.microsoft.com/en-us/library/ms143726(v=sql.110).aspx

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Yes I mean identifying any characters not belonging to the Latin General language.

    Your query picks up some, but still misses others.

  • TJT (1/21/2016)


    Yes I mean identifying any characters not belonging to the Latin General language.

    Your query picks up some, but still misses others.

    This is my last attempt at trying. Can you please provide us some sample data and expected output? Unless you provide the details to us we are all guessing.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • TJT (1/21/2016)


    Yes I mean identifying any characters not belonging to the Latin General language.

    Your query picks up some, but still misses others.

    Try to imagine yourself reading this thread for the first time.

    Would you consider 'still misses others' to be sufficiently detailed for people to know what to look at? Provide examples.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • TJT (1/21/2016)


    Just identify them

    Here is how

    😎

    USE tempdb;

    GO

    SET NOCOUNT ON;

    DECLARE @TEST_STRING NVARCHAR(4000) = N'ABCDEFG?Ô?1234567Ò???HIJKLM]? ?NOP??60987654321';

    ;WITH T(N) AS (SELECT N FROM ( VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS X(N))

    , NUMS(N) AS (SELECT TOP(LEN(@TEST_STRING)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N

    FROM T T1,T T2,T T3,T T4,T T5,T T6,T T7,T T8,T T9)

    SELECT

    NM.N AS POS

    ,SUBSTRING(@TEST_STRING,NM.N,1) AS EXTENDED_CHAR

    FROM NUMS NM

    WHERE UNICODE(SUBSTRING(@TEST_STRING,NM.N,1)) > 255;

    Output

    POS EXTENDED_CHAR

    ---- -------------

    8 ?

    10 ?

    19 ?

    21 ?

    29 ?

    31 ?

    36 ?

  • Expanding on what Eirikur put together, here's how you could do it against a table (note my comments).

    -- Sample data

    DECLARE @table TABLE(stringID int identity primary key, stringTxt NVARCHAR(4000));

    INSERT @table(stringTxt) VALUES(N'abc'),(N'123?'),('xyz'),(N'xxx??');

    -- to get the string with the location of the bad character

    WITH T(N) AS (SELECT N FROM ( VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS X(N))

    SELECT StringID, Location = N, BadChar = SUBSTRING(t.stringTxt,CAST(NM.N AS int),1)

    FROM @table t

    CROSS APPLY

    (

    SELECT TOP(CAST(LEN(t.stringTxt) AS bigint)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

    FROM T T1,T T2,T T3,T T4

    ) NM(N)

    WHERE UNICODE(SUBSTRING(t.stringTxt,CAST(NM.N AS int),1)) > 255;

    -- to simply get the a list of strings that contain the bad character

    WITH T(N) AS (SELECT N FROM ( VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS X(N))

    SELECT StringID

    FROM @table t

    CROSS APPLY

    (

    SELECT TOP(CAST(LEN(t.stringTxt) AS bigint)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

    FROM T T1,T T2,T T3,T T4

    ) NM(N)

    WHERE UNICODE(SUBSTRING(t.stringTxt,CAST(NM.N AS int),1)) > 255

    GROUP BY stringID;

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • I tried

    SELECT DISTINCT PGNAME

    FROM INSTALLS

    WHERE PGNAME != (CAST (PGNAME as NVARCHAR(100)) collate SQL_Latin1_General_CP1_CI_AS)

    This returned rows like:

    ???????? ????? ??? ????? ?????????? ?????? ???????

    Thermo Fisher Scientific Inc.

    But it failed to find rows like:

    ???????

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply