pattern matching for unicode characters!

  • Hey friends,

    I need to use pattern matching to suggest similar words when a user does a search on the site. i use an algorithm called Jaro Winkler which works like a dream when using english.

    But while using languages like Chinese and Japanese, i am stuck. the algorithm doesnt work .

    cant use wildcard function LIKE coz i'm looking for a similar words not starting or ending with etc.

    Does anyone know any way to use pattern matching in SQL Server for languages like chinese and japanese?

    any help really appreciated.

    Regards

  • What is the defined collation on the column you're searching?

    Can you please provide the table definition, some sample data and any queries you have tried so far?

    If you have doubts about what I asking for kindly read this article: http://www.sqlservercentral.com/articles/Best+Practices/61537/[/url]

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Hey bud

    It is one column in which i store the values for all languages. need to search through this column to find similar words after the user types a word. of course there is another column which assigns a languageID to each entry. like so :

    languageword

    x AC??

    x ???????

    x ???

    x ???

    x ????

    x ?????

    x ???

    x ????

    x ??

    i realize now its not possible to do a string matching like i do for english using the algorithm, coz the asian languages arent exactly made up of alphabets.

    but i would like to get a suggest list working at least, but when i try using LIKE i dont get the desired result.

    so

    select *

    from mytable

    where word like '??%'

    should return the word '???' but doesnt??

    the coalition of the column is 'SQL_Latin1_General_CP1_CI_AS'

    must be the default one for nvarchar in our install. i store all languages in that column so not sure if changing that is an option.

    thanks for ur help mate,

    Regards

  • When working with Unicode columns in SQL Server you have to prefix your literals with a capital N to let SQL Server know your literal is also Unicode.

    Try it like this:

    select *

    from mytable

    where word like N'??%'

    Here is a complete code sample that worked for me:

    USE tempdb

    GO

    IF EXISTS ( SELECT *

    FROM sys.objects

    WHERE object_id = OBJECT_ID(N'dbo.notes')

    AND type IN (N'U') )

    DROP TABLE dbo.notes ;

    GO

    CREATE TABLE dbo.notes

    (

    id INT NOT NULL

    IDENTITY(1, 1)

    PRIMARY KEY,

    language_id CHAR(1) COLLATE SQL_Latin1_General_CP1_CI_AS

    NOT NULL,

    note NVARCHAR(MAX) COLLATE SQL_Latin1_General_CP1_CI_AS

    NOT NULL

    ) ;

    GO

    INSERT INTO dbo.notes

    (language_id, note)

    VALUES ('x', N'AC??'),

    ('x', N'???????'),

    ('x', N'???'),

    ('x', N'???'),

    ('x', N'????'),

    ('x', N'?????'),

    ('x', N'???'),

    ('x', N'????'),

    ('x', N'??') ;

    SELECT *

    FROM dbo.notes

    WHERE note LIKE N'??%'

    GO

    IF EXISTS ( SELECT *

    FROM sys.objects

    WHERE object_id = OBJECT_ID(N'dbo.notes')

    AND type IN (N'U') )

    DROP TABLE dbo.notes ;

    GO

    By the way:

    > the N stands for National and tells SQL Server to treat the literal as containing "Unicode" characters

    > also, Microsoft's use of "Unicode" to describe characters is ambiguous. The "Unicode" implementation in SQL Server is actually UCS-2, which is closely related to UTF-16LE. Unicode may have gotten more votes than UCS-2 when it came to choosing terminology for their marketing materials 😀

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • ah yes works like that, silly me.

    thanks for that. i guess there is no possibility for pattern matching, things like suggesting miss spelled words etc. but the suggest list will be a start i guess.

    thanks man

  • Jin Kazama (5/18/2011)


    ah yes works like that, silly me.

    thanks for that. i guess there is no possibility for pattern matching, things like suggesting miss spelled words etc. but the suggest list will be a start i guess.

    thanks man

    You're welcome, happy to be of service!

    PS I know very little about character-based languages like Chinese but from some simple internet searching it appears others have had your same wishes. Given that it's not a "spelling language" however it appears it's not easy to do. If and when you settle on a solution I would be interested to hear about it, and maybe other search engines would too, so if you have a moment please post back.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply