Script identification

  • I have a need to identify the script that an nvarchar field is in. Not the language!

    Does anyone have any suggestions/ideas/samples/links they can recommend.

    TIA,

    Darrell

  • You mean something like this?

    SELECT OBJECT_NAME(id) FROM dbo.syscomments WHERE text LIKE '%ColumnName%'

  • Unfortunately no,

    I mean something like this:

    Say I have a field called Greeting and its value is Hello.

    I need to be able to pass the text to some kind of function or routine that would identify the script that the word is in, like below.

    Hello = Latin (meaning that the whole word is in Latin script)

    I am NOT looking forward to this one!!! 🙂

  • I'm sorry but I still don't get it... can you show some sample data/results?

    > Say I have a field called Greeting and its value is Hello.
    >
    > I need to be able to pass the text to some kind of function or routine that would identify the script that the word is in, like below.
    >
    > Hello = Latin (meaning that the whole word is in Latin script)


    DSP,

    I don't understand the question either.

    By "script" do you mean the font or some similar attribute of the text itself?

    Would the collation provide you what you need?

     

  • Yes I do mean more along the lines of the attributes of the text itself.

    For example, this is the name of a town in Cyprus: "???a ?a???a ?e???ed????" (unfortunately the copy and paste lost the characters).

    I happen to know that this is in the Greek language using the Greek script, but in most cases I don't have anything indicating the script. We are using a BIN2 collation (don't ask!), so I can't depend on that. What I need is to pass in the town name and get a result telling me the script.
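    Conceptually, something along these lines: a minimal sketch that classifies each letter by Unicode code-point block (Python purely for illustration; in SQL Server this would presumably end up as a CLR function or application-side code, and the range list here is deliberately incomplete):

```python
# Assumed, partial mapping of Unicode blocks to script names;
# the ranges come from the Unicode standard, but this list is
# illustrative, not exhaustive.
SCRIPT_RANGES = [
    (0x0041, 0x024F, "Latin"),      # Basic Latin through Latin Extended-B
    (0x0370, 0x03FF, "Greek"),      # Greek and Coptic
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0590, 0x05FF, "Hebrew"),
    (0x0600, 0x06FF, "Arabic"),
]

def script_of(text):
    """Name the script if every letter falls in a single script's range."""
    scripts = set()
    for ch in text:
        if not ch.isalpha():
            continue  # ignore digits, spaces, punctuation
        for lo, hi, name in SCRIPT_RANGES:
            if lo <= ord(ch) <= hi:
                scripts.add(name)
                break
        else:
            scripts.add("Unknown")
    if len(scripts) == 1:
        return scripts.pop()
    return "Mixed" if scripts else "None"
```

    So passing in "Hello" would come back as "Latin" and a Greek town name as "Greek" -- though telling apart languages that share a script (Bulgarian vs. Russian Cyrillic, say) needs far more than code-point ranges.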

    Hope this clears up the confusion a little

    Thanks again,

    Darrell

  • My first (hopefully not best) thought would be to use regular expressions, PATINDEX, and/or CHARINDEX to search the string for characters or combinations of characters unique to the language.  For example, only Spanish uses the ll or the n characters (oops, the translation was lost here too).

    You could hardcode the values or create a table that has the character (or pattern) in one field and the script/language in a second field. If the text contains the ll, a lookup returns 'Spanish'; if an o with an umlaut appears, an entry in the table links the character to German (maybe another record with Dutch). After that it's up to the code to determine the language based on the matching record(s).

    A slightly better approach would be to build an array or table of all the characters in the string and use an inner join to the 'character-to-language' table.
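    A rough illustration of that lookup idea (Python standing in for the table and the join; the characters and language mappings below are just examples, not a complete or authoritative list):

```python
# Hypothetical character-to-language table, as described above.
# One character can legitimately map to several languages.
CHAR_TO_LANGS = {
    "ñ": ["Spanish"],
    "ß": ["German"],
    "ö": ["German", "Dutch"],
    "ç": ["French", "Portuguese"],
}

def candidate_languages(text):
    """Collect the languages linked to any distinctive character in the text."""
    langs = set()
    for ch in set(text.lower()):          # distinct characters, like the array/table idea
        langs.update(CHAR_TO_LANGS.get(ch, []))
    return sorted(langs)
```

    As noted, "San Diego" matches nothing here and comes back empty -- exactly the shortcoming described below.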

    Of course the shortcoming here is if the text does not contain a character or pattern of characters unique to a language, this approach wouldn't tell you much.  San Diego could be city in the US, Mexico, Spain or most of South America.

    I am assuming your values are not limited to city names. 

    I hope someone comes up with something better.  This problem intrigues me.

  • Can't wait to see the solution to this one .

  • This may sound silly, but first I think you need to define your meaning of "script".

    Is "American English" a script, distinct from the "Italian" script? Native Italian script lacks some letters that American English script has, but they are used anyway to represent foreign words. American English script lacks some accented letters in the Italian script.

    An analogous question can be asked for Russian Cyrillic script and Bulgarian Cyrillic -- do you consider these two scripts, or do you consider some sort of superset as one general Cyrillic script?

    If I recall correctly, Korean Hangul is in Unicode both in precomposed syllable form and in decomposed (Jamo) form -- do you consider these as two different scripts?

    What do you do about the unified CJK area? It contains both traditional and simplified Chinese characters, which are of course used in traditional Chinese, in simplified Chinese, in Korean, and in Japanese. How many scripts are you considering here as distinct?

    Also, I recall that there are some ancient Greek letters in Unicode, which are not -- at least, I wouldn't consider them -- part of the modern Greek alphabet. Are they part of your concept of "Greek script" ?

  • Perry,

    you have pretty much nailed it on the head. There are (in several cases at least) multiple "derived" scripts, and I do need to be able to determine the particular "dialect" of a script. There is also the consideration that, while a script may be valid, its use may be almost extinct/non-existent.

    So, yes, I do need to be able to tell if the Cyrillic word is Bulgarian, Russian, Romanian and so on.

    Ya gotta love globalization as a DBA!!!! 🙂

  • > So, yes, I do need to be able to tell if the Cyrillic word is Bulgarian, Russian, Romanian and so on.

    Ah, wait, this sounds different from (how I read) your original post. I probably didn't realize the ramifications at first. This sounds like you want to recognize words. This is going to clearly require dictionaries, yes? Just start at the top and look up each word in a lot of dictionaries, and set some limits and thresholds, and go until you pass them (or give up).

    Perhaps: check languages until you find the target (or give up after 500 words).

    - Rules (I just made up some):

    ** Find at least 10 words which occur in target language

    ** Find at least 5 more words in target language than in any other

    ** (#words found in target)/(total #words checked) >= 95%

    Obviously that last rule is going to require that all relevant kanji be counted as Japanese (as well as being counted as traditional Chinese if appropriate, simplified Chinese if appropriate, and hanja if appropriate).

    That way, a nice mix of kanji and hiragana should quickly accomplish target=Japanese winning (because the hiragana words won't match anything else but Japanese).

    Issues:

    1) you need dictionaries for all possible target languages

    2) I haven't touched the subject of text encoding. Ugh.

    It would be easier if you could first identify the text encoding and then apply the dictionary/target-language matching algorithm above, but in practice the two may need to be interrelated, as you may not be able to determine the text encoding without being able to interpret the text, at least enough to distinguish valid from invalid sequences.
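    A sketch of the made-up rules above (the word sets are tiny, hypothetical stand-ins for real per-language dictionaries, and the thresholds are just the numbers from the rules):

```python
# Hypothetical per-language dictionaries; a real system would need full
# word lists, and a word may appear in several languages' sets (which is
# how shared kanji would be counted toward each relevant language).
DICTIONARIES = {
    "English": {"the", "hello", "town", "and", "is"},
    "Spanish": {"hola", "pueblo", "y", "es", "la"},
}

def detect_language(words, min_hits=10, min_margin=5, min_ratio=0.95):
    """Pick a language only if it clears all three rules; otherwise give up."""
    counts = {lang: sum(w in vocab for w in words)
              for lang, vocab in DICTIONARIES.items()}
    best = max(counts, key=counts.get)
    runner_up = max((v for lang, v in counts.items() if lang != best), default=0)
    hits = counts[best]
    if (hits >= min_hits                                   # rule 1: enough matches
            and hits - runner_up >= min_margin             # rule 2: clear margin
            and hits / max(len(words), 1) >= min_ratio):   # rule 3: high hit ratio
        return best
    return None
```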

    (Looks like this forum loses whitespace, so I tried adding some symbols to help delimit my rules list.)

  • I'm probably stating the obvious, but some short phrases or words are going to be necessarily ambiguous -- that is, you can't really determine a language for such a small sample. To give some tiny & pretty obvious examples:

    xerox

    no

    ??

    2005
