Recursive replace Japanese to English characters

  • Hi there,

    I have a table with Employeename, companyname, companyaddress all in Japanese in a table called Japanese employees that needs transiliteration.

    They are all nvarchar(500).

    Create Table JapaneseEmployees(EmpName nvarchar(500), CompanyName nvarchar(500), CompanyAddress nvarchar(500)).

    Table has about 6 million records. I was told to transliterate this table to English.

    insert into Japaneseemployees(N'エイミー',N'マイクロソフト',N'1メインロード、東京');

    insert into Japaneseemployees(N'佐古',N'トヨタ',N'470サイドセント、東京');

    I created a table to store Japanesechar & replacechar in this table

    CREATE TABLE conversiondict (

    ID int identity(1,1),

    jpchar NVARCHAR(50) ,

    engchar NVARCHAR(50)

    )

    INSERT INTO conversiondict VALUES(N'あ', 'a')

    INSERT INTO conversiondict VALUES(N'い', 'i')

    INSERT INTO conversiondict VALUES(N'う', 'u')

    INSERT INTO conversiondict VALUES(N'え', 'e')

    INSERT INTO conversiondict VALUES(N'お', 'o')

    INSERT INTO conversiondict VALUES(N'か', 'ka')

    INSERT INTO conversiondict VALUES(N'き', 'ki')

    INSERT INTO conversiondict VALUES(N'く', 'ku')

    INSERT INTO conversiondict VALUES(N'け', 'ke')

    --this is only a apart of inserts..I have many more

    I do not know how to start writing recursive replace. Can anyone give me hints?

  • Try this. I changed some of your test data a bit but based on the test data I created it seems to work. Note...I only tested this on a 2 row test table so not sure of any performance implications so I'd recommend testing on a larger data set on your end before implementing.

    Create Table dbo.JapaneseEmployees

    (

    --Added an identity column here b/c this will make it much easier to work with the records once we split them out

    EmployeeID int identity(1,1)

    ,EmpName nvarchar(500)

    ,CompanyName nvarchar(500)

    ,CompanyAddress nvarchar(500)

    )

    --Insert test records...modified these inserts to make sure the characters are in the dbo.conversiondict table --below:

    insert into Japaneseemployees values (N'?????'

    ,N'??????????'

    ,N'???');

    insert into Japaneseemployees values (N'?????'

    ,N'???????'

    ,N'???');

    --Create the conversion table:

    CREATE TABLE dbo.conversiondict (

    ID int identity(1,1),

    jpchar NVARCHAR(50) ,

    engchar NVARCHAR(50)

    )

    --Populate it:

    INSERT INTO dbo.conversiondict VALUES(N'?', 'a')

    INSERT INTO dbo.conversiondict VALUES(N'?', 'i')

    INSERT INTO dbo.conversiondict VALUES(N'?', 'u')

    INSERT INTO dbo.conversiondict VALUES(N'?', 'e')

    INSERT INTO dbo.conversiondict VALUES(N'?', 'o')

    INSERT INTO dbo.conversiondict VALUES(N'?', 'ka')

    INSERT INTO dbo.conversiondict VALUES(N'?', 'ki')

    INSERT INTO dbo.conversiondict VALUES(N'?', 'ku')

    INSERT INTO dbo.conversiondict VALUES(N'?', 'ke')

    /*Now we want to recursively split our records apart so we can get each Japanese character in the string on its own row

    so we can then join back to the conversion table to find the English character equivalent

    */

    --JapanEmp CTE is a recursive CTE to split out all the characters of the string into their own row

    ;with JapanEmp as

    (

    select EmployeeID,

    substring(EmpName, 1, 1) as EmpNameChars,

    stuff(EmpName, 1, 1, '') as EmpNameData,

    substring(CompanyName, 1, 1) as CompanyNameChars,

    stuff(CompanyName, 1, 1, '') as CompanyNameData,

    substring(CompanyAddress, 1, 1) as CompanyAddressChars,

    stuff(CompanyAddress, 1, 1, '') as CompanyAddressData,

    1 as RowID

    from Japaneseemployees

    union all

    select EmployeeID,

    substring(EmpNameData, 1, 1) as EmpNameChars,

    stuff(EmpNameData, 1, 1, '') as EmpNameData,

    substring(CompanyNameData, 1, 1) as CompanyNameChars,

    stuff(CompanyNameData, 1, 1, '') as CompanyNameData,

    substring(CompanyAddressData, 1, 1) as CompanyAddressChars,

    stuff(CompanyAddressData, 1, 1, '') as CompanyAddressData,

    RowID + 1 as RowID

    from JapanEmp

    where

    (

    len(EmpNameData) > 0

    or(len(CompanyNameData) > 0)

    or(len(CompanyAddressData) >0)

    )

    )

    --Start the actual converting here...join back to the convserion table to fetch the english characters

    ,Conversion as

    (

    select

    JE.EmployeeID

    ,JE.RowID

    ,EmpNameConversion.EngChar as EmpNameEngChar

    ,CompanyNameConversion.EngChar as CompanyNameEngChar

    ,CompanyAddressConversion.EngChar as CompanyAddressEngChar

    from JapanEmp JE

    --Fetch the engchar conversions from our conversion table

    --Employee Name

    outer apply (select engchar from dbo.conversiondict cd where JE.EmpNameChars = cd.jpchar) EmpNameConversion

    --Company Name

    outer apply (select engchar from dbo.conversiondict cd where JE.CompanyNameChars = cd.jpchar) CompanyNameConversion

    --Company address

    outer apply (select engchar from dbo.conversiondict cd where JE.CompanyAddressChars = cd.jpchar) CompanyAddressConversion

    )

    --Concatenate the English characters back together and make the final select:

    Select

    distinct

    --Use the stuff and XML method to smush the characters back together based on employeeID

    EmployeeID

    --Employee Name

    ,STUFF(

    (Select A.EmpNameEngChar as [text()]

    from Conversion A

    where A.EmployeeID = B.EmployeeID

    FOR XML PATH('')),1,0,'') AS FinalEmpNameEnglish

    --Company Name

    ,STUFF(

    (Select A.CompanyNameEngChar as [text()]

    from Conversion A

    where A.EmployeeID = B.EmployeeID

    FOR XML PATH('')),1,0,'') AS FinalCompanyNameEnglish

    --Company Address

    ,STUFF(

    (Select A.CompanyAddressEngChar as [text()]

    from Conversion A

    where A.EmployeeID = B.EmployeeID

    FOR XML PATH('')),1,0,'') AS FinalCompanyAddressEnglish

    from Conversion B

    order by EmployeeID

  • Not underestimating adaddio's solution which is the base for my solution, I just made some changes. Using a recursive query as a counter is a bad idea and can generate horrible results as shown here: http://www.sqlservercentral.com/articles/T-SQL/74118/.

    I changed the recursive query to use a ctetally and created an inline table-valued function to make it easier to code without affecting performance.

    The function is the following:

    CREATE FUNCTION [dbo].[TranslationJapEng](@pString [nvarchar](4000))

    RETURNS TABLE WITH SCHEMABINDING

    AS

    RETURN

    --===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...

    -- enough to cover NVARCHAR(4000)

    WITH E1(N) AS (

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

    ), --10E+1 or 10 rows

    E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows

    E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max

    cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front

    -- for both a performance gain and prevention of accidental "overruns"

    SELECT TOP (ISNULL(LEN(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4

    )

    --===== Replace the characters using the table.

    SELECT CAST((SELECT c.engchar + ''

    FROM cteTally t

    JOIN dbo.conversiondict c ON c.jpchar = SUBSTRING(@pString, t.N, 1)

    ORDER BY t.N

    FOR XML PATH('')) AS varchar(max)) Item

    ;

    GO

    With that function, your code would be something like this:

    SELECT Name.Item AS EmpName,

    Company.Item AS CompanyName,

    CompAddress.Item AS CompanyAddress

    FROM Japaneseemployees je

    CROSS APPLY dbo.TranslationJapEng( EmpName) Name

    CROSS APPLY dbo.TranslationJapEng( CompanyName) Company

    CROSS APPLY dbo.TranslationJapEng( CompanyAddress) CompAddress

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Luis Cazares (10/3/2014)


    Not underestimating adaddio's solution which is the base for my solution, I just made some changes. Using a recursive query as a counter is a bad idea and can generate horrible results as shown here: http://www.sqlservercentral.com/articles/T-SQL/74118/.

    I changed the recursive query to use a ctetally and created an inline table-valued function to make it easier to code without affecting performance.

    The function is the following:

    CREATE FUNCTION [dbo].[TranslationJapEng](@pString [nvarchar](4000))

    RETURNS TABLE WITH SCHEMABINDING

    AS

    RETURN

    --===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...

    -- enough to cover NVARCHAR(4000)

    WITH E1(N) AS (

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

    ), --10E+1 or 10 rows

    E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows

    E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max

    cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front

    -- for both a performance gain and prevention of accidental "overruns"

    SELECT TOP (ISNULL(LEN(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4

    )

    --===== Replace the characters using the table.

    SELECT CAST((SELECT c.engchar + ''

    FROM cteTally t

    JOIN dbo.conversiondict c ON c.jpchar = SUBSTRING(@pString, t.N, 1)

    ORDER BY t.N

    FOR XML PATH('')) AS varchar(max)) Item

    ;

    GO

    With that function, your code would be something like this:

    SELECT Name.Item AS EmpName,

    Company.Item AS CompanyName,

    CompAddress.Item AS CompanyAddress

    FROM Japaneseemployees je

    CROSS APPLY dbo.TranslationJapEng( EmpName) Name

    CROSS APPLY dbo.TranslationJapEng( CompanyName) Company

    CROSS APPLY dbo.TranslationJapEng( CompanyAddress) CompAddress

    Great solution Luis. I tested out of curiosity and your solution returned 100k translated rows in 41 seconds compared to 194 seconds for mine. I've seen the Inline CTE Driven Tally Table construct on here a few times now but this was a good example to really help me learn and understand it. I'll be sure to start working it into my code when necessary. Thanks!

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply