Recursive replace Japanese to English characters

Question

Recursive replace Japanese to English characters

rash3554

SSCommitted

Points: 1984
More actions
September 27, 2014 at 3:22 am

#397414

Hi there,
I have a table with Employeename, companyname, companyaddress all in Japanese in a table called Japanese employees that needs transiliteration.
They are all nvarchar(500).
Create Table JapaneseEmployees(EmpName nvarchar(500), CompanyName nvarchar(500), CompanyAddress nvarchar(500)).
Table has about 6 million records. I was told to transliterate this table to English.
insert into Japaneseemployees(N'エイミー',N'マイクロソフト',N'1メインロード、東京');
insert into Japaneseemployees(N'佐古',N'トヨタ',N'470サイドセント、東京');
I created a table to store Japanesechar & replacechar in this table
CREATE TABLE conversiondict (
ID int identity(1,1),
jpchar NVARCHAR(50) ,
engchar NVARCHAR(50)
)
INSERT INTO conversiondict VALUES(N'あ', 'a')
INSERT INTO conversiondict VALUES(N'い', 'i')
INSERT INTO conversiondict VALUES(N'う', 'u')
INSERT INTO conversiondict VALUES(N'え', 'e')
INSERT INTO conversiondict VALUES(N'お', 'o')
INSERT INTO conversiondict VALUES(N'か', 'ka')
INSERT INTO conversiondict VALUES(N'き', 'ki')
INSERT INTO conversiondict VALUES(N'く', 'ku')
INSERT INTO conversiondict VALUES(N'け', 'ke')
--this is only a apart of inserts..I have many more
I do not know how to start writing recursive replace. Can anyone give me hints?

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply

adaddio Mr or Mrs. 500 Points: 539 More actions · Answer 1

Try this. I changed some of your test data a bit but based on the test data I created it seems to work. Note...I only tested this on a 2 row test table so not sure of any performance implications so I'd recommend testing on a larger data set on your end before implementing.

Create Table dbo.JapaneseEmployees

(

--Added an identity column here b/c this will make it much easier to work with the records once we split them out

EmployeeID int identity(1,1)

,EmpName nvarchar(500)

,CompanyName nvarchar(500)

,CompanyAddress nvarchar(500)

)

--Insert test records...modified these inserts to make sure the characters are in the dbo.conversiondict table --below:

insert into Japaneseemployees values (N'?????'

,N'??????????'

,N'???');

insert into Japaneseemployees values (N'?????'

,N'???????'

,N'???');

--Create the conversion table:

CREATE TABLE dbo.conversiondict (

ID int identity(1,1),

jpchar NVARCHAR(50) ,

engchar NVARCHAR(50)

)

--Populate it:

INSERT INTO dbo.conversiondict VALUES(N'?', 'a')

INSERT INTO dbo.conversiondict VALUES(N'?', 'i')

INSERT INTO dbo.conversiondict VALUES(N'?', 'u')

INSERT INTO dbo.conversiondict VALUES(N'?', 'e')

INSERT INTO dbo.conversiondict VALUES(N'?', 'o')

INSERT INTO dbo.conversiondict VALUES(N'?', 'ka')

INSERT INTO dbo.conversiondict VALUES(N'?', 'ki')

INSERT INTO dbo.conversiondict VALUES(N'?', 'ku')

INSERT INTO dbo.conversiondict VALUES(N'?', 'ke')

/*Now we want to recursively split our records apart so we can get each Japanese character in the string on its own row

so we can then join back to the conversion table to find the English character equivalent

*/

--JapanEmp CTE is a recursive CTE to split out all the characters of the string into their own row

;with JapanEmp as

(

select EmployeeID,

substring(EmpName, 1, 1) as EmpNameChars,

stuff(EmpName, 1, 1, '') as EmpNameData,

substring(CompanyName, 1, 1) as CompanyNameChars,

stuff(CompanyName, 1, 1, '') as CompanyNameData,

substring(CompanyAddress, 1, 1) as CompanyAddressChars,

stuff(CompanyAddress, 1, 1, '') as CompanyAddressData,

1 as RowID

from Japaneseemployees

union all

select EmployeeID,

substring(EmpNameData, 1, 1) as EmpNameChars,

stuff(EmpNameData, 1, 1, '') as EmpNameData,

substring(CompanyNameData, 1, 1) as CompanyNameChars,

stuff(CompanyNameData, 1, 1, '') as CompanyNameData,

substring(CompanyAddressData, 1, 1) as CompanyAddressChars,

stuff(CompanyAddressData, 1, 1, '') as CompanyAddressData,

RowID + 1 as RowID

from JapanEmp

where

(

len(EmpNameData) > 0

or(len(CompanyNameData) > 0)

or(len(CompanyAddressData) >0)

)

--Start the actual converting here...join back to the convserion table to fetch the english characters

,Conversion as

(

select

JE.EmployeeID

,JE.RowID

,EmpNameConversion.EngChar as EmpNameEngChar

,CompanyNameConversion.EngChar as CompanyNameEngChar

,CompanyAddressConversion.EngChar as CompanyAddressEngChar

from JapanEmp JE

--Fetch the engchar conversions from our conversion table

--Employee Name

outer apply (select engchar from dbo.conversiondict cd where JE.EmpNameChars = cd.jpchar) EmpNameConversion

--Company Name

outer apply (select engchar from dbo.conversiondict cd where JE.CompanyNameChars = cd.jpchar) CompanyNameConversion

--Company address

outer apply (select engchar from dbo.conversiondict cd where JE.CompanyAddressChars = cd.jpchar) CompanyAddressConversion

)

--Concatenate the English characters back together and make the final select:

Select

distinct

--Use the stuff and XML method to smush the characters back together based on employeeID

EmployeeID

--Employee Name

,STUFF(

(Select A.EmpNameEngChar as [text()]

from Conversion A

where A.EmployeeID = B.EmployeeID

FOR XML PATH('')),1,0,'') AS FinalEmpNameEnglish

--Company Name

,STUFF(

(Select A.CompanyNameEngChar as [text()]

from Conversion A

where A.EmployeeID = B.EmployeeID

FOR XML PATH('')),1,0,'') AS FinalCompanyNameEnglish

--Company Address

,STUFF(

(Select A.CompanyAddressEngChar as [text()]

from Conversion A

where A.EmployeeID = B.EmployeeID

FOR XML PATH('')),1,0,'') AS FinalCompanyAddressEnglish

from Conversion B

order by EmployeeID

Luis Cazares SSC Guru Points: 183706 More actions · Answer 2

Not underestimating adaddio's solution which is the base for my solution, I just made some changes. Using a recursive query as a counter is a bad idea and can generate horrible results as shown here: http://www.sqlservercentral.com/articles/T-SQL/74118/.

I changed the recursive query to use a ctetally and created an inline table-valued function to make it easier to code without affecting performance.

The function is the following:

CREATE FUNCTION [dbo].[TranslationJapEng](@pString [nvarchar](4000))

RETURNS TABLE WITH SCHEMABINDING

AS

RETURN

--===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...

-- enough to cover NVARCHAR(4000)

WITH E1(N) AS (

SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

), --10E+1 or 10 rows

E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows

E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max

cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front

-- for both a performance gain and prevention of accidental "overruns"

SELECT TOP (ISNULL(LEN(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4

)

--===== Replace the characters using the table.

SELECT CAST((SELECT c.engchar + ''

FROM cteTally t

JOIN dbo.conversiondict c ON c.jpchar = SUBSTRING(@pString, t.N, 1)

ORDER BY t.N

FOR XML PATH('')) AS varchar(max)) Item

;

GO

With that function, your code would be something like this:

SELECT Name.Item AS EmpName,

Company.Item AS CompanyName,

CompAddress.Item AS CompanyAddress

FROM Japaneseemployees je

CROSS APPLY dbo.TranslationJapEng( EmpName) Name

CROSS APPLY dbo.TranslationJapEng( CompanyName) Company

CROSS APPLY dbo.TranslationJapEng( CompanyAddress) CompAddress

Luis C.
General Disclaimer:
Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

How to post data/code on a forum to get the best help: Option 1 / Option 2

adaddio Mr or Mrs. 500 Points: 539 More actions · Answer 3

Luis Cazares (10/3/2014)
Not underestimating adaddio's solution which is the base for my solution, I just made some changes. Using a recursive query as a counter is a bad idea and can generate horrible results as shown here: http://www.sqlservercentral.com/articles/T-SQL/74118/.
I changed the recursive query to use a ctetally and created an inline table-valued function to make it easier to code without affecting performance.
The function is the following:
CREATE FUNCTION [dbo].[TranslationJapEng](@pString [nvarchar](4000))
RETURNS TABLE WITH SCHEMABINDING
AS
RETURN
--===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...
-- enough to cover NVARCHAR(4000)
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), --10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
-- for both a performance gain and prevention of accidental "overruns"
SELECT TOP (ISNULL(LEN(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
)
--===== Replace the characters using the table.
SELECT CAST((SELECT c.engchar + ''
FROM cteTally t
JOIN dbo.conversiondict c ON c.jpchar = SUBSTRING(@pString, t.N, 1)
ORDER BY t.N
FOR XML PATH('')) AS varchar(max)) Item
;
GO
With that function, your code would be something like this:
SELECT Name.Item AS EmpName,
Company.Item AS CompanyName,
CompAddress.Item AS CompanyAddress
FROM Japaneseemployees je
CROSS APPLY dbo.TranslationJapEng( EmpName) Name
CROSS APPLY dbo.TranslationJapEng( CompanyName) Company
CROSS APPLY dbo.TranslationJapEng( CompanyAddress) CompAddress

Great solution Luis. I tested out of curiosity and your solution returned 100k translated rows in 41 seconds compared to 194 seconds for mine. I've seen the Inline CTE Driven Tally Table construct on here a few times now but this was a good example to really help me learn and understand it. I'll be sure to start working it into my code when necessary. Thanks!