September 27, 2014 at 3:22 am
Hi there,
I have a table with Employeename, companyname, companyaddress all in Japanese in a table called Japanese employees that needs transiliteration.
They are all nvarchar(500).
Create Table JapaneseEmployees(EmpName nvarchar(500), CompanyName nvarchar(500), CompanyAddress nvarchar(500)).
Table has about 6 million records. I was told to transliterate this table to English.
insert into Japaneseemployees(N'エイミー',N'マイクロソフト',N'1メインロード、東京');
insert into Japaneseemployees(N'佐古',N'トヨタ',N'470サイドセント、東京');
I created a table to store Japanesechar & replacechar in this table
CREATE TABLE conversiondict (
ID int identity(1,1),
jpchar NVARCHAR(50) ,
engchar NVARCHAR(50)
)
INSERT INTO conversiondict VALUES(N'あ', 'a')
INSERT INTO conversiondict VALUES(N'い', 'i')
INSERT INTO conversiondict VALUES(N'う', 'u')
INSERT INTO conversiondict VALUES(N'え', 'e')
INSERT INTO conversiondict VALUES(N'お', 'o')
INSERT INTO conversiondict VALUES(N'か', 'ka')
INSERT INTO conversiondict VALUES(N'き', 'ki')
INSERT INTO conversiondict VALUES(N'く', 'ku')
INSERT INTO conversiondict VALUES(N'け', 'ke')
--this is only a apart of inserts..I have many more
I do not know how to start writing recursive replace. Can anyone give me hints?
October 3, 2014 at 10:50 am
Try this. I changed some of your test data a bit but based on the test data I created it seems to work. Note...I only tested this on a 2 row test table so not sure of any performance implications so I'd recommend testing on a larger data set on your end before implementing.
Create Table dbo.JapaneseEmployees
(
--Added an identity column here b/c this will make it much easier to work with the records once we split them out
EmployeeID int identity(1,1)
,EmpName nvarchar(500)
,CompanyName nvarchar(500)
,CompanyAddress nvarchar(500)
)
--Insert test records...modified these inserts to make sure the characters are in the dbo.conversiondict table --below:
insert into Japaneseemployees values (N'?????'
,N'??????????'
,N'???');
insert into Japaneseemployees values (N'?????'
,N'???????'
,N'???');
--Create the conversion table:
CREATE TABLE dbo.conversiondict (
ID int identity(1,1),
jpchar NVARCHAR(50) ,
engchar NVARCHAR(50)
)
--Populate it:
INSERT INTO dbo.conversiondict VALUES(N'?', 'a')
INSERT INTO dbo.conversiondict VALUES(N'?', 'i')
INSERT INTO dbo.conversiondict VALUES(N'?', 'u')
INSERT INTO dbo.conversiondict VALUES(N'?', 'e')
INSERT INTO dbo.conversiondict VALUES(N'?', 'o')
INSERT INTO dbo.conversiondict VALUES(N'?', 'ka')
INSERT INTO dbo.conversiondict VALUES(N'?', 'ki')
INSERT INTO dbo.conversiondict VALUES(N'?', 'ku')
INSERT INTO dbo.conversiondict VALUES(N'?', 'ke')
/*Now we want to recursively split our records apart so we can get each Japanese character in the string on its own row
so we can then join back to the conversion table to find the English character equivalent
*/
--JapanEmp CTE is a recursive CTE to split out all the characters of the string into their own row
;with JapanEmp as
(
select EmployeeID,
substring(EmpName, 1, 1) as EmpNameChars,
stuff(EmpName, 1, 1, '') as EmpNameData,
substring(CompanyName, 1, 1) as CompanyNameChars,
stuff(CompanyName, 1, 1, '') as CompanyNameData,
substring(CompanyAddress, 1, 1) as CompanyAddressChars,
stuff(CompanyAddress, 1, 1, '') as CompanyAddressData,
1 as RowID
from Japaneseemployees
union all
select EmployeeID,
substring(EmpNameData, 1, 1) as EmpNameChars,
stuff(EmpNameData, 1, 1, '') as EmpNameData,
substring(CompanyNameData, 1, 1) as CompanyNameChars,
stuff(CompanyNameData, 1, 1, '') as CompanyNameData,
substring(CompanyAddressData, 1, 1) as CompanyAddressChars,
stuff(CompanyAddressData, 1, 1, '') as CompanyAddressData,
RowID + 1 as RowID
from JapanEmp
where
(
len(EmpNameData) > 0
or(len(CompanyNameData) > 0)
or(len(CompanyAddressData) >0)
)
)
--Start the actual converting here...join back to the convserion table to fetch the english characters
,Conversion as
(
select
JE.EmployeeID
,JE.RowID
,EmpNameConversion.EngChar as EmpNameEngChar
,CompanyNameConversion.EngChar as CompanyNameEngChar
,CompanyAddressConversion.EngChar as CompanyAddressEngChar
from JapanEmp JE
--Fetch the engchar conversions from our conversion table
--Employee Name
outer apply (select engchar from dbo.conversiondict cd where JE.EmpNameChars = cd.jpchar) EmpNameConversion
--Company Name
outer apply (select engchar from dbo.conversiondict cd where JE.CompanyNameChars = cd.jpchar) CompanyNameConversion
--Company address
outer apply (select engchar from dbo.conversiondict cd where JE.CompanyAddressChars = cd.jpchar) CompanyAddressConversion
)
--Concatenate the English characters back together and make the final select:
Select
distinct
--Use the stuff and XML method to smush the characters back together based on employeeID
EmployeeID
--Employee Name
,STUFF(
(Select A.EmpNameEngChar as [text()]
from Conversion A
where A.EmployeeID = B.EmployeeID
FOR XML PATH('')),1,0,'') AS FinalEmpNameEnglish
--Company Name
,STUFF(
(Select A.CompanyNameEngChar as [text()]
from Conversion A
where A.EmployeeID = B.EmployeeID
FOR XML PATH('')),1,0,'') AS FinalCompanyNameEnglish
--Company Address
,STUFF(
(Select A.CompanyAddressEngChar as [text()]
from Conversion A
where A.EmployeeID = B.EmployeeID
FOR XML PATH('')),1,0,'') AS FinalCompanyAddressEnglish
from Conversion B
order by EmployeeID
October 3, 2014 at 4:55 pm
Not underestimating adaddio's solution which is the base for my solution, I just made some changes. Using a recursive query as a counter is a bad idea and can generate horrible results as shown here: http://www.sqlservercentral.com/articles/T-SQL/74118/.
I changed the recursive query to use a ctetally and created an inline table-valued function to make it easier to code without affecting performance.
The function is the following:
CREATE FUNCTION [dbo].[TranslationJapEng](@pString [nvarchar](4000))
RETURNS TABLE WITH SCHEMABINDING
AS
RETURN
--===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...
-- enough to cover NVARCHAR(4000)
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), --10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
-- for both a performance gain and prevention of accidental "overruns"
SELECT TOP (ISNULL(LEN(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
)
--===== Replace the characters using the table.
SELECT CAST((SELECT c.engchar + ''
FROM cteTally t
JOIN dbo.conversiondict c ON c.jpchar = SUBSTRING(@pString, t.N, 1)
ORDER BY t.N
FOR XML PATH('')) AS varchar(max)) Item
;
GO
With that function, your code would be something like this:
SELECT Name.Item AS EmpName,
Company.Item AS CompanyName,
CompAddress.Item AS CompanyAddress
FROM Japaneseemployees je
CROSS APPLY dbo.TranslationJapEng( EmpName) Name
CROSS APPLY dbo.TranslationJapEng( CompanyName) Company
CROSS APPLY dbo.TranslationJapEng( CompanyAddress) CompAddress
October 6, 2014 at 10:33 am
Luis Cazares (10/3/2014)
Not underestimating adaddio's solution which is the base for my solution, I just made some changes. Using a recursive query as a counter is a bad idea and can generate horrible results as shown here: http://www.sqlservercentral.com/articles/T-SQL/74118/.I changed the recursive query to use a ctetally and created an inline table-valued function to make it easier to code without affecting performance.
The function is the following:
CREATE FUNCTION [dbo].[TranslationJapEng](@pString [nvarchar](4000))
RETURNS TABLE WITH SCHEMABINDING
AS
RETURN
--===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...
-- enough to cover NVARCHAR(4000)
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), --10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
-- for both a performance gain and prevention of accidental "overruns"
SELECT TOP (ISNULL(LEN(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
)
--===== Replace the characters using the table.
SELECT CAST((SELECT c.engchar + ''
FROM cteTally t
JOIN dbo.conversiondict c ON c.jpchar = SUBSTRING(@pString, t.N, 1)
ORDER BY t.N
FOR XML PATH('')) AS varchar(max)) Item
;
GO
With that function, your code would be something like this:
SELECT Name.Item AS EmpName,
Company.Item AS CompanyName,
CompAddress.Item AS CompanyAddress
FROM Japaneseemployees je
CROSS APPLY dbo.TranslationJapEng( EmpName) Name
CROSS APPLY dbo.TranslationJapEng( CompanyName) Company
CROSS APPLY dbo.TranslationJapEng( CompanyAddress) CompAddress
Great solution Luis. I tested out of curiosity and your solution returned 100k translated rows in 41 seconds compared to 194 seconds for mine. I've seen the Inline CTE Driven Tally Table construct on here a few times now but this was a good example to really help me learn and understand it. I'll be sure to start working it into my code when necessary. Thanks!
Viewing 4 posts - 1 through 3 (of 3 total)
You must be logged in to reply to this topic. Login to reply