Storing data in multiple languages

Question

Storing data in multiple languages

Andrew_Webster

SSChasing Mays

Points: 630
More actions
July 23, 2007 at 9:14 am

#174523

Hi guys.
I'm rebuilding a dismally written application. Originally it stored data in Access tables in both English and Spanish, with a column for English text and a column for Spanish text in each and every table that held textual data. Not exactly normalised, and I don't even have the VB 6 source code for the front end, but I'm pretty certain there's a nightmare of dynamic SQL hidden away inside.
The requirement for the rebuild is that is can handle multiple languages; not just English and Spanish but anything. There's no requirement to localize, it will be up to the user to select the language.
For each area of data, tablespace if you will, I can see two way of doing this: either have all the data tables consist essentially of keys that link to look-up tables that contain the data in the various languages with each table qualified by a language_id FK to a Languages table, or have just a top-level table qualified by language, with the tree of related data and look-ups just hanging off it.
Not very well explained, but if you've wresteled with this, I've a feeling you'll recognise my dilemma.
The first way seems to have the benefit of making it slightly easier to identity where data might be missing in one of the languages.
Any comments or suggestions? Have you ever had to stored data in multiple languages where the data is intended to all be the same just with duplicates in different languages?
TIA.

Viewing 6 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply

Michael Earl-395764 SSC Guru Points: 53873 More actions · Answer 1

I think the options you have explained are either to have enire records duplicated - once for each language, or to only have the text data duplicated. The simple answer to this is to only duplicate the text data. I do not know how complicated your data model is, but if you have a lot of non-text information stored in the database and you intend to simply add two records to each table - one for English and one for Spanish, you will have to make sure anything that updates a different field does so on both records. This is going to either be a lot of triggers, or simply be complicated each time you make an update.

So, regardless of the approach, you are probably going to have places in which you have more than one record and both need to be updated together. This is a data integrity issue that may be difficult to overcome. Ideally, if the text data is really supposed to be the same statement in multiple languages, you should be putting the change for every language in a single transaction or you run the risk of something being changed in one language and not another. I really hope you don't have to worry about this. Sadly, the original approach of multiple columns really solved this issue in some sense.

Personally, I would probably add a single table for all of the multi-language text. Have a primary key that indicates the text line and the language and then the text data itself. (using VARCHAR, NVARCHAR, or VARCHAR(MAX) rather than TEXT data type if possible). Then, I would have a text line ID in each table in the database rather than the text itself. This would keep the data duplication to a minimum. It does, however, make it more difficult to ensure the integrity of each language line.

A couple of questions though. First, do you keep text for each language on every line - I mean, when someone changes some text, do they have to do so in every language, or do you just change the text in the language they are working in? How often does this text data change? If this is setup information for displaying labels or other information within the application, there are possibly better approaches.

Andrew_Webster SSChasing Mays Points: 630 More actions · Answer 2

Well, I've already gone with separating out the textual values. It means that the actual data tables are a bit abstract, but certainly beautifully normalised!

At the moment, this version will simply take data entered in the old horror, then transferred, so handling the updates in the two current languages is out of my hands. However, looking ahead, we'll be building a data-entry tool for the new version, and at that stage, I think I'll be either using triggers (and providing loads of documentation for the next guy) to create records in all languages or it might be built into the UI/Business Rules. I was also thinking of having a field in the languages table that can hold a default "<this value needs to be entered>" message in each language that can be inserted as required.

I may also add a description field to the otherwise totally abstract tables in English as a default language, as well as having an English set of entries in the text tables.

When it comes to making the updates and keeping the different translations in sync, that is and always will be a purely manual process, although I can certainly build something to bring incomplete records to the user's attention. Moreover, the text changes once in a blue moon - it's scientific information - so this is not going to be a huge issue.

I'm not keeping ALL the text in a single table for the entire system - I've tried that before and usually found that it get horribly messy. It's rare that there is enough text re-use in more than one area to warrant it. In fact, in the old version, there's a "Notes" table and a "Descriptions" table linked all over the place. A day of checking it (that's the kind of job where I adore Access!) I've shown that none of the entries is used more than once! So I'm shifting them all back to live with the data to which they pertain, albeit in a bunch of <this_kind_of_data>_text fields. It's also allowing me to tailor the nvarchar widths to something appropriate, rather than the old system that fudged everything into 255. (Pick-a-number style design).

It's starting to form a shape now. I love this - taking someone else's messed up data and showing that instead of having nearly 200 tables with dozens of them having almost identical structures and culling and normalising it down to around fifty tables. Much more elegant, although the data transfer looks like it's going to have around 200 steps to it! A huge chunk of the work is going into building an Access data transfer tool (DTS is not an option for a bunch of reasons outside my control) and building this tool is actually driving the final data structure.

Thanks for the reply. I work with one other developer, who's a UI/object-head, and doesn't appreciate the beauty of normalisation and sweetly tuned indexes! It's good to unload data-thoughts to other data-heads.

Anders Pedersen SSChampion Points: 11410 More actions · Answer 3

I would store one record per language, just like sys.messages does. Makes it easy to add another language later if determined there is a need for it.

bledu SSChampion Points: 12335 More actions · Answer 4

A huge chunk of the work is going into building an Access data transfer tool (DTS is not an option for a bunch of reasons outside my control) and building this tool is actually driving the final data structure.

if you can be bothered, you can use smo with one of the .net languages to do some of the stuff that dts might have done.

and also remember to use nvarchar, nchar ... etc instead of varchar, char ... etc

is too much to ask what the application is doing? you could consider using the internationalisation features in .net

Everything you can imagine is real.

Andrew_Webster SSChasing Mays Points: 630 More actions · Answer 5

I'm not writing the front end, just handling the data. It's an application to store and display taxonomic information about biological species, a bit like ITIS. The client has specified that the front end has to be written in PHP, so we've got none of the snuggly .NET luxuries. I've got zero experience with PHP and it's another developer handling that, so unless/until we discover that PHP has a miraculous bunch of i18n powers, then we're doing it all in the data.

I'm going for handling the language just like it's another attribute of the data. In some areas where there's a tree of related look-up data I'm using the language_id in each table to form part of the join between tables. This allows me to have at each level a table that contains no more than an id and a sort_order, and a second table which uses that id as a FK, has a PK of it's own, a nvarchar field to hold the text and a language_id field. So then I'll write (maybe generate, might not be worth the effort writing the template) a bunch of SPs that take the language_id as a parameter and return the joined tables so that as far as the front end is concerned, the pair are just one table of an id, a sort order, and the text in the current language.