Introduction to Internationalization & Localization
Globalization of software applications
Updated 2011.03.04 20:34 +0100
By Michael Suodenjoki, michael@suodenjoki.dk.
Version 1.1, March 2011 - Minor update in Localization Tools section.
Version 1.0, October 2001
Introduction
I have been interested in internationalization and localization for some years now. My entry point into the field has been from the technical side.
I work in a company that has targeted the international software market for over a decade, but for some reason the software (mainly C++
based Windows programs) has not yet met its real challenges in some of the more complex regions of the world, e.g. Asian or Arabic
countries. The programs must therefore be ready to cope with the special problems of these regions when they hopefully occur (character sets, scripts,
translation, input method editors (IMEs), Unicode, resource files, multilingual user interfaces etc.).
On this page I have collected most of the information I have gathered from the Internet. It includes a large collection of links
to relevant companies, organizations and articles. I hope that you, just as I, can use them.
For a newcomer the localization and translation industries can be quite confusing. The industry is divided into many different
organizations and companies selling a wide range of products that are often
difficult to understand and to cope with. A
lot of special terminology is used and many different (file) formats are
available. Luckily there are tendencies towards more standardization: some organizations have been trying to standardize file interchange
formats. Typically these organizations are supported by a group of companies providing the actual tools and products, which often cost a
lot of money. And paying does not necessarily guarantee that your requirements are met.
For an introduction to the localization subject you may read my "A small introduction" section below; however, I
suggest browsing through the terminology section below first.
- Character
- A character in software development is an abstraction. The natural understanding of a character is that of a written character; one
intuitively associates a certain graphic representation with a given character. This is what is called a glyph: the actual shape of a
character image. A glyph appears on a display or is produced by a printer. Naturally, there can be many such representations: the
characters ABC can, for example, be drawn in different typefaces, weights or styles.
A set of glyphs is called a font. So indeed, one aspect of a character is its graphic representation. However, for the purpose of data
processing in software development, a character also needs to have a data representation as a sequence of bits. This is called a code.
- Character Code
- A character code is a sequence of bits representing a character. Again, there are many such representations. The character a, for
instance, can be represented as 0x61 in ASCII, as 0x81 in EBCDIC or as 0x0061 in 16-bit Unicode. From this example you can see that not only the bit
pattern but also the number of bits used for representing a character can vary; the bit pattern representing the character a has 16 bits in
16-bit Unicode but only 8 bits in EBCDIC and in ASCII as it is normally stored (ASCII itself defines 7-bit codes).
- Codeset
- A character codeset is a 1:1 mapping between characters and character codes.
- Encoding
- A character encoding scheme is a set of rules for translating a byte sequence into a sequence of character codes.
- g11n
- The abbreviation for globalization - 11 characters between g and n.
- Globalize, Globalization
- The term is used for the internationalization and the localization processes together, or for the concept of producing software that works
globally. The Localization Industry Standards Association (LISA) defines globalization as follows:
"Globalization addresses the business issues associated with taking a product global. In the globalization of high-tech products this
involves integrating localization throughout a company, after proper internationalization and product design, as well as marketing, sales, and
support in the world market."
- i18n
- The abbreviation for internationalization - 18 characters between i and n.
- Internationalize, Internationalization
- The process of enabling your source for the international market. Internationalization is the design and development of software in a way
that allows it to be localized (translated) to other locales (languages) without the need to alter the source code. Common errors are due to
both cultural and locale differences. The Localization Industry Standards Association (LISA) defines internationalization as follows:
"Internationalization is the process of generalizing a product so that it can handle multiple languages and cultural conventions without the
need for re-design. Internationalization takes place at the level of program design and document development."
- Language Engineering
- The process of converting human knowledge of a language into a computer model, so that computer programs can utilize this knowledge, e.g.
for automatic translation. The Euromap Report, published on behalf of the EUROMAP Consortium in 1998, defined language engineering as follows:
"Language engineering is the application of knowledge of written and spoken language to the development of information, transaction and
communication systems, so that they can recognize, understand, interpret, and generate human language. Language technologies include, for
example, automatic or computer assisted translation (CAT), speech recognition and synthesis, speaker verification, semantic searches and
information retrieval, text mining and fact extraction."
- Locale
- The features of the user's environment that are dependent on language, country, and cultural conventions. The locale determines conventions
such as sort order; keyboard layout; and date, time, number and currency formats. In Windows, locales usually provide more information about
cultural conventions than about languages. Simply put, a locale is a collection of user preference information related to the
user's language and sub-language.
- l10n
- The abbreviation for localization - 10 characters between l and n.
- Localize, Localization
- The process of adapting, translating and customizing a product for a specific market (for a specific locale). The Localization Industry
Standards Association (LISA) defines localization as follows:
"Localization involves taking a product and making it linguistically and culturally appropriate to the target locale (country/region and
language) where it will be used and sold."
- Multilingual, Multilanguage
- Supporting more than one language simultaneously. Often implies the ability to handle more than one script or character set.
- Script
- A system of characters used to write one or several languages.
- Translation
- The process of translating a piece of text from one language (the source language) to another language (the target language). The
translation process is one of the major parts of localization, but a localization project also involves many
other tasks in addition to translation, such as project management, software engineering, testing, and desktop publishing.
- Translation Memory
- A collection of translations of words, terms, phrases or sentences from a source language to a target language. Tools can use a translation memory
to automate translation more or less intelligently, to suggest translations to the translator, or to ensure that the
same words, terms, phrases or sentences of the source language are always translated into the same words, terms, phrases or sentences of the target
language. Translation memories are typically stored in standard file formats (e.g. TMX or TBX) or in databases. A simple example is the Microsoft Glossaries, which provide the translations of the most commonly used
Windows terms (e.g. for menu item texts). Note that a file representing a translation memory is not necessarily fit for
exchanging the translations of one particular application (in a localization package). It merely
represents typical translations, translations for a particular industry (glossaries), or translations for a suite of products, which the
translator can use to help (more or less automatically) with the translation.
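To make the Character Code and Encoding entries above concrete, here is a minimal Python sketch that prints the code of the letter a in three codesets. The codec names are Python's own; "cp037" is just one of several EBCDIC variants, and "utf-16-be" stands in for the 16-bit form of Unicode mentioned above. The helper name hex_code is made up for this illustration.

```python
def hex_code(encoded: bytes) -> str:
    """Render a byte sequence as a hexadecimal character code, e.g. 0x0061."""
    return "0x" + encoded.hex().upper()


letter = "a"

# The same character, three different character codes.
ascii_code = letter.encode("ascii")       # one byte
ebcdic_code = letter.encode("cp037")      # one byte (cp037 is one EBCDIC variant)
utf16_code = letter.encode("utf-16-be")   # two bytes, big-endian

for name, code in (("ASCII", ascii_code),
                   ("EBCDIC", ebcdic_code),
                   ("UTF-16", utf16_code)):
    print(name, hex_code(code), len(code) * 8, "bits")

# An encoding scheme also works in the other direction:
# it turns a byte sequence back into characters.
decoded = bytes([0x81]).decode("cp037")   # the EBCDIC byte 0x81 is the letter a

# In a variable-width encoding such as UTF-8 one character can take several bytes.
e_acute = "\u00e9".encode("utf-8")        # the character é (U+00E9) becomes two bytes
```

The point of the sketch is only that a character and its stored bytes are different things; which bytes you get depends entirely on the codeset and encoding chosen.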
When you begin to internationalize your software product there are several steps that must be considered and handled. In this small
introduction I will try to describe these steps. I'm not a professional localizer, so there may be issues that I have described incorrectly or
too simply. But let's start:
-
Internationalization. First you must internationalize your source code. This means that the code should be programmed/authored in
such a manner that it doesn't depend on any specific locale, that is, on a language or on cultural conventions. All the locale-dependent material
should preferably be put into separate files, which can later be localized (translated) into a new locale (language). This
is not a simple process and it involves many considerations that depend on the nature of your source code (whether it is e.g. C++ source
files or HTML pages). You can buy thick books describing these considerations. It's important that all involved
personnel (programmers and authors) are told what to do to implement internationalization (e.g. via standards).
-
Preparation (pre-processing). When the source code has been internationalized, the locale-dependent part (the localization
source) is ready to be localized. The hardest work of the localization is often the basic translation of the text. The text to translate is
provided in a source language (e.g. English) and the language to translate to is called the target language (e.g. Danish).
Often you cannot translate the text yourself because you don't speak or understand the target language. Therefore you must engage a
third party to do the translation. This could be a person (e.g. a freelance translator), a translation company or maybe even
your customer. In any case you must supply the source to be translated in some form to the third party. Often you need to prepare the
localization source in some way, either because it must be packed (for convenient transport) or because extra information is needed
during translation. Often both apply. The extra information could be such things as the source language id, the target
language id, information about how to translate, tone of language, context information, your standard glossaries (terms), the current
status of the translation etc. It can be very complex, and it often is. You can spend (a lot of) money buying tools
that help you prepare your locale-dependent source, but it is not easy to find a tool that suits your specific needs. Hopefully
standardization will help in the future.
-
Localization. When you have prepared your localization package you can send it to localization (translation). A couple of weeks
later you hopefully receive the result of the localization from the third party (the localization target). The ultimate goal is that the
localization package contains everything needed for localization, but it is typically not that simple. Translation, for example, is often
highly dependent on the context, or the localizer must be an expert in the product to be able to localize it satisfactorily. Such things
are difficult to capture in a software file.
-
Merging. When you have received the result of the localization you extract the localization target from the localization package
and merge it into your new localization source. If your version of the localization source hasn't changed since the last time you sent it
to localization, the merge process is simple. Otherwise the merge process can be quite complex and will most likely need human intervention.
-
Post-processing. To finalize you sometimes need to post-process the localization source (e.g. compile or link) to produce the
final localized product. This step should preferably be as independent of the locale-independent source as possible. The best
situation occurs when you can distribute the localization source (or the post-processed localization source) by itself to the customers,
thereby providing them with a system that miraculously supports a new locale.
-
Quality Assurance. When you have post-processed you can start checking whether everything is localized correctly. Again this is
difficult. Often it is a huge amount of text that must be checked and the product itself must be tested rigorously. And as a result you
may well have to start a new round of preparation for localization.
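The merging step above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the localization source is modelled as a dictionary from a string id to its text, and the function name merge_translations and the ids (IDS_OPEN etc.) are made up for the example. Real merge tools have to deal with much more (dialog layouts, renamed ids, fuzzy matches).

```python
def merge_translations(old_source, new_source, translated):
    """Merge a returned translation into a newer version of the source.

    old_source: id -> source text as it was when sent to localization
    new_source: id -> source text as it is now
    translated: id -> target text received from the localizer
    Returns (merged, needs_review).
    """
    merged = {}
    needs_review = []
    for string_id, text in new_source.items():
        unchanged = old_source.get(string_id) == text
        if unchanged and string_id in translated:
            # Source text is the same as when it was translated: reuse it.
            merged[string_id] = translated[string_id]
        else:
            # New or changed string: keep the source text and flag it
            # so that a human can send it to translation again.
            merged[string_id] = text
            needs_review.append(string_id)
    return merged, needs_review


old = {"IDS_OPEN": "Open", "IDS_SAVE": "Save"}
new = {"IDS_OPEN": "Open", "IDS_SAVE": "Save file", "IDS_PRINT": "Print"}
danish = {"IDS_OPEN": "Åbn", "IDS_SAVE": "Gem"}

merged, needs_review = merge_translations(old, new, danish)
print(merged)        # IDS_OPEN is reused, the other two fall back to English
print(needs_review)  # ['IDS_SAVE', 'IDS_PRINT']
```

If the source has not changed since it was sent out, needs_review stays empty, which is the simple case described above.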
It is clear that localization by the third party is most efficient if it is possible to actually see the result of the localization
in place (directly in the product). It is much easier to see whether localization has succeeded in the real, live product. Thus the
post-processing should preferably be available to the third-party localizer. Often, though, this is difficult due to the complexity of the
post-processing (e.g. compilations or third-party tools that need to be licensed).
The degree of an application's internationalization can be divided into four major levels:
- No international support: The application works in one language. If that language is not English, it probably works on only
specific language versions of Windows.
- Locale-dependent source: Different code bases must be written and maintained for European, Far Eastern and Middle Eastern
versions.
- Single-source, locale-dependent binary: A single code base is written but separate compilations must be made for different
languages or different Windows versions.
- Single-source, single-binary: A single code base and single compilation satisfies all language and platform versions.
Top tips to ensure your code is internationalized (courtesy of http://www.alchemysoftware.ie/work/workzone.html)
- Eliminate UI length restrictions. Translated strings are typically longer than the English originals.
- Ensure support for accented characters and for double-byte characters.
- Check for hard coded strings
- Enable support for foreign keyboard layouts
- Avoid fixed date, time, currency or number formats
- Avoid country specific language or jargon
- Avoid text in bitmaps as they are hard to edit
Note that this list of tips is very small; there are many more things to consider.
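Two of the tips above (no hard-coded strings, no fixed date formats) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for the example: the catalog layout, the keys, the function names and the date patterns are hand-written, not taken from any real locale database. A real application would normally take such data from the operating system or from a library such as ICU.

```python
from datetime import date

# Hypothetical externalized resources: one catalog per locale.
# In a real product these would live in separate, translatable files.
CATALOGS = {
    "en": {"greeting": "Welcome, {name}!", "date_format": "%m/%d/%Y"},
    "da": {"greeting": "Velkommen, {name}!", "date_format": "%d-%m-%Y"},
}
DEFAULT_LOCALE = "en"


def message(locale, key, **values):
    """Look up a text by key, falling back to the default locale."""
    catalog = CATALOGS.get(locale, CATALOGS[DEFAULT_LOCALE])
    template = catalog.get(key, CATALOGS[DEFAULT_LOCALE][key])
    return template.format(**values)


def format_date(locale, day):
    """Format a date with the pattern stored for the locale."""
    return day.strftime(message(locale, "date_format"))


updated = date(2011, 3, 4)
print(message("da", "greeting", name="Michael"))  # Velkommen, Michael!
print(format_date("da", updated))                 # 04-03-2011
print(format_date("en", updated))                 # 03/04/2011
print(message("fr", "greeting", name="Michael"))  # no "fr" catalog: English fallback
```

The code itself contains no user-visible text and no date layout; adding a locale means adding a catalog, not changing the program.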
List of some localization/translation memory tools available:
4.1 Internationalization Examination Tools
This kind of tool examines your programs for possible problems when they are used in different locales (e.g. with respect to
language, type of operating system, character sets and so on).
4.2 Internationalization Programming Libraries
Programming libraries that help you internationalize your source code.
- International Components for Unicode (ICU). Provides an open source set of components in C/C++ and Java on a variety of platforms (including Windows).
- The Dinkum CoreX library. Provides a set of character set converters (std::codecvt) built
for Standard C++'s locale/facet framework.
- GNU's libiconv library provides a set of C functions that support (character)
conversion between a wide set of encodings.
- GNOME libunicode library. It covers character set conversion, character
properties, decomposition etc. The GNU libiconv library is probably a better choice, as it is more complete and more actively maintained. Note
that there are a few different libraries named libunicode out there. I know of one
libunicode at SourceForge which implements a set of C functions for handling Unicode strings. Essentially these mirror the normal string
functions available in C, such as strcpy and strlen. Unicode versions of these are already available in most of today's C++ compilers.
- Rosette Core Library for Unicode. Rosette Core Library for
Unicode enables software engineers to quickly add support for over 150 of the world’s languages to their applications. Rosette Core Library
for Unicode is a Unicode development library built based on Basis Technology’s experience implementing multilingual compliance into
mission-critical systems in many different environments. Developers deploying Rosette Core Library for Unicode achieve multilingual support in
their applications efficiently and economically.
- Free recode library by François Pinard. I do not know much about
this library but it seems to support more encodings than GNU's libiconv library. However, the API does not follow any standard.
- Microsoft Layer for Unicode (MSLU). Provides a layer for
running Unicode enabled applications on Windows 9x, ME.
- RapidSolution (German) has the RapidTranslation.
Note that C++ has its own standard support for locales in the Standard C++ Library (often loosely referred to as the STL). However, it does not
include a standard way of converting between encodings.
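What "converting between encodings" means in practice can be shown with a short Python sketch. It uses only Python's built-in codecs, not any of the libraries listed above, and the function name transcode is made up for the example. The idea is the same as in those libraries: decode the bytes with the source encoding, then encode the characters with the target encoding.

```python
def transcode(data: bytes, source_encoding: str, target_encoding: str) -> bytes:
    """Convert a byte sequence from one encoding to another.

    The bytes are first decoded to characters, then re-encoded.
    Raises UnicodeError if the data does not fit either encoding.
    """
    return data.decode(source_encoding).encode(target_encoding)


text = "p\u00e5 dansk"                    # "på dansk", with a non-ASCII letter
latin1_bytes = text.encode("latin-1")     # å is the single byte 0xE5
utf8_bytes = transcode(latin1_bytes, "latin-1", "utf-8")  # å becomes 0xC3 0xA5

# Assuming the wrong source encoding does not fail here; it silently
# produces garbage characters instead of the original text.
garbled = utf8_bytes.decode("latin-1")

print(latin1_bytes, utf8_bytes, garbled)
```

The last lines show why the encoding of a file must be known and not guessed: the conversion succeeds technically, but the å is turned into two unrelated characters.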
- [Dr.Intl. 2002]
- Developing International Software by Dr. International, Microsoft Press, October 2002. An updated version of Nadine Kano's
book from 1995 [Kano 1995]. More info at
http://www.microsoft.com/globaldev/DIS_v2/disv2.asp
- [Deitsch & Czarnecki 2001]
- JAVA Internationalization by Andrew Deitsch and D. Czarnecki,
O'Reilly, 2001. ISBN 0-596-00019-7.
- [Esselink 2000]
- A Practical Guide for Localization (2nd Edition) by Bert Esselink, John Benjamins Pub. Co., 2000. ISBN 1588110060. For more information see
www.locguide.com
- [Kaplan 2000]
- Internationalization with Visual Basic by Michael S.
Kaplan, Sams, 2000. ISBN 0-672-31977-2. More information at
www.i18nwithvb.com
- [Langer & Kreft 2000]
- Standard C++ IOStreams and Locales by Angelika Langer &
Klaus Kreft, Addison-Wesley, 2000. ISBN 0-201-18395-1. Langer &
Kreft have also published an article The Locale Framework in
the magazine C++ Report, September 1997, pp. 58-66(69). More info at
http://home.camelot.de/langer/Articles/Internationalization/I18N.htm
- [Schmidt 2000]
- International Programming for Microsoft Windows by David A. Schmidt, Microsoft Press, 2000. ISBN 1-57231-956-9. Essential
guidelines for globalizing and localizing your software with examples in Microsoft Visual C++ 6.0. Covers features for Windows 2000.
- [Unicode 2000]
- The Unicode Standard: Version 3.0 by The Unicode
Consortium, Addison-Wesley, 2000. ISBN 0-201-61633-5. See also
www.unicode.org.
- [Lunde 1999]
- CJKV Information Processing by Ken Lunde, O'Reilly, 1999.
ISBN 1-56592-224-7. More information at
www.oreilly.com/catalog/cjkvinfo
- [Ott 1999]
- Global Solutions for Multilingual Applications by Chris
Ott, Wiley, 1999. ISBN 0-471-34827-9
- [Kano 1995]
- Developing International Software by Nadine Kano. Microsoft Press, 1995. ISBN 1-55615-840-8. For Windows 95 and Windows NT.
A handbook for international software design. Can be read online at http://www.microsoft.com/globaldev/dis_v1/disv1.asp or via MSDN
at
http://msdn.microsoft.com/library/books/devintl/S24AA.htm.
Bert Esselink has in his book [Esselink 2000] a further reading section that is also available on the Internet at www.locguide.com/references/publications/books.htm. It's a huge collection
of reference material, though many of the books are outdated.
Furthermore Sybase has a list of books on internationalization and
localization. Again many of them are rather outdated, but the list covers a broader variety of operating systems such as Macintosh,
internationalization in libraries such as X Windows, and more general usability books.
6.2 Localization Information Portals
6.3 Newsgroups
6.4 Mailing Lists
6.5 Miscellaneous / Not yet grouped
6.6 Private persons interested in localization
This section contains a compiled list of file formats used by the localization and translation industry. Some of the formats are
openly standardized whereas others are proprietary to a company. The list contains only a limited set of the file formats of source files, that is, the
original files that contain the text to be translated.
If you know of other formats please contact me.
Extension | Base format | Description | Sample Tool | Proprietor
.rc | ANSI Text | Resource Script files containing definitions of Windows resources such as dialogs, icons, bitmaps, menus and textual strings. | Microsoft Resource Editor, Visual Studio or normal text editor | Microsoft www.microsoft.com
.resx | XML | .NET Resource Script files (new .rc format for the .NET platform) | | Microsoft www.microsoft.com
.res | Binary | Resource Files - contain compiled Windows Resource Script files. Resource files are stored directly in executables (exe/dll) as Unicode. | Microsoft Resource compiler (rc.exe) | Microsoft www.microsoft.com
.resources | Binary | .NET Resource Files - contain compiled .NET Resource Script files. Resource files are stored directly in executables (exe/dll). | | Microsoft www.microsoft.com
.xlf | XML | XLIFF - XML Localization Interchange File Format. Provides an open standard for transporting text to be translated. | | OASIS http://www.oasis-open.org/committees/xliff/
.xml | XML | OpenTag - Another format for translation memory exchange. | |
.tmx | XML | Translation Memory eXchange format. A format containing information about already translated words, phrases or sentences. | | LISA www.lisa.org
.ttk | ? (Binary) | Translation Tool Kits. | CATALYST™ | Alchemy Software www.alchemysoftware.ie
.skl | Binary | ? | RWS Tools (Rainbow) | RWS Group www.translate.com
.itd | ? | Intermediate Translation Document | SDLX Tools | SDL International www.sdlintl.com
.tdb | ? | Terminology Database. A kind of translation memory with special focus on terminology (words and phrases). | SDLX Tools |
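To give an impression of what one of the XML formats in the table looks like, here is a minimal Python sketch that reads a tiny hand-written TMX document into a source-to-target dictionary. The sample data and the function name load_tmx are made up for the example; the element names (tmx, header, body, tu, tuv, seg) follow the TMX format, but the sketch ignores most of it (inline markup inside seg, properties, notes, regional language codes such as en-US).

```python
import xml.etree.ElementTree as ET

# A tiny hand-written sample, not taken from any real product.
TMX_SAMPLE = """<tmx version="1.4">
  <header creationtool="sketch" creationtoolversion="1.0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Open</seg></tuv>
      <tuv xml:lang="da"><seg>Åbn</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Save</seg></tuv>
      <tuv xml:lang="da"><seg>Gem</seg></tuv>
    </tu>
  </body>
</tmx>"""

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def load_tmx(text, source_lang, target_lang):
    """Return a dict mapping source segments to target segments."""
    root = ET.fromstring(text)
    memory = {}
    for unit in root.iter("tu"):
        # Collect the plain text of each language variant in this unit.
        segments = {variant.get(XML_LANG): variant.findtext("seg")
                    for variant in unit.findall("tuv")}
        if source_lang in segments and target_lang in segments:
            memory[segments[source_lang]] = segments[target_lang]
    return memory


memory = load_tmx(TMX_SAMPLE, "en", "da")
print(memory.get("Open"))   # Åbn
print(memory.get("Print"))  # None: not in the memory, must be translated
```

A translation tool would use such a lookup to suggest or reuse translations; anything missing from the memory still has to go to a translator.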