FOSS Localization/Annex B: Technical Aspects

This annex discusses the technical details of localization in more depth. The aim is to give implementers the information they need to start localization work. It is not, however, intended to be a hands-on cookbook.

Unicode

As a universal character set that aims to include all characters of the world, Unicode originally assigned code points to its characters as 16-bit integers, which means that up to 65,536 characters could be encoded. However, due to the huge set of CJK characters, this became insufficient, and Unicode has since extended the code space to the range 0 to 10FFFF. A code point now needs up to 21 bits, and the code space supports up to 1,114,112 code points.

Planes

A Unicode code point is a numeric value between 0 and 10FFFF (hexadecimal). The code space is divided into 17 planes of 64K (65,536) code points each. In Unicode 4.0, the allocated planes are Planes 0, 1, 2 and 14.

Plane 0, ranging from 0000 to FFFF, is called Basic Multilingual Plane (BMP), which is the set of characters assigned by the previous 16-bit scheme.

Plane 1, ranging from 10000 to 1FFFF and called Supplementary Multilingual Plane (SMP), is dedicated to lesser used historic scripts, special-purpose invented scripts and special notations. These include Gothic, Shavian and musical symbols. Many more historic scripts may be encoded in this plane in the future.

Plane 2, ranging from 20000 to 2FFFF and called Supplementary Ideographic Plane (SIP), is the spillover allocation area for those CJK characters that cannot fit into the blocks for common CJK characters in the BMP.

Plane 14, ranging from E0000 to EFFFF and called Supplementary Special-purpose Plane (SSP), is for some control characters that do not fit into the small areas allocated in the BMP.

There are two more reserved planes: Plane 15 and Plane 16, for private use, where no code point is assigned.
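
As a minimal sketch of the plane layout described above, the plane of a code point is simply its value divided by 10000 (hexadecimal). The helper name plane_of and the short labels in PLANE_NAMES are chosen here for illustration only.

```python
def plane_of(code_point: int) -> int:
    """Return the plane number (0-16) of a Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %r" % code_point)
    # Each plane holds 0x10000 code points, so the plane is the
    # value of the bits above the low 16 bits.
    return code_point >> 16


# Illustrative labels for the planes named in the text.
PLANE_NAMES = {0: "BMP", 1: "SMP", 2: "SIP", 14: "SSP",
               15: "private use", 16: "private use"}

for cp in (0x0E01, 0x10330, 0x20000, 0xE0001, 0x10FFFF):
    plane = plane_of(cp)
    print("U+%04X is in plane %d (%s)"
          % (cp, plane, PLANE_NAMES.get(plane, "unallocated")))
```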

Basic Multilingual Plane

The Basic Multilingual Plane (BMP), or Plane 0, is the plane most commonly used in general documents. Code points are allocated for common characters in contemporary scripts, with exactly the same set as ISO/IEC 10646-1, as summarized in Figure 2. Note that the code points from E000 to F8FF are reserved for private use by vendors. No character is assigned in this area.

Character Encoding

There are several ways of encoding Unicode strings for information interchange. One may simply represent each character using a fixed-size integer (called a wide char), which is defined by ISO/IEC 10646 as UCS-2 and UCS-4, where 2-byte and 4-byte integers are used, respectively,^[1] and where UCS-2 is for the BMP only. But the common practice is to encode the characters as sequences of code units using the encoding forms UTF-8, UTF-16 and UTF-32, whose code units are 8-bit, 16-bit and 32-bit integers, respectively.^[2] UTF-8 and UTF-16 are variable-length, while UTF-32 always uses one code unit per character. There is also UTF-7 for e-mail transmissions that are strictly 7-bit, but UTF-8 is safe in most cases.

UTF-32

UTF-32 is the simplest Unicode encoding form. Each Unicode code point is represented directly by a single 32-bit unsigned integer. It is, therefore, a fixed-width character encoding form. This makes UTF-32 an ideal form for APIs that pass single character values. However, it is inefficient in terms of storage for Unicode strings.

UTF-16

Figure 2: Unicode Basic Multilingual Plane

UTF-16 encodes code points in the range 0000 to FFFF (i.e. BMP) as a single 16-bit unsigned integer.

Code points in supplementary planes are instead represented as pairs of 16-bit unsigned integers. These pairs of code units are called surrogate pairs. The values used for the surrogate pairs are in the range D800 - DFFF, which are not assigned to any character. So, UTF-16 readers can easily distinguish between single code unit and surrogate pairs. The Unicode Standard ^[3] provides more details of surrogates.
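
The surrogate mechanism can be sketched as follows. This is an illustration of the arithmetic only, with function names chosen for this example: 10000 (hexadecimal) is subtracted from the code point, and the remaining 20 bits are split into two 10-bit halves that are added to D800 and DC00.

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a supplementary code point into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points need surrogates")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # lower 10 bits
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Combine a (high, low) surrogate pair back into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+1D11E is a musical symbol in Plane 1.
high, low = to_surrogates(0x1D11E)
print("U+1D11E -> %04X %04X" % (high, low))
```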

UTF-16 is a good choice for keeping general Unicode strings, as it is optimized for characters in the BMP, which account for the vast majority of characters in Unicode text. For such text, it consumes about half of the storage required by UTF-32.
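
The storage trade-off can be checked with a short experiment. The sample string below (a Thai word followed by an ASCII word) is an arbitrary choice for illustration; the proportions differ for other texts.

```python
# Six Thai characters, a space and the ASCII word "Unicode": 14 BMP characters.
text = "\u0E2A\u0E27\u0E31\u0E2A\u0E14\u0E35 Unicode"

# The "-be" variants are used so that no byte order mark is counted.
sizes = {}
for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
    sizes[encoding] = len(text.encode(encoding))
    print("%-10s %3d bytes" % (encoding, sizes[encoding]))
```

For this BMP-only sample, UTF-16 needs two bytes per character and UTF-32 needs four, while UTF-8 needs three bytes for each Thai character and one for each ASCII character.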

UTF-8

To meet the requirements of legacy byte-oriented ASCII-based systems, UTF-8 is defined as a variable-width encoding form that preserves ASCII compatibility. It uses one to four 8-bit code units to represent a Unicode character, depending on the code point value. The code points between 0000 and 007F are encoded in a single byte, making any ASCII string a valid UTF-8 string. Beyond the ASCII range of Unicode, some non-ideographic characters between 0080 and 07FF are encoded with two bytes. Then, Indic scripts and CJK ideographs between 0800 and FFFF are encoded with three bytes. Supplementary characters beyond the BMP require four bytes. The Unicode Standard ^[4] provides more details of UTF-8.
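
The four ranges above map directly onto a small encoder. The following is a sketch for illustration, with the function name utf8_encode chosen here; in practice the encoder built into the language or library should be used.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (one to four bytes)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if code_point < 0x80:        # 0000-007F: one byte, identical to ASCII
        return bytes([code_point])
    if code_point < 0x800:       # 0080-07FF: two bytes
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 0800-FFFF: three bytes
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 10000-10FFFF: four bytes
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# THAI CHARACTER KO KAI (U+0E01) takes three bytes.
print(utf8_encode(0x0E01).hex())
```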

UTF-8 is typically the preferred encoding form for the Internet. The ASCII compatibility helps a lot in migration from old systems. UTF-8 also has the advantage of being byte-serialized and friendly to the APIs of C and other programming languages. For example, traditional string comparison by bytes still works with UTF-8: it orders strings by code point value, although this is not a language-aware collation.
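
The byte-wise comparison property can be illustrated with a small experiment. The sample strings are arbitrary; the point is that sorting by UTF-8 bytes gives the same result as sorting by code point, whereas sorting by UTF-16 code units does not once supplementary characters are involved. Neither is a substitute for the Unicode Collation Algorithm.

```python
# One character each: ASCII, Latin-1, Thai, a fullwidth letter near the
# end of the BMP, and a Gothic letter from Plane 1.
words = ["z", "\u00E9", "\u0E01", "\U00010330", "\uFF21", "a"]

by_code_point = sorted(words, key=lambda s: [ord(c) for c in s])
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))
by_utf16_units = sorted(words, key=lambda s: s.encode("utf-16-be"))

print("UTF-8 matches code point order: ", by_utf8_bytes == by_code_point)
print("UTF-16 matches code point order:", by_utf16_units == by_code_point)
```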

In short, UTF-8 is the most widely adopted encoding form of Unicode.

Character Properties

In addition to code points, Unicode also provides a database of character properties called the Unicode Character Database (UCD),^[5] which consists of a set of files describing the following properties:

Name.
General category (classification as letters, numbers, symbols, punctuation, etc.).
Other important general characteristics (white space, dash, ideographic, alphabetic, noncharacter, deprecated, etc.).
Character shaping (bidi category, shaping, mirroring, width, etc.).
Case (upper, lower, title, folding; both simple and full).
Numeric values and types (for digits).
Script and block.
Normalization properties (decompositions, decomposition type, canonical combining class, composition exclusions, etc.).
Age (version of the standard in which the code point was first designated).
Boundaries (grapheme cluster, word, line and sentence).
Standardized variants.

The database is useful for Unicode implementation in general. It is available at the Unicode.org Web site. The Unicode Standard ^[6] provides more details of the database.
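
Many programming environments ship a copy of parts of this database. As an illustration, and not as part of the Unicode Standard itself, the sketch below reads a few of the properties listed above through the unicodedata module of the Python standard library. The helper name describe is chosen for this example.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect a few UCD properties of a single character."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),           # general category
        "bidi": unicodedata.bidirectional(ch),          # bidi category
        "combining": unicodedata.combining(ch),         # canonical combining class
        "width": unicodedata.east_asian_width(ch),      # East Asian width
        "mirrored": unicodedata.mirrored(ch),           # 1 if mirrored in RTL text
        "decomposition": unicodedata.decomposition(ch),
    }


# Latin capital A, Thai ko kai, Hebrew alef, combining acute, e with acute.
for ch in ("A", "\u0E01", "\u05D0", "\u0301", "\u00E9"):
    print("U+%04X" % ord(ch), describe(ch))
```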

Technical Reports

In addition to the code points, encoding forms and character properties, Unicode also provides some technical reports that can serve as implementation guidelines. Some of these reports have been included as annexes to the Unicode standard, and some are published individually as Technical Standards.

In Unicode 4.0, the standard annexes are:

UAX 9: The Bidirectional Algorithm: Specifications for the positioning of characters flowing from right to left, such as Arabic or Hebrew.

UAX 11: East-Asian Width: Specifications of an informative property of Unicode characters that is useful when interoperating with East-Asian Legacy character sets.

UAX 14: Line Breaking Properties: Specification of line breaking properties for Unicode characters as well as a model algorithm for determining line break opportunities.

UAX 15: Unicode Normalization Forms: Specifications for four normalized forms of Unicode text. With these forms, equivalent text (canonical or compatibility) will have identical binary representations. When implementations keep strings in a normalized form, they can be assured that equivalent strings have a unique binary representation.

UAX 24: Script Names: Assignment of script names to all Unicode code points. This information is useful in mechanisms such as regular expressions, where it produces much better results than simple matches on block names.

UAX 29: Text Boundaries: Guidelines for determining default boundaries between certain significant text elements: grapheme clusters ("user characters"), words and sentences.
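
The effect of the normalization forms in UAX 15 can be seen with a small example. This sketch uses the unicodedata module of the Python standard library; the helper name equivalent is chosen here for illustration.

```python
import unicodedata

composed = "\u00E9"       # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT


def equivalent(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after bringing both to the same normalized form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# The two strings look the same when rendered but differ in binary form.
print("raw comparison:       ", composed == decomposed)
print("normalized comparison:", equivalent(composed, decomposed))

# Compatibility forms additionally fold characters such as the "fi" ligature.
print(unicodedata.normalize("NFKC", "\uFB01"))
```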

The individual technical standards are:

UTS 6: A Standard Compression Scheme for Unicode: Specifications of a compression scheme for Unicode and sample implementation.

UTS 10: Unicode Collation Algorithm: Specifications for how to compare two Unicode strings while conforming to the requirements of the Unicode Standard. The UCA also supplies the Default Unicode Collation Element Table (DUCET) as the data specifying the default collation order for all Unicode characters.

UTS 18: Unicode Regular Expression Guidelines: Guidelines on how to adapt regular expression engines to use Unicode.

All Unicode Technical Reports are accessible from the Unicode.org web site.^[7]

Fonts

Font Development Tools

Some FOSS tools for developing fonts are available. Although there are not as many of them as their proprietary counterparts, they are adequate to get the job done, and are continuously being improved. Some interesting examples are:

XmBDFEd ^[8]: Developed by Mark Leisher, XmBDFEd is a Motif-based tool for developing BDF fonts. It allows one to edit the bitmap glyphs of a font, do some simple transformations on the glyphs, transfer information between different fonts, and so on.

FontForge (formerly PfaEdit) ^[9]: Developed by George Williams, FontForge is a tool for developing outline fonts, including PostScript Type1, TrueType, and OpenType. Scanned images of letters can be imported and their outline vectors automatically traced. The splines can be edited, and transformations such as skewing, scaling, rotating and thickening may be applied, among much else. It provides sufficient functionality for editing Type1 and TrueType font properties. OpenType tables can also be edited in its recent versions. One weak point, however, is hinting: it guarantees the quality of Type1 hints, but not of TrueType hints.

TTX/FontTools ^[10]: Just van Rossum's TTX/FontTools is a tool to convert OpenType and TrueType fonts to and from XML. FontTools is a library for manipulating fonts, written in Python. It supports TrueType, OpenType, AFM and, to a certain extent, Type 1 and some Mac-specific formats. It allows one to dump OpenType tables, examine and edit them with an XML or plain text editor, and merge them back into the font.

Font Configuration

There have been several font configuration systems available in GNU/Linux desktops. The most fundamental one is the X Window font system itself. But, due to some recent developments, another font configuration called fontconfig has been developed to serve some specific requirements of modern desktops. These two font configurations will be discussed briefly.

First, however, let us briefly discuss the X Window architecture, in order to understand the font systems. X Window ^[11] is a client-server system. X servers are the agents that control hardware devices, such as video cards, monitors, keyboards, mice or tablets, and pass user input events from the devices to the clients. X clients are GUI application programs that request the X server to draw graphical objects on the screen, and accept user input via the events fed by the X server. Note that with this architecture, the X client and server can be on different machines in the network. In that case, the X server runs on the machine that the user operates, while the X client can be a process running on the same machine or on a remote machine in the network.

In this client-server architecture, fonts are provided on the server side. Thus, installing fonts means configuring X server by installing fonts and registering them to its font path.

However, since the X server is sometimes used to provide thin-client access, where it may run on cheap PCs booted from floppy, across the network, or even from ROM, installing fonts on each X server is not always appropriate. Thus, font service has been delegated to a separate service called the X Font Server (XFS). Another machine in the network can be dedicated to font service so that all X servers can request font information from it. With this structure, an X server may be configured to manage fonts by itself, to use fonts from the font server, or both.

Nevertheless, recent changes in XFree86 have addressed some requirements for managing fonts on the client side. The Xft extension renders anti-aliased glyph images using font information provided by the X client. In its first version, the Xft extension also provided font management functionality to X clients. In Xft2, this functionality was split out into a separate library called fontconfig. fontconfig is a font management system independent of X, which means it can also be applied to non-GUI applications such as printing services. Modern desktops, including KDE 3 and GNOME 2, have adopted fontconfig as their font management system, and have benefited from closer integration in providing an easy font installation process. Moreover, client-side fonts also allow applications to do all glyph manipulations, such as making special effects, while enjoying a consistent appearance on the screen and in printed output.

Running the X client and the X server on separate machines is not standard practice on stand-alone desktops. However, it is important to always keep the client-server split in mind when enabling particular features.

Output Methods

Since the usefulness of XOM is still being questioned, we shall discuss only the output methods already implemented in the two major toolkits: Pango of GTK+ and Qt.

Pango Text Layout Engines

Pango ('Pan' means 'all' in Greek and 'go' means 'language' in Japanese) ^[12] is a multilingual text layout engine designed for quality text typesetting. Although it is the text drawing engine of GTK+, it can also be used outside GTK+ for other purposes, such as printing.^[13] This section will provide localizers with a bird's eye view of Pango. The Pango reference manual ^[14] should be consulted for more detail.

PangoLayout

At a high level, Pango provides the PangoLayout class that takes care of typesetting text in a column of given width, as well as other information necessary for editing, such as cursor positions. Its features may be summarized as follows:

Paragraph Properties

indent
spacing
alignment
justification
word/character wrapping modes
tabs

Text Elements

get lines and their extents
get runs and their extents
character search at (x, y) position
character logical attributes (is line break, is cursor position, etc.)
cursor movements

Text Contents

plain text
markup text

Middle-level Processing

Pango also provides access to some middle-level text processing functions, although most clients in general do not use them directly. To gain a brief understanding of Pango internals, some highlights are discussed here.

There are three major steps for text processing in Pango:^[15]

Itemize. Breaks input text into chunks (items) of consistent direction and shaping engine. This usually means chunks of text of the same language with the same font. Corresponding shaping and language engines are also associated with the items.
Break. Determines possible line, word and character breaks within the given text item. It calls the language engine of the item (or the default engine based on Unicode data if no language engine exists) to analyze the logical attributes of the characters (is-line-break, is-char-break, etc.).
Shape. Converts the text item into glyphs, with proper positioning. It calls the shaping engine of the item (or the default shaping engine that is currently suitable for European languages) to obtain a glyph string that provides the information required to render the glyphs (code point, width, offsets, etc.).
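
To give a feel for the itemize step only, the toy sketch below splits a string into runs of consistent direction using the bidi category from the Unicode Character Database. It is not Pango's algorithm and does not implement the full bidirectional algorithm of UAX 9; characters without a strong direction simply join the run that precedes them. The function names are chosen for this illustration.

```python
import unicodedata
from itertools import groupby


def strong_direction(ch: str):
    """Return "rtl" or "ltr" for strongly directional characters, else None."""
    bidi = unicodedata.bidirectional(ch)
    if bidi in ("R", "AL"):
        return "rtl"
    if bidi == "L":
        return "ltr"
    return None  # spaces, digits, punctuation, etc.


def itemize(text: str) -> list:
    """Split text into (direction, substring) items of consistent direction."""
    resolved = []
    last = "ltr"  # assumed base direction for this toy example
    for ch in text:
        last = strong_direction(ch) or last
        resolved.append(last)
    items, pos = [], 0
    for direction, group in groupby(resolved):
        length = len(list(group))
        items.append((direction, text[pos:pos + length]))
        pos += length
    return items


# Latin text, two Hebrew letters, then Latin text again.
for direction, chunk in itemize("abc \u05D0\u05D1 def"):
    print(direction, repr(chunk))
```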

Pango Engines

Pango engines are implemented in loadable modules that provide entry functions for querying and creating the desired engine. During initialization, Pango builds a list of all installed engines. Then, when it itemizes input text, it searches the list for the language and shaping engines available for the script of each item and creates them for association with the relevant text item.

Pango Language Engines

As discussed above, the Pango language engine is called to determine possible break positions in a text item of a certain language. It provides a method to analyze the logical attributes of every character in the text as listed in Table 3.

Table 3 Pango Logical Attributes
Flag	Description
is_line_break	can break line in front of the character
is_mandatory_break	must break line in front of the character
is_char_break	can break here when doing character wrap
is_white	is white space character
is_cursor_position	cursor can appear in front of character
is_word_start	is first character in a word
is_word_end	is first non-word character after a word
is_sentence_boundary	is inter-sentence space
is_sentence_start	is first character in a sentence
is_sentence_end	is first non-sentence character after a sentence
backspace_deletes_character	backspace deletes one character, not entire cluster (new in Pango 1.3.x)

Pango Shaping Engines

As discussed above, the Pango shaping engine converts the characters of a text item in a certain language into glyphs, and positions them according to the constraints of the script. It provides a method to convert a given text string into a sequence of glyph information (glyph code, width and positioning) and a logical map that maps the glyphs back to character positions in the original text. With all the information provided, the text can be properly rendered on output devices, as well as accessed by the cursor despite the difference between logical and rendering order in some scripts such as Indic, Hebrew and Arabic.

Qt Text Layout

Qt 3 text rendering is different from that of GTK+/Pango. Instead of modularizing, it handles all complex text rendering in a single class, called QComplexText, which is mostly based on the Unicode character database. This is equivalent to the default routines provided by Pango. Due to the incompleteness of the Unicode database, this class sometimes needs extra workarounds to override some values. Developers should examine this class if a script is not rendered properly.

Although relying on the Unicode database appears to be a straightforward method for rendering Unicode texts, this makes the class rigid and error prone. Checking the Qt Web site regularly to find out whether there are bugs in the latest versions is advisable. However, a big change has been planned for Qt 4: the Scribe text layout engine, which is similar to Pango for GTK+.

Input Methods

The need for keyboard maps and input methods has been discussed on page 37. This section discusses how to implement them, beginning with keyboard layouts. Pages 37–38 also mention that XIM is the current basic input method framework for X Window. Of the two major toolkits, only Qt 3 relies on it, while GTK+ 2 defines its own input method framework. Both XIM and GTK+ IM are discussed here.

Keyboard Layouts

The first step to providing text input for a particular language is to prepare the keyboard map. X Window handles the keyboard map using the X Keyboard (XKB) extension. When you start an X server on GNU/Linux, a virtual terminal is attached to it in raw mode, so that keyboard events are sent from the kernel without any translation.

The raw scan code of the key is then translated into keycode according to the keyboard model. For XFree86 on PC, the keycode map is usually "xfree86" as kept under /etc/X11/xkb/keycodes directory. The keycodes just represent the key positions in symbolic form, for further referencing.

The keycode is then translated into a keyboard symbol (keysym) according to the specified layout, such as qwerty, dvorak, or a layout for a specific language, chosen from the data under /etc/X11/xkb/symbols directory. A keysym does not represent a character yet. It requires an input method to translate sequences of key events into characters, which will be described later. For XFree86, all of the above setup is done via the setxkbmap command. (Setting up values in /etc/X11/XF86Config means setting parameters for setxkbmap at initial X server startup.) There are many ways of describing the configuration, as explained in Ivan Pascal's XKB explanation.^[16] The default method for XFree86 4.x is the "xfree86" rule (XKB rules are kept under /etc/X11/xkb/rules ), with additional parameters:

model - pc104, pc105, microsoft, microsoftplus, ...
layout - us, dk, ja, lo, th, ...
(For XFree86 4.3.0+, up to 4 groups can be provided as part of layout definition)
variant - nodeadkeys (mostly for Latin-script layouts)
option - group switching key, swap caps, LED indicator, etc.
(See /etc/X11/xkb/rules/xfree86 for all available options.)

For example:

$ setxkbmap us,th -option grp:alt_shift_toggle,grp_led:scroll

This sets the layout using US symbols as the first group and Thai symbols as the second group. The Alt-Shift combination toggles between the two groups. The Scroll Lock LED is the group indicator: it is on when the current group is not the first group, that is, on for Thai and off for US.

You can even mix more than two languages:

$ setxkbmap us,th,lo -option grp:alt_shift_toggle,grp_led:scroll

This loads a trilingual layout. Alt-Shift rotates among the three groups; that is, Alt-RightShift chooses the next group and Alt-LeftShift chooses the previous group. The Scroll Lock LED is on when the Thai or Lao group is active.

The arguments for setxkbmap can be specified in /etc/X11/XF86Config for initialization on X server startup by describing the "InputDevice" section for keyboard, for example:

Section "InputDevice"
    Identifier "Generic Keyboard"
    Driver "keyboard"
    Option "CoreKeyboard"
    Option "XkbRules" "xfree86"
    Option "XkbModel" "microsoftplus"
    Option "XkbLayout" "us,th_tis"
    Option "XkbOptions" "grp:alt_shift_toggle,lv3:switch,grp_led:scroll"
EndSection

Notice the last four option lines. They tell setxkbmap to use the "xfree86" rule, with the "microsoftplus" model (with Internet keys), a mixed layout of US and Thai TIS-820.2538, and some more options for the group toggle key and LED indicator. The "lv3:switch" option is only for keyboard layouts that require a third level of shift (that is, one more than the normal shift keys). In this case, for "th_tis" in XFree86 4.4.0, this option sets RightCtrl as the third-level shift key.

Providing a Keyboard Map

If the keyboard map for a language is not available, one needs to prepare a new one. In XKB terms, one needs to prepare a symbols map, associating keysyms to the available keycodes.

The quickest way to start is to read the available symbols files under the /etc/X11/xkb/symbols directory. In particular, the files used by default rules of XFree86 4.3.0 are under the pc/ subdirectory. Here, only one group is defined per file, unlike the old files in its parent directory, in which groups are pre-combined. This is because XFree86 4.3.0 provides a flexible method for mixing keyboard layouts.

Therefore, unless you need to support the old versions of XFree86, all you need to do is to prepare a single-group symbols file under the pc/ subdirectory.

Here is an excerpt from the th_tis symbols file:

partial default alphanumeric_keys
xkb_symbols "basic" {
    name[Group1]= "Thai (TIS-820.2538)";
    // The Thai layout defines a second keyboard group and changes
    // the behavior of a few modifier keys.
    key <TLDE> { [ 0x1000e4f, 0x1000e5b ] };
    key <AE01> { [ Thai_baht, Thai_lakkhangyao] };
    key <AE02> { [ slash, Thai_leknung ] };
    key <AE03> { [ minus, Thai_leksong ] };
    key <AE04> { [ Thai_phosamphao, Thai_leksam ] };
    ...
};

Each element in the xkb_symbols data, except the first one, associates keysyms with a keycode for the unshifted and shifted versions, respectively. Here, some keysyms are predefined in Xlib. You can find the complete list in <X11/keysymdef.h>. If the keysyms for a language are not defined there, Unicode keysyms can be used, as shown in the <TLDE> key entry. (In fact, this may be a more effective way of adding new keysyms.) The Unicode value must be prefixed with "0x100" to form the keysym for a single character; for example, U+0E4F becomes 0x1000e4f.
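
The Unicode keysym rule amounts to adding 1000000 (hexadecimal) to the code point. The sketch below shows the arithmetic and formats an entry in the style of the excerpt above. The helper names unicode_keysym and symbols_entry are hypothetical and are not part of any XKB tool.

```python
def unicode_keysym(ch: str) -> int:
    """Return the Unicode keysym of a single character (0x1000000 + code point)."""
    if len(ch) != 1:
        raise ValueError("expected a single character")
    return 0x1000000 + ord(ch)


def symbols_entry(keycode: str, unshifted: str, shifted: str) -> str:
    """Format one key line for a symbols file (hypothetical helper)."""
    return "    key <%s> { [ 0x%x, 0x%x ] };" % (
        keycode, unicode_keysym(unshifted), unicode_keysym(shifted))


# U+0E4F and U+0E5B are the two characters on the <TLDE> key above.
print(symbols_entry("TLDE", "\u0E4F", "\u0E5B"))
```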

For more details of the file format, see Ivan Pascal's XKB explanation.^[17] When finished, the symbols.dir file should be regenerated so that the symbols file is listed:

# cd /etc/X11/xkb/symbols
# xkbcomp -lhlpR '*' -o ../symbols.dir

Then, the new layout may be tested as described in the previous section.

Additionally, entries may be added to /etc/X11/xkb/rules/xfree86.lst so that some GUI keyboard configuration tools can see the layout.

Once the new keyboard map is completed, it may also be included in XFree86 source where the data for XKB are kept under the xc/programs/xkbcomp subdirectory.

XIM - X Input Method

For some languages, such as English, text input is as straightforward as a one-to-one mapping from keysyms to characters. For European languages, this is a little more complicated because of accents. But for Chinese, Japanese and Korean (CJK), one-to-one mapping is impossible. They require a series of keystrokes to be interpreted to obtain each character.

X Input Method (XIM) is a locale-based framework designed to address the requirements of text input for any language. It is a separate service for handling input events as requested by X clients. Any text entry in X clients is represented by X Input Context (XIC). All the keyboard events will be propagated to the XIM, which determines the appropriate action for the events based on the current state of the XIC, and passes back the resulting characters.

Internally, a process common to every XIM is to translate the keyboard scan code into a keycode and then into a keysym by calling XKB, as described in the previous sections. The subsequent processes that convert keysyms into characters differ from locale to locale.

XIM is usually implemented using the client-server model. More detailed discussion of XIM implementation is beyond the scope of this document. Please see Section 13.5 of the Xlib document ^[18] and the XIM protocol ^[19] for more information.

In general, users can choose their favourite XIM server by setting the XMODIFIERS environment variable, like this:

$ export LANG=th_TH.TIS-620
$ export XMODIFIERS="@im=Strict"

This specifies the Strict input method for the Thai locale.

GTK+ IM

As a cross-platform toolkit, GTK+ 2 defines its own framework using pure GTK+ APIs, instead of relying on the input methods of each operating system. This provides a high level of abstraction, making input method development a lot easier than writing XIM servers. GTK+ can still use the existing XIM servers through the imxim bridging module. In addition, the input methods developed become immediately available to GTK+ on all platforms it supports, including XFree86, Windows, and the GNU/Linux framebuffer console. The only drawback is that the input methods cannot be shared with non-GTK+ applications.

Client Side

A normal GTK+-based text entry widget will provide an "Input Methods" context menu that can be opened by right clicking within the text area. This menu provides the list of all installed GTK+ IM modules, which the user can choose from. The menu is initialized by querying all installed modules for the engines they provide.

From the client's point of view, each text entry is represented by an IM context, which communicates with the IM module after every key press event by calling a key filter function provided by the module. This allows the IM to intercept the key presses and translate them into characters. Non-character keys, such as function keys or control keys, are not usually intercepted. This allows the client to handle special keys, such as shortcuts.

There are also interfaces for the other direction. The IM can also call the client for some actions by emitting GLib signals, for which the handlers may be provided by the client by connecting callbacks to the signals:

"preedit_changed": Uncommitted (pre-edit) string is changed. The client may update the display, but not the input buffer, to let the user see the keystrokes.

"commit": Some characters are committed from the IM. The committed string is also passed so that the client can take it into its input buffer.

"retrieve_surrounding": The IM wants to retrieve some text around the cursor.

"delete_surrounding": The IM wants to delete the text around the cursor. The client should delete the text portion around the cursor as requested.

IM Modules

GTK+ input methods are implemented using loadable modules that provide entry functions for querying and creating the desired IM context. These are used as interface with the "Input Methods" context menu in text entry areas.

The IM module defines a new IM context class or classes and provides filter functions to be called by the client upon key press events. It can determine the proper action for the key and return TRUE if it means to intercept the event, or FALSE to pass the event back to the client.

Some IMs (e.g., CJK and European) perform a stateful conversion, incrementally matching the input string against predefined patterns until a unique pattern is matched, and only then committing the converted string. During the partial matching, the IM emits the "preedit_changed" signal to the client for every change, so that the client can update the pre-edit string on the display. Finally, to commit characters, the IM emits the "commit" signal, with the converted string as the argument. Some IMs (e.g., Thai) are context-sensitive: they need to retrieve text around the cursor to determine the appropriate action. This can be done through the "retrieve_surrounding" signal.

In addition, the IM may request the deletion of some text from the client's input buffer, as required by the advanced Thai IM. This is also used to correct illegal sequences. It can be done via the "delete_surrounding" signal.
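
The interplay between the key filter, the pre-edit string and the commit can be sketched with a toy dead-key input method. This is plain Python written for illustration, not the GTK+ API: the class ToyComposeIM, its methods and the compose table are all hypothetical, and ordinary callbacks stand in for the "preedit_changed" and "commit" signals.

```python
class ToyComposeIM:
    """Toy stateful IM: an accent key followed by a letter yields one character."""

    # Hypothetical compose table, for illustration only.
    SEQUENCES = {"'e": "\u00E9", "'a": "\u00E1", "`e": "\u00E8"}

    def __init__(self, on_preedit_changed, on_commit):
        self.preedit = ""
        self.on_preedit_changed = on_preedit_changed  # stands in for "preedit_changed"
        self.on_commit = on_commit                    # stands in for "commit"

    def _set_preedit(self, value):
        if value != self.preedit:
            self.preedit = value
            self.on_preedit_changed(value)

    def filter_keypress(self, key):
        """Return True if the IM intercepts the key, False to pass it back."""
        candidate = self.preedit + key
        if candidate in self.SEQUENCES:               # full match: commit
            self._set_preedit("")
            self.on_commit(self.SEQUENCES[candidate])
            return True
        if any(seq.startswith(candidate) for seq in self.SEQUENCES):
            self._set_preedit(candidate)              # partial match: keep waiting
            return True
        if self.preedit:                              # broken sequence: flush it
            pending = self.preedit
            self._set_preedit("")
            self.on_commit(pending)
        return False


def type_keys(keys):
    """Simulate a client text entry; returns (final text, pre-edit history)."""
    buffer, preedits = [], []
    im = ToyComposeIM(preedits.append, buffer.append)
    for key in keys:
        if not im.filter_keypress(key):
            buffer.append(key)  # the client handles keys the IM did not intercept
    return "".join(buffer), preedits


print(type_keys("caf'e"))
```

The client only ever changes its input buffer in response to a commit or to a key that the IM passed back, which mirrors the division of labour described above.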

Locales

As mentioned earlier, the GNU C library is internationalized according to POSIX and ISO/IEC 14652. Both kinds of locale are discussed in this section.

Locale Naming

A locale is described by its language, country and character set. The naming convention as given in OpenI18N guideline ^[20] is:

lang_territory.codeset[@modifiers]

where

lang is a two-letter language code defined in ISO 639:1988. Three-letter codes in ISO 639-2 are also allowed in the absence of the two-letter version. The ISO 639-2 Registration Authority at Library of Congress ^[21] has a complete list of language codes.
territory is a two-letter country code defined in ISO 3166-1:1997. The list of two-letter country codes is available online from ISO 3166 Maintenance agency.^[22]
codeset describes the character set used in the locale.
modifiers add more information for the locale by setting options (turn on flags or use an equal sign to set values). Options are separated by commas. This part is optional and implementation-dependent. Different I18N frameworks provide different options.

For example:

fr_CA.ISO-8859-1 = French language in Canada using the ISO-8859-1 character set
th_TH.TIS-620 = Thai language in Thailand using the TIS-620 encoding

If territory or codeset is omitted, default values are usually resolved by means of locale aliasing.
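
The naming convention can be sketched as a small parser. The regular expression below is an assumption made for this illustration and is deliberately simple; it does not handle the special names "C" and "POSIX" or locale aliases, and real implementations accept more variations.

```python
import re

# lang_territory.codeset[@modifiers], with territory and codeset optional.
LOCALE_NAME = re.compile(
    r"^(?P<lang>[a-z]{2,3})"
    r"(?:_(?P<territory>[A-Z]{2}))?"
    r"(?:\.(?P<codeset>[A-Za-z0-9_-]+))?"
    r"(?:@(?P<modifiers>[A-Za-z0-9=,_-]+))?$"
)


def parse_locale_name(name: str) -> dict:
    """Split a locale name into lang, territory, codeset and modifiers."""
    match = LOCALE_NAME.match(name)
    if match is None:
        raise ValueError("unrecognized locale name: %r" % name)
    parts = match.groupdict()
    # Modifier options are separated by commas.
    parts["modifiers"] = parts["modifiers"].split(",") if parts["modifiers"] else []
    return parts


for name in ("fr_CA.ISO-8859-1", "th_TH.TIS-620", "th"):
    print(name, "->", parse_locale_name(name))
```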

Note that for the GNU/Linux desktop, the modifiers part is not supported yet. Locale modifiers for X Window are to be set through the XMODIFIERS environment variable instead.

Character Sets

A character set is part of the locale definition. It defines all the characters in the set as well as how they are encoded for information interchange. In the GNU C library (glibc), locales are described in terms of Unicode.

A new character set is described as a Unicode subset, with each element associated with the byte string that encodes it in the target character set. For example, the UTF-8 encoding is described like this:

...
<U0041> /x41         LATIN CAPITAL LETTER A
<U0042> /x42         LATIN CAPITAL LETTER B
<U0043> /x43         LATIN CAPITAL LETTER C
...
<U0E01> /xe0/xb8/x81 THAI CHARACTER KO KAI
<U0E02> /xe0/xb8/x82 THAI CHARACTER KHO KHAI
<U0E03> /xe0/xb8/x83 THAI CHARACTER KHO KHUAT
...

The first column is the Unicode value, the second is the encoded byte string, and the rest is a comment.

As another example, the TIS-620 encoding for Thai is a simple 8-bit single-byte encoding. The first half of the code table is the same as ASCII, and the second half encodes the Thai characters starting at 0xA1. Therefore, the character map looks like:

...
<U0041> /x41         LATIN CAPITAL LETTER A
<U0042> /x42         LATIN CAPITAL LETTER B
<U0043> /x43         LATIN CAPITAL LETTER C
...
<U0E01> /xa1         THAI CHARACTER KO KAI
<U0E02> /xa2         THAI CHARACTER KHO KHAI
<U0E03> /xa3         THAI CHARACTER KHO KHUAT
...
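The two character maps above can be reproduced in code. The following sketch maps a Unicode code point to its TIS-620 byte and to its UTF-8 byte string; the function names are made up for this illustration, and real programs would normally rely on iconv or the C library rather than hand-written tables.

```c
#include <stddef.h>

/* Return the TIS-620 byte for a code point, or -1 if TIS-620 cannot
 * represent it. Thai characters keep their order, offset to 0xA1. */
int tis620_encode(unsigned long cp)
{
    if (cp < 0x80)
        return (int)cp;                    /* ASCII half is unchanged */
    if ((cp >= 0x0E01 && cp <= 0x0E3A) || (cp >= 0x0E3F && cp <= 0x0E5B))
        return (int)(cp - 0x0E00 + 0xA0);  /* U+0E01 becomes 0xA1 */
    return -1;
}

/* Write the UTF-8 form of a code point into out (at least 4 bytes).
 * Returns the number of bytes written, or 0 for an invalid code point. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For U+0E01 (THAI CHARACTER KO KAI), tis620_encode() yields 0xA1 and utf8_encode() yields the three bytes 0xE0 0xB8 0x81, matching the charmap entries shown above.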

POSIX Locales

According to POSIX, standard C library functions are internationalized according to the following categories:

Category	Description
LC_CTYPE	character classification
LC_COLLATE	string collation
LC_TIME	date and time format
LC_NUMERIC	number format
LC_MONETARY	currency format
LC_MESSAGES	messages in locale language

Setting Locale

A C application can set the current locale with the setlocale() function (declared in <locale.h>). The first argument indicates the category to be set; alternatively, LC_ALL is used to set all categories. The second argument is the name of the locale to be chosen; alternatively, an empty string ("") is used to rely on the system environment settings.

Therefore, the program initialization of a typical internationalized C program may appear as follows:

#include <locale.h>
...
/* setlocale() returns the name of the locale now in effect, or NULL on failure */
const char *current_locale;
current_locale = setlocale (LC_ALL, "");

The system environment is then looked up to determine the appropriate locale as follows:

If LC_ALL is defined, it shall be used as the locale name.
Otherwise, if the variable for a category (LC_CTYPE, LC_COLLATE, LC_MESSAGES and so on) is defined, it shall be used as the locale name for that category.
For categories that are still undefined after the above checks, if LANG is defined, it is used as the locale name.
For categories that are still undefined after the above checks, the "C" (or "POSIX") locale shall be used.

The "C" or "POSIX" locale is a dummy locale in which all behaviours are C defaults (e.g. ASCII sort for LC_COLLATE).
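The lookup order above can be sketched as a small function. This is an illustration only, not how glibc implements it: pick_locale() and locale_from_environment() are hypothetical names, and an empty value is treated the same as an undefined variable, which is how POSIX describes the lookup.

```c
#include <stdlib.h>

/* A variable counts as defined only if it exists and is not empty. */
static int is_set(const char *value)
{
    return value != NULL && value[0] != '\0';
}

/* Apply the precedence LC_ALL, then the category variable, then LANG,
 * then the "C" locale. Each argument may be NULL. */
const char *pick_locale(const char *lc_all, const char *lc_category,
                        const char *lang)
{
    if (is_set(lc_all))
        return lc_all;
    if (is_set(lc_category))
        return lc_category;
    if (is_set(lang))
        return lang;
    return "C";
}

/* Resolve one category, e.g. "LC_TIME", from the process environment. */
const char *locale_from_environment(const char *category_var)
{
    return pick_locale(getenv("LC_ALL"), getenv(category_var), getenv("LANG"));
}
```

With LC_ALL unset, LC_TIME set to th_TH.UTF-8 and LANG set to en_US.UTF-8, locale_from_environment("LC_TIME") returns th_TH.UTF-8, while locale_from_environment("LC_NUMERIC") falls back to en_US.UTF-8.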

LC_CTYPE

LC_CTYPE defines character classification for functions declared in <ctype.h>:

iscntl()
isspace()
isalpha()
islower()
toupper()
isgraph()
ispunct()
isdigit()
isupper()
isprint()
isalnum()
isxdigit()
tolower()

Since glibc is Unicode-based, and all character sets are defined as Unicode subsets, it makes no sense to redefine character properties in each locale. Typically, the LC_CTYPE category in most locale definitions refers to the default definition (called "i18n").
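As a small illustration of how the <ctype.h> functions are used, the sketch below counts the letters in a byte string according to the current LC_CTYPE setting. The function name count_letters() is hypothetical. Note that these functions classify one byte at a time, so the approach suits single-byte encodings such as TIS-620; it is not adequate for multi-byte encodings such as UTF-8, where the wide-character counterparts in <wctype.h> are needed.

```c
#include <ctype.h>
#include <stddef.h>

/* Count the bytes that the current LC_CTYPE locale classifies as letters. */
size_t count_letters(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        /* The cast is required: passing a negative char is undefined. */
        if (isalpha((unsigned char)*s))
            n++;
    }
    return n;
}
```

In the "C" locale only the ASCII letters are counted, so count_letters("ab1 c") returns 3. After a successful setlocale(LC_CTYPE, "") in a single-byte locale, the letters of that locale's character set are counted as well.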

LC_COLLATE

C functions that are affected by LC_COLLATE are strcoll() and strxfrm().

strcoll() compares two strings in a similar manner to strcmp() but in a locale-dependent way. Note that the behaviour of strcmp() never changes under different locales.
strxfrm() transforms a string into a form that can be compared using the plain strcmp() to get the same result as when the original strings are compared directly with strcoll().
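A minimal sketch of both functions follows: strcoll() as a qsort() comparator, and strxfrm() for building a sort key that can be stored and later compared with strcmp(). The helper names sort_by_locale() and make_sort_key() are invented for this example; the program is expected to have called setlocale() beforehand so that LC_COLLATE is in effect.

```c
#include <stdlib.h>
#include <string.h>

/* qsort() comparator: order two strings by the current LC_COLLATE rules. */
static int compare_by_locale(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcoll(*sa, *sb);
}

/* Sort an array of strings in the collation order of the current locale. */
void sort_by_locale(const char **strings, size_t count)
{
    qsort(strings, count, sizeof *strings, compare_by_locale);
}

/* Build a key such that strcmp() on two keys orders them like strcoll()
 * on the original strings. Returns NULL if memory runs out; the caller
 * frees the result. */
char *make_sort_key(const char *s)
{
    /* With a zero size, strxfrm() only reports the length it needs. */
    size_t needed = strxfrm(NULL, s, 0) + 1;
    char *key = malloc(needed);
    if (key != NULL)
        strxfrm(key, s, needed);
    return key;
}
```

Precomputed keys are mainly useful when the same strings are compared many times, for instance when sorting a large list, since the transformation cost is then paid only once per string.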

The LC_COLLATE specification is the most complicated of all locale categories. There is a separate standard for collating Unicode strings, called ISO/IEC 14651 International String Ordering.^[23] The glibc default locale definition is based on this standard. Locale developers may consider investigating the Common Tailorable Template (CTT) defined there before beginning their own locale definition.

In the CTT, collation is done through multiple passes. Character weights are defined in multiple levels (four levels for ISO/IEC 14651). Some characters can be ignored (by using "IGNORE" as the weight) in the first passes and be brought into consideration in later passes for finer adjustment. See the ISO/IEC 14651 document for more details.

LC_TIME

LC_TIME allows localization of date/time strings formatted by the strftime() function. Days of the week and months can be translated into the locale language, and an appropriate date format can be defined.
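As an illustration, the sketch below formats a date with the weekday and month names spelled out. The function name format_long_date() and the chosen format string are assumptions for this example. The names come from whatever LC_TIME locale is in effect, so a program would normally call setlocale(LC_TIME, "") or setlocale(LC_ALL, "") first.

```c
#include <stddef.h>
#include <time.h>

/* Format a date as "weekday day month year" using the names defined by
 * the current LC_TIME locale. Returns the number of characters written,
 * or 0 if the buffer is too small. */
size_t format_long_date(char *buf, size_t size, const struct tm *when)
{
    /* %A = full weekday name, %B = full month name */
    return strftime(buf, size, "%A %d %B %Y", when);
}
```

In the "C" locale this produces English names; under a Thai locale, for example, the same call would produce the Thai weekday and month names defined in that locale's LC_TIME section. The conversion %x can be used instead to get the locale's own preferred date representation.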

LC_NUMERIC & LC_MONETARY

Each culture uses different conventions for writing numbers, namely, the decimal point, the thousand separator and grouping. This is covered by LC_NUMERIC.

LC_MONETARY defines currency symbols used in the locale as per ISO 4217, as well as the format in which monetary amounts are written. A single function, localeconv() in <locale.h>, is defined for retrieving information from both locale categories. Glibc also provides the strfmon() function in <monetary.h> for formatting monetary amounts as per LC_MONETARY, but this function is not part of standard C.
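The following sketch shows how localeconv() might be queried for one LC_NUMERIC item and one LC_MONETARY item. The two wrapper function names are hypothetical; the struct lconv members decimal_point and int_curr_symbol are part of standard C.

```c
#include <locale.h>

/* Return the decimal point character of the current LC_NUMERIC locale. */
char decimal_point_char(void)
{
    const struct lconv *conv = localeconv();
    return conv->decimal_point[0];
}

/* Return the international currency symbol of the current LC_MONETARY
 * locale, or an empty string if the locale defines none. */
const char *international_currency_symbol(void)
{
    return localeconv()->int_curr_symbol;
}
```

In the "C" locale the decimal point is "." and the currency symbol is empty. The structure returned by localeconv() may be overwritten by later calls to localeconv() or setlocale(), so values that are needed for longer should be copied.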

LC_MESSAGES

LC_MESSAGES is mostly used for message translation purposes. The only use in POSIX locale is the description of a yes/no answer for the locale.

ISO/IEC 14652

The ISO/IEC 14652 Specification method for cultural conventions is basically an extended POSIX locale specification. In addition to the details in each of the six categories, it introduces six more:

Category	Description
LC_PAPER	paper size
LC_NAME	personal name format
LC_ADDRESS	address format
LC_TELEPHONE	telephone number
LC_MEASUREMENT	measurement units
LC_VERSION	locale version

All of the above categories are already supported by glibc. C applications can retrieve all locale information using the nl_langinfo() function.
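A minimal sketch of nl_langinfo() follows, using two items defined by POSIX in <langinfo.h>: CODESET, the character set name of the current locale, and YESEXPR, the regular expression for an affirmative answer from LC_MESSAGES. The wrapper function names are made up for this example. Items for the newer categories such as LC_PAPER are glibc extensions and are not shown here.

```c
#include <langinfo.h>

/* Name of the character set used by the current locale. */
const char *current_codeset(void)
{
    return nl_langinfo(CODESET);
}

/* Regular expression that matches a "yes" answer in the current locale. */
const char *yes_pattern(void)
{
    return nl_langinfo(YESEXPR);
}
```

In the "C" locale, yes_pattern() returns an expression that matches answers beginning with "y" or "Y"; the codeset name reported for the "C" locale varies between C libraries.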

Building Locales

To build a locale, a locale definition file describing data for ISO/IEC 14652 locale categories must be prepared. (See the standard document for the file format.) In addition, when defining a new character set, a charmap file must be created for it; this gives every character a symbolic name and describes encoded byte strings.

In general, glibc uses UCS symbolic names (<Uxxxx>) in locale definition, for convenience in generating locale data for any charmap. The actual locale data to be used by C programs is in binary form. The locale definition must be compiled with the localedef command, which accepts arguments like this:

localedef [-f <charmap>] [-i <input>] <name>

For example, to build th_TH locale from locale definition file th_TH using TIS-620 charmap:

# localedef -f TIS-620 -i th_TH th_TH

The charmap file may be installed in the /usr/share/i18n/charmaps directory, and the locale definition file in the /usr/share/i18n/locales directory, for further reference.

The locale command can be used with the "-a" option to list all installed locales and the "-m" option to list supported charmaps. Issuing the command without arguments shows the locale categories selected by the environment settings.

Translation

The translation framework most commonly used in FOSS is GNU gettext, although some cross-platform FOSS, such as AbiWord, Mozilla and OpenOffice.org, use their own frameworks as a result of their cross-platform abstractions. In this section, GNU gettext, which covers more than 90 percent of GNU/Linux desktops, is discussed briefly. The concepts discussed here, however, apply to other frameworks.

Messages in program source code are put in a short macro that calls a gettext function to retrieve the translated version. At program initialization, the hashed message database corresponding to LC_MESSAGES locale category is loaded. Then, all messages covered by the macros are translated by quick lookup during program execution. Therefore, the task of translation is to build the message translation database for a particular language and get it installed in an appropriate place for the locale. With that preparation, the gettext programs are automatically translated as per locale setting without having to touch the source code.
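The program side of this mechanism might look like the sketch below. The text domain "hello-example", the catalog directory /usr/share/locale and the function names are assumptions for illustration; the macro name _ is a widespread convention rather than something gettext requires. On glibc systems the gettext functions are part of the C library; other systems may need to link against a separate libintl.

```c
#include <libintl.h>
#include <locale.h>

/* Conventional short macro that marks a message and looks it up. */
#define _(msgid) gettext(msgid)

/* Select the locale from the environment and the message catalog to use.
 * "hello-example" and the directory are hypothetical values. */
void init_translation(void)
{
    setlocale(LC_ALL, "");
    bindtextdomain("hello-example", "/usr/share/locale");
    textdomain("hello-example");
}

/* Return the greeting in the language selected by LC_MESSAGES. */
const char *greeting(void)
{
    return _("Hello, world");
}
```

If no catalog is installed for the current locale, gettext() simply returns the original string, so the program keeps working untranslated.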

GNU gettext also provides tools for creating the message database. Two kinds of files are involved in the process:

PO (Portable Object) file: This is a file in human-readable form for the translators to work with. It is so named because of its plain-text nature, which makes it portable to other platforms.
MO (Machine Object) file: This is a hashed database for machines to read. It is the final format to be loaded by the gettext program. There are many translation frameworks in commercial Unices, and MO files are not compatible across them. One may also find some GMO files as immediate output from GNU gettext tools. They are MO files containing some GNU gettext enhanced features.

Important GNU gettext tools will be discussed by describing the summarized steps of translation from scratch (See Figure 3):

Figure 3: GNU gettext Working process

Extract messages with the xgettext utility. What you get is the "package.pot" file as a template for the PO file.
Create the PO file for your language from the template, either by copying it to "xx.po" (where xx is your locale language) and filling its header information with your information, or by using the msginit utility.
Translate the messages by editing the PO file with your favourite text editor. Some specialized editors for PO files, such as kbabel and gtranslator, are also available.
Convert the PO file into MO file using the msgfmt utility.
Install the MO file under the LC_MESSAGES directory of your locale.
When the program develops, new strings are introduced. You need not begin from scratch again. Rather, you extract the new PO template with the xgettext utility as usual, and then merge the template with your current PO with the msgmerge utility. Then, you can continue by translating the new messages.

GNOME intltool

GNU/Linux desktops have more things to translate than messages in C/C++ source code. The system menu entries, lists of sounds on events, for example, also contain messages, mostly in XML formats that are not supported by GNU gettext. One may dig into these individual files to translate the messages, but this is very inconvenient to maintain and is also error prone.

KDE has a strong policy for translation. PO files for all KDE core applications are extracted into a single directory for each language, so that translators can work in a single place to translate the desktop without a copy of the source code. These PO files already include the messages outside the C++ sources mentioned above. In practice, however, one occasionally needs to look into the sources to verify the exact meaning of some messages, especially error messages.

GNOME comes up with a different approach. The PO files are still placed in the source under the "po" subdirectory as usual. But instead of directly using xgettext to extract messages from the source, the GNOME project has developed an automatic tool called intltool. This tool extracts messages from the XML files into the PO template along with the usual things xgettext does, and merges the translations back as well. As a result, despite the heterogeneous translation system, what translators need to do is still edit a single PO file for a particular language.

The use of intltool is easy. To generate a PO template, change the directory to the "po" subdirectory and run:

$ intltool-update --pot

To generate a new PO file and merge with existing translation:

$ intltool-update xx

where xx is the language code. That is all that is required. Editing the PO file as usual can then begin.

When PO editing is complete, the usual installation process of typical GNOME sources will automatically call the appropriate intltool command to merge the translations back into those XML files before installing. Note that, with this automated system, one should not directly call the xgettext and msgmerge commands any more.

The following sites and documents provide more information on KDE and GNOME translation:

KDE Internationalization Home ( http://i18n.kde.org )
- The KDE Translation HOWTO ( http://i18n.kde.org/translation-howto/ )
The GNOME Translation Project ( http://developer.gnome.org/projects/gtp/ )
- Localizing GNOME Applications ( http://developer.gnome.org/projects/gtp/l10n-guide/ )
- How to Use GNOME CVS as a Translator ( http://developer.gnome.org/doc/tutorials/gnome-i18n/translator.html )

PO Editors

A PO file is a plain text file, so it can be edited with any text editor. But, as stated earlier, translation is a labour-intensive task, and it is worth considering some convenient tools to speed up the job.

Normally, the editor needs to be able to edit UTF-8, as both KDE and GNOME now use it as their standard text encoding. Beyond that, the following tools offer many other features.

KBabel

Part of the KDE Software Development Kit, KBabel is an advanced and easy-to-use PO-files editor with full navigation and editing capabilities, syntax checking and statistics. The editor separates translated, un-translated and fuzzy messages so that it is easy to find and edit the unfinished parts.

KBabel also provides CatalogManager, which allows keeping track of many PO-files at once, and KBabelDict for keeping the glossary, which is important for translation consistency, especially among team members from different backgrounds.

Gtranslator

Gtranslator is the PO-file editor for the GNOME desktop. It is very similar to KBabel in core functionality.

Gtranslator also supports auto-translation, where translations are learnt and transferred into its memory, and can be applied in later translations using a hot key.

Footnotes

[1] UCS is the acronym for Universal multi-octet coded Character Set.
[2] UTF is the acronym for Unicode (UCS) Transformation Format.
[3] The Unicode Consortium. The Unicode Standard, Version 4.0, pp. 76-77.
[4] The Unicode Consortium. The Unicode Standard, Version 4.0, pp. 77-78.
[5] Ibid., pp. 95-104.
[6] Unicode.org, 'Unicode Technical Reports'; available from http://www.unicode.org/reports/index.html .
[7] Unicode.org, 'Unicode Technical Reports'; available from http://www.unicode.org/reports/index.html .
[8] Leisher, M., 'The XmBDFEd Font Editor'; available from crl.nmsu.edu/~mleisher/xmbdfed.html
[9] Williams, G., 'PfaEdit'; available from http://pfaedit.sourceforge.net .
[10] van Rossum, J., 'TTX/FontTools'; available from http://fonttools.sourceforge.net/ .
[11] Note the difference with Microsoft's "Windows" trademark. X Window is without 's'.
[12] Taylor, O., 'Pango'; available from http://www.pango.org .
[13] Taylor, O., 'Pango - Design'; available from http://www.pango.org/design.shtml .
[14] GNOME Development Site, 'Pango Reference Manual'; available from http://developer.gnome.org/doc/API/2.0/pango/ .
[15] This is a very rough classification. Obviously, there are further steps, such as line breaking, alignment and justification. They need not be discussed here, as they go beyond localization.
[16] Pascal, I., X Keyboard Extension; available from http://pascal.tsu.ru/en/xkb/ .
[17] Pascal, I., X Keyboard Extension; available from http://pascal.tsu.ru/en/xkb/ .
[18] Gettys, J., Scheifler, R.W., Xlib-C Language X Interface, X Consortium Standard, X Version 11 Release 6.4.
[19] Narita, M., Hiura, H., The Input Method Protocol Version 1.0. X Consortium Standard, X Version 11 Release 6.4.
[20] OpenI18N.org. OpenI18N Locale Name Guideline, Version 1.1 - 2003-03-11; available from http://www.openi18n.org/docs/text/LocNameGuide-V11.txt
[21] Library of Congress, ISO 639-2 Registration Authority; available from http://lcweb.loc.gov/standards/iso639-2.
[22] ISO, ISO 3166 Maintenance agency (ISO 3166/MA) - ISO's focal point for country codes; available from http://www.iso.org/iso/en/prodsservices/iso3166ma/index.html.
[23] ISO/IEC 14651 International String Ordering. ISO/IEC, ISO/IEC JTC1/SC22/WG20 - Internationalization; available from http://anubis.dkuug.dk/jtc1/sc22/wg20 .
