C++ Programming/Software Internationalization/Text Encoding

Text encoding

Text is represented digitally by means of a character encoding scheme, which pairs each character from a given character set (sometimes referred to as a code page) with something else, such as a sequence of natural numbers, octets or electrical pulses, in order to facilitate the use of its digital representation.

An easy-to-understand example is Morse code, which encodes letters of the Latin alphabet as series of long and short depressions of a telegraph key; this is similar to how ASCII encodes letters, numerals, and other symbols as integers.

Text and data

Probably the most important use for a byte is holding a character code. Characters typed at the keyboard, displayed on the screen, and printed on the printer all have numeric values. To allow it to communicate with the rest of the world, the IBM PC uses a variant of the ASCII character set. There are 128 defined codes in the ASCII character set. IBM uses the remaining 128 possible values for extended character codes including European characters, graphic symbols, Greek letters, and math symbols.
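The idea that every character is stored as a number can be made concrete with a small sketch. The helper name `character_codes` is hypothetical, and the values it returns match ASCII only on platforms whose execution character set is ASCII-based, which is an assumption here (it holds on most current systems).

```cpp
#include <string>
#include <vector>

// Returns the numeric code stored in each byte of `text`.
// Assumption: the platform uses an ASCII-based execution character set,
// so that, for example, the letter A is stored as 65.
std::vector<int> character_codes(const std::string& text)
{
    std::vector<int> codes;
    // Iterating as unsigned char keeps the extended values 128..255
    // from showing up as negative numbers where char is signed.
    for (unsigned char c : text)
        codes.push_back(c);
    return codes;
}
```

Values from 0 to 127 are the ASCII codes; what the values from 128 to 255 mean depends on the code page in use.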

In the early days of computing, the introduction of coded character sets such as ASCII (1963) and EBCDIC (1964) began the process of standardization. The limitations of such sets soon became apparent, and a number of ad hoc methods were developed to extend them. The need to support multiple writing systems (languages), including the CJK family of East Asian scripts, required support for a far larger number of characters and demanded a systematic approach to character encoding rather than the previous ad hoc approaches.

What's this about UNICODE?

Unicode is an industry standard whose goal is to provide the means by which text of all forms and languages can be encoded for use by computers. At the time of writing the current version was Unicode 6.1, released in January 2012, which comprises over 110,000 characters from 100 scripts. Since Unicode is just a standard that assigns numbers to characters, there also need to be methods for encoding these numbers as bytes. The three most common character encodings are UTF-8, UTF-16, and UTF-32, of which UTF-8 is by far the most frequently used.

In the Unicode standard, planes are groups of numerical values (code points) that point to specific characters. Unicode code points are logically divided into 17 planes, each with 65,536 (= 2^16) code points. Planes are identified by the numbers 0 to 16 (decimal), which correspond to the possible values 00 to 10 (hexadecimal) of the first two positions in the six-position format (hhhhhh). As of version 6.1, six of these planes have assigned code points (characters), and are named.

Plane 0 - Basic Multilingual Plane (BMP)
Plane 1 - Supplementary Multilingual Plane (SMP)
Plane 2 - Supplementary Ideographic Plane (SIP)
Planes 3–13 - Unassigned
Plane 14 - Supplementary Special-purpose Plane (SSP)
Planes 15–16 - Supplementary Private Use Area (SPUA-A/B)
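Because each plane holds exactly 2^16 code points, the plane number is simply the code point shifted right by 16 bits. The following is a minimal sketch of that relationship; the names `plane_of` and `plane_abbreviation` are hypothetical helpers introduced for illustration, and the "unassigned" label reflects the plane list above as of version 6.1.

```cpp
// Highest code point in the Unicode code space (end of plane 16).
constexpr char32_t kMaxCodePoint = 0x10FFFF;

// Returns the plane number (0..16) of a code point,
// or -1 if the value lies outside the Unicode code space.
constexpr int plane_of(char32_t cp)
{
    return cp <= kMaxCodePoint ? static_cast<int>(cp >> 16) : -1;
}

// Returns the abbreviation of a plane as listed above.
const char* plane_abbreviation(int plane)
{
    switch (plane) {
    case 0:  return "BMP";
    case 1:  return "SMP";
    case 2:  return "SIP";
    case 14: return "SSP";
    case 15: return "SPUA-A";
    case 16: return "SPUA-B";
    default: return (plane >= 3 && plane <= 13) ? "unassigned" : "invalid";
    }
}
```

For example, `plane_of(0x1F600)` yields 1, so that code point belongs to the SMP.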

BMP and SMP

BMP                       SMP
0000–0FFF   8000–8FFF     10000–10FFF   18000–18FFF
1000–1FFF   9000–9FFF     11000–11FFF   19000–19FFF
2000–2FFF   A000–AFFF     12000–12FFF   1A000–1AFFF
3000–3FFF   B000–BFFF     13000–13FFF   1B000–1BFFF
4000–4FFF   C000–CFFF     14000–14FFF   1C000–1CFFF
5000–5FFF   D000–DFFF     15000–15FFF   1D000–1DFFF
6000–6FFF   E000–EFFF     16000–16FFF   1E000–1EFFF
7000–7FFF   F000–FFFF     17000–17FFF   1F000–1FFFF

SIP and SSP

SIP                           SSP
20000–20FFF   28000–28FFF     E0000–E0FFF
21000–21FFF   29000–29FFF
22000–22FFF   2A000–2AFFF
23000–23FFF   2B000–2BFFF
24000–24FFF
25000–25FFF
26000–26FFF
27000–27FFF   2F000–2FFFF

PUA
F0000–F0FFF   F8000–F8FFF     100000–100FFF   108000–108FFF
F1000–F1FFF   F9000–F9FFF     101000–101FFF   109000–109FFF
F2000–F2FFF   FA000–FAFFF     102000–102FFF   10A000–10AFFF
F3000–F3FFF   FB000–FBFFF     103000–103FFF   10B000–10BFFF
F4000–F4FFF   FC000–FCFFF     104000–104FFF   10C000–10CFFF
F5000–F5FFF   FD000–FDFFF     105000–105FFF   10D000–10DFFF
F6000–F6FFF   FE000–FEFFF     106000–106FFF   10E000–10EFFF
F7000–F7FFF   FF000–FFFFF     107000–107FFF   10F000–10FFFF

Currently, about ten percent of the potential space is used. Furthermore, ranges of characters have been tentatively mapped out for every current and ancient writing system (script) the Unicode Consortium has been able to identify. While Unicode may eventually need to use another of the 11 spare planes for ideographic characters, other planes remain. Even if previously unknown scripts with tens of thousands of characters are discovered, the limit of 1,114,112 code points is unlikely to be reached in the near future. The Unicode Consortium has stated that this limit will never be changed.

The odd-looking limit (it is not a power of 2) is not due to UTF-8, which was designed with a limit of 2^31 code points (32,768 planes) and can encode 2^21 code points (32 planes) even if limited to 4 bytes; it is due to the design of UTF-16. In UTF-16, a "surrogate pair" of two 16-bit words is used to encode the 2^20 code points of planes 1 to 16, in addition to the use of single words to encode plane 0.
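The arithmetic behind the limit can be checked at compile time. The sketch below uses hypothetical constant names and relies on one detail not spelled out above: each word of a surrogate pair carries 10 payload bits, which is where the 2^20 figure comes from.

```cpp
#include <cstdint>

// One plane holds 2^16 code points, and there are 17 planes.
constexpr std::uint32_t kCodePointsPerPlane = 1u << 16;   // 65,536
constexpr std::uint32_t kPlaneCount = 17;
constexpr std::uint32_t kCodeSpaceSize = kPlaneCount * kCodePointsPerPlane;

// A surrogate pair carries 10 payload bits in each of its two 16-bit
// words, giving 2^10 * 2^10 = 2^20 combinations.
constexpr std::uint32_t kSurrogatePairCombinations = (1u << 10) * (1u << 10);

static_assert(kCodeSpaceSize == 1114112,
              "17 planes of 65,536 code points each");
static_assert(kSurrogatePairCombinations == 16 * kCodePointsPerPlane,
              "surrogate pairs cover exactly planes 1 to 16");
static_assert(kCodePointsPerPlane + kSurrogatePairCombinations == kCodeSpaceSize,
              "plane 0 (single words) plus planes 1 to 16 (pairs)");
static_assert(kCodeSpaceSize - 1 == 0x10FFFF,
              "the highest code point is 10FFFF hexadecimal");
```

If any of these relations were wrong, the translation unit would fail to compile.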

UTF-8

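UTF-8 is a variable-length encoding of Unicode, using from 1 to 4 bytes for each character. It was designed for compatibility with ASCII, and as such, single-byte values represent the same character in UTF-8 as they do in ASCII. Because no byte of a multi-byte UTF-8 sequence is zero, a UTF-8 stream contains '\0' only where the text itself contains the NUL character, so you may store it in ordinary null-terminated strings and pass it through your existing C++ code largely without porting. The main exception is code that assumes one byte per character, for example when counting the actual number of characters in a string.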
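The variable-length scheme and the character-counting caveat can be sketched as follows. The function names `append_utf8` and `count_code_points` are hypothetical helpers, not part of the standard library, and the counting function assumes its input is already valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Appends the UTF-8 encoding of one code point to `out`.
// Returns false, appending nothing, if `cp` is a surrogate
// (D800..DFFF) or lies above 10FFFF.
bool append_utf8(std::string& out, char32_t cp)
{
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx (same as ASCII)
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {           // 4 bytes: 11110xxx followed by 3 x 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        return false;
    }
    return true;
}

// Counts the code points in a string assumed to hold valid UTF-8.
// Every code point contributes exactly one byte that is not a
// continuation byte (10xxxxxx), so counting those bytes is enough.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8)
        if ((byte & 0xC0) != 0x80)
            ++count;
    return count;
}
```

Note that `std::string::size()` still reports the number of bytes, which is why the two values differ as soon as the text leaves the ASCII range.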

UTF-16

UTF-16 is also variable-length, but works in 16-bit units instead of 8-bit ones, so each character is represented by either 2 or 4 bytes. This means that it is not compatible with ASCII: even an ASCII character occupies two bytes, one of which is zero.
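A minimal sketch of the 2-or-4-byte rule, using the standard `std::u16string` type. The function name `append_utf16` is hypothetical; the surrogate ranges D800..DBFF and DC00..DFFF are those reserved by Unicode for this purpose.

```cpp
#include <string>

// Appends the UTF-16 code units for one code point to `out`.
// Returns false, appending nothing, if `cp` is itself a surrogate
// value or lies above 10FFFF.
bool append_utf16(std::u16string& out, char32_t cp)
{
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
        return false;

    if (cp < 0x10000) {
        // Plane 0: a single 16-bit unit holds the code point directly.
        out += static_cast<char16_t>(cp);
    } else {
        // Planes 1 to 16: subtract 10000 (hex) to get a 20-bit value,
        // then split it into two 10-bit halves.
        const char32_t offset = cp - 0x10000;
        out += static_cast<char16_t>(0xD800 + (offset >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 + (offset & 0x3FF));  // low surrogate
    }
    return true;
}
```

The order in which the two bytes of each unit are stored (big-endian or little-endian) is a separate concern that this sketch does not address.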

UTF-32

Unlike the previous two encodings, UTF-32 is not variable-length: every character is represented by exactly 32 bits. This makes encoding and decoding easier, because the 4-byte value maps directly to the Unicode code space. The disadvantage is in space efficiency, as each character takes 4 bytes, no matter what it is.