C++ Programming/Software Internationalization/Text Encoding
Text encoding
Text, and in particular the characters used to produce readable text, is represented by means of a character encoding scheme. Such a scheme pairs each character from a given character set (sometimes referred to as a code page) with something else, such as a sequence of natural numbers, octets, or electrical pulses, in order to facilitate its digital representation.
An easy-to-understand example is Morse code, which encodes letters of the Latin alphabet as series of long and short depressions of a telegraph key; this is similar to how ASCII encodes letters, numerals, and other symbols as integers.
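The mapping between characters and integers can be observed directly in C++. The following is a minimal sketch that assumes an ASCII-compatible execution character set, which is the case on most current platforms but is not required by the C++ standard. The function names `char_to_code` and `code_to_char` are chosen here for illustration only.

```cpp
// Assumes an ASCII-compatible execution character set (common today,
// but not guaranteed by the C++ standard).

// Returns the integer code that represents the character c.
constexpr int char_to_code(char c) {
    return static_cast<unsigned char>(c);
}

// Returns the character represented by the given integer code.
constexpr char code_to_char(int code) {
    return static_cast<char>(code);
}
```

Under the ASCII assumption, `char_to_code('A')` yields 65 and `code_to_char(97)` yields `'a'`.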
Text and data
Probably the most important use for a byte is holding a character code. Characters typed at the keyboard, displayed on the screen, and printed on the printer all have numeric values. To allow it to communicate with the rest of the world, the IBM PC uses a variant of the ASCII character set. ASCII defines 128 codes (0 to 127). IBM uses the remaining 128 values that a byte can hold (128 to 255) for extended character codes, including European characters, graphic symbols, Greek letters, and math symbols.
In the early days of computing, the introduction of coded character sets such as ASCII (1963) and EBCDIC (1964) began the process of standardization. The limitations of such sets soon became apparent, and a number of ad hoc methods were developed to extend them. The need to support multiple writing systems, including the CJK family of East Asian scripts, required support for a far larger number of characters and demanded a systematic approach to character encoding rather than the previous ad hoc approaches.
What's this about UNICODE?
Unicode is an industry standard whose goal is to provide the means by which text of all forms and languages can be encoded for use by computers. At the time of writing, the current version is Unicode 6.1, released in January 2012, which comprises over 110,000 characters from 100 scripts. Since Unicode is just a standard that assigns numbers to characters, there also need to be methods for encoding these numbers as bytes. The three most common character encodings are UTF-8, UTF-16, and UTF-32, of which UTF-8 is by far the most frequently used.
In the Unicode standard, planes are groups of numerical values (code points) that point to specific characters. Unicode code points are logically divided into 17 planes, each with 65,536 (= 2^16) code points. Planes are identified by the numbers 0 to 16 (decimal), which correspond to the possible values 00 to 10 (hexadecimal) of the first two positions in the six-position format (hhhhhh). As of version 6.1, six of these planes have assigned code points (characters) and are named.
Plane 0 - Basic Multilingual Plane (BMP)
Plane 1 - Supplementary Multilingual Plane (SMP)
Plane 2 - Supplementary Ideographic Plane (SIP)
Planes 3–13 - Unassigned
Plane 14 - Supplementary Special-purpose Plane (SSP)
Planes 15–16 - Supplementary Private Use Area (S PUA A/B)
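Because each plane holds 2^16 code points, the plane number of a code point is simply its value shifted right by 16 bits. The sketch below illustrates this; the names `kMaxCodePoint` and `plane_of` are hypothetical and introduced only for this example.

```cpp
#include <cstdint>

// Highest valid Unicode code point (the last code point of plane 16).
constexpr std::uint32_t kMaxCodePoint = 0x10FFFF;

// Returns the plane number (0 to 16) that contains the code point cp,
// or -1 if cp lies outside the Unicode code space.
constexpr int plane_of(std::uint32_t cp) {
    return cp <= kMaxCodePoint ? static_cast<int>(cp >> 16) : -1;
}
```

For example, `plane_of(0x0041)` is 0 (BMP), `plane_of(0x1F600)` is 1 (SMP), and `plane_of(0x20000)` is 2 (SIP).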
BMP and SMP
BMP | BMP | SMP | SMP |
---|---|---|---|
0000–0FFF | 8000–8FFF | 10000–10FFF | 18000–18FFF |
1000–1FFF | 9000–9FFF | 11000–11FFF | 19000–19FFF |
2000–2FFF | A000–AFFF | 12000–12FFF | 1A000–1AFFF |
3000–3FFF | B000–BFFF | 13000–13FFF | 1B000–1BFFF |
4000–4FFF | C000–CFFF | 14000–14FFF | 1C000–1CFFF |
5000–5FFF | D000–DFFF | 15000–15FFF | 1D000–1DFFF |
6000–6FFF | E000–EFFF | 16000–16FFF | 1E000–1EFFF |
7000–7FFF | F000–FFFF | 17000–17FFF | 1F000–1FFFF |
SIP and SSP
SIP | SIP | SSP |
---|---|---|
20000–20FFF | 28000–28FFF | E0000–E0FFF |
21000–21FFF | 29000–29FFF | |
22000–22FFF | 2A000–2AFFF | |
23000–23FFF | 2B000–2BFFF | |
24000–24FFF | | |
25000–25FFF | | |
26000–26FFF | | |
27000–27FFF | 2F000–2FFFF | |
PUA
Currently, about ten percent of the potential space is used. Furthermore, ranges of characters have been tentatively mapped out for every current and ancient writing system (script) the Unicode Consortium has been able to identify. While Unicode may eventually need to use another of the spare 11 planes for ideographic characters, other planes remain. Even if previously unknown scripts with tens of thousands of characters are discovered, the limit of 1,114,112 code points is unlikely to be reached in the near future. The Unicode Consortium has stated that the limit will never be changed.
The odd-looking limit (it is not a power of 2) is not due to UTF-8, which was designed with a limit of 2^31 code points (32,768 planes) and can encode 2^21 code points (32 planes) even if limited to 4 bytes. It is due to the design of UTF-16. In UTF-16, a "surrogate pair" of two 16-bit words is used to encode the 2^20 code points of planes 1 to 16, in addition to the use of single words to encode plane 0. Together this gives 2^16 + 2^20 = 1,114,112 code points.
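The arithmetic behind the limit can be checked at compile time. The sketch below assumes the standard UTF-16 surrogate ranges (D800 to DBFF for the first word, DC00 to DFFF for the second), which are not stated above; the constant names are invented for this example.

```cpp
#include <cstdint>

// Plane 0 is encoded with single 16-bit words.
constexpr std::uint32_t kBmpSize = 1u << 16;

// Number of possible first (high) and second (low) surrogate words.
constexpr std::uint32_t kHighSurrogates = 0xDBFF - 0xD800 + 1;  // 1,024
constexpr std::uint32_t kLowSurrogates  = 0xDFFF - 0xDC00 + 1;  // 1,024

// Every high/low combination names one code point in planes 1 to 16.
constexpr std::uint32_t kSupplementary = kHighSurrogates * kLowSurrogates;

// Total size of the Unicode code space.
constexpr std::uint32_t kCodeSpace = kBmpSize + kSupplementary;

static_assert(kSupplementary == (1u << 20), "surrogate pairs cover 2^20 code points");
static_assert(kSupplementary == 16u * 65536u, "that is exactly 16 planes");
static_assert(kCodeSpace == 1114112u, "2^16 + 2^20 code points in total");
static_assert(kCodeSpace == 17u * 65536u, "17 planes of 65,536 code points");
```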
UTF-8
UTF-8 is a variable-length encoding of Unicode, using from 1 to 4 bytes for each character. It was designed for compatibility with ASCII, and as such, single-byte values represent the same character in UTF-8 as they do in ASCII. Because the UTF-8 encoding of any character other than the null character (U+0000) contains no zero bytes, UTF-8 text can be stored in ordinary null-terminated strings and used in much existing C++ code without porting. The main exception is code that counts or indexes characters, since the number of bytes is no longer the number of characters.
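The following sketch shows one possible way to encode a code point as UTF-8 and to count the code points in a UTF-8 string. It is an illustration, not a complete library: `append_utf8` and `count_code_points` are hypothetical names, and the counting function assumes its input is already well-formed UTF-8 and does no validation.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Appends the UTF-8 encoding of cp to out. Returns false (and appends
// nothing) if cp is a surrogate or lies beyond U+10FFFF.
bool append_utf8(std::uint32_t cp, std::string& out) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx (same as ASCII)
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx followed by three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return true;
}

// Counts code points in s, assuming s is well-formed UTF-8: every byte
// that is not a continuation byte (10xxxxxx) starts a new code point.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

For example, appending U+0041, U+00E9, U+20AC, and U+1F600 in turn produces a 10-byte string (1 + 2 + 3 + 4 bytes) for which `count_code_points` returns 4, while `size()` returns 10.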
UTF-16
UTF-16 is also variable-length, but works in 16-bit units instead of 8-bit units, so each character is represented by either 2 or 4 bytes. This means that it is not compatible with ASCII: even ASCII characters occupy two bytes.
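As a sketch of how the 2-byte and 4-byte forms arise, the function below encodes one code point into 16-bit units. It assumes the standard surrogate ranges (D800 to DBFF and DC00 to DFFF); the name `append_utf16` is hypothetical.

```cpp
#include <cstdint>
#include <string>

// Appends the UTF-16 encoding of cp to out. Returns false (and appends
// nothing) if cp is a surrogate or lies beyond U+10FFFF.
bool append_utf16(std::uint32_t cp, std::u16string& out) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
    if (cp < 0x10000) {
        // Plane 0: a single 16-bit unit holds the code point directly.
        out += static_cast<char16_t>(cp);
    } else {
        // Planes 1 to 16: split the 20-bit offset into two 10-bit halves.
        const std::uint32_t v = cp - 0x10000;
        out += static_cast<char16_t>(0xD800 + (v >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));  // low surrogate
    }
    return true;
}
```

With this function, U+0041 becomes the single unit 0x0041, while U+1F600 becomes the surrogate pair 0xD83D 0xDE00.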
UTF-32
Unlike the previous two encodings, UTF-32 is not variable-length: every character is represented by exactly 32 bits. This makes encoding and decoding easier, because the 4-byte value maps directly to the Unicode code space. The disadvantage is in space efficiency, as each character takes 4 bytes, no matter what it is.
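The trade-off can be made concrete with `std::u32string`, whose elements each hold one code point. The sketch below compares the storage a string needs in UTF-32 with what the same code points would need in UTF-8; the function names are invented for this example, and `sizeof(char32_t)` is assumed to be 4, as it is on common platforms.

```cpp
#include <cstddef>
#include <string>

// Bytes needed to store s as UTF-32: a fixed amount per code point.
std::size_t utf32_bytes(const std::u32string& s) {
    return s.size() * sizeof(char32_t);
}

// Bytes the same code points would need in UTF-8 (1 to 4 bytes each).
std::size_t utf8_bytes(const std::u32string& s) {
    std::size_t total = 0;
    for (char32_t cp : s) {
        total += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    }
    return total;
}

// In UTF-32 the n-th code point is simply element n; no scanning needed.
char32_t nth_code_point(const std::u32string& s, std::size_t n) {
    return s.at(n);
}
```

For the three code points U+0041, U+20AC, and U+1F600, `utf8_bytes` gives 8 (1 + 3 + 4), while `utf32_bytes` gives three times `sizeof(char32_t)`, typically 12.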