Fundamentals of Data Representation: ASCII and unicode
ASCII
ASCII normally uses 8 bits (1 byte) to store each character. However, the 8th bit is used as a parity bit for error checking, meaning that only 7 bits are available to store each character. This gives ASCII the ability to store a total of 2^7 = 128 different values.
There is also extended ASCII, which uses the 8th bit to store data, allowing for a larger character set of 256 values, but for the exam you'll probably be fine with 7-bit ASCII plus a parity bit.
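As a rough illustration of how the 8th bit can act as a check, here is a minimal Python sketch. It assumes even parity and places the check bit in front of the 7 data bits; the text above does not say which convention is used, and the function name `with_even_parity` is made up for this example.

```python
def with_even_parity(ch: str) -> str:
    """Return 8 bits: an even-parity bit followed by the 7-bit ASCII code."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not a 7-bit ASCII character")
    ones = bin(code).count("1")   # how many 1s are in the 7 data bits
    parity = ones % 2             # 1 if that count is odd, so the total becomes even
    return f"{parity}{code:07b}"  # assumption: parity bit goes in the 8th (leftmost) position


for ch in "ac":
    print(ch, with_even_parity(ch))
```

With this convention, every valid byte contains an even number of 1s, so a single flipped bit can be detected by the receiver.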
ASCII values can take many forms:
- Numbers
- Letters (capitals and lower case are separate)
- Punctuation (?/|\$ etc.)
- Non-printing control characters (carriage return, escape, tab)
Take a look at your keyboard and see how many different keys you have. The number should be 104 for a Windows keyboard, or 101 for a traditional keyboard. Allowing for the shifted values (a, A; b, B etc.) and recognising that some keys have repeated functionality (two shift keys, the num pad), we have roughly 128 functions that a keyboard can perform.
If you look carefully at the ASCII representation of each character you might notice some patterns. For example:
Binary | Dec | Hex | Glyph |
---|---|---|---|
110 0001 | 97 | 61 | a |
110 0010 | 98 | 62 | b |
110 0011 | 99 | 63 | c |
As you can see, a = 97, b = 98, c = 99. This means that if we are told what value a character is we can easily work out the value of subsequent or prior characters.
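The rows of the table above can be reproduced with a short Python sketch (Python is used here purely for illustration; `ascii_row` is a made-up helper name). It relies only on the built-in `ord` function, which returns a character's code.

```python
def ascii_row(ch: str) -> str:
    """Format one table row: 7-bit binary, decimal, hex and the glyph."""
    code = ord(ch)
    bits = f"{code:07b}"  # 7-bit binary, padded with leading zeros
    return f"{bits[:3]} {bits[3:]} | {code} | {code:X} | {ch} |"


for ch in "abc":
    print(ascii_row(ch))
```

Running the loop over a longer string such as "abcdefg" shows the codes continuing to rise by one for each letter.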
Example: ASCII characters

Without looking at the ASCII table above! If we are told that the ASCII value for the character '5' is 011 0101, what is the ASCII value for '8'?

We know that '8' is three characters after '5', as 5, 6, 7, 8. This means that the ASCII value of '8' will be three bigger than that for '5':

011 0101 (ASCII '5') + 000 0011 (three) = 011 1000 (ASCII '8')

Checking against the ASCII table, this is the correct value.

If you are worried about making mistakes with binary addition, you can deal with the decimal numbers instead. Take the example where you are given the ASCII value of 'g', 110 0111; what is 'e'?

We know that 'e' is two characters before 'g', as e, f, g. This means that the ASCII value of 'e' will be two smaller than that for 'g'.

64 | 32 | 16 | 8 | 4 | 2 | 1 |
---|---|---|---|---|---|---|
1 | 1 | 0 | 0 | 1 | 1 | 1 |

= 103 (base 10) = ASCII value of 'g'

103 - 2 = 101 (base 10)

64 | 32 | 16 | 8 | 4 | 2 | 1 |
---|---|---|---|---|---|---|
1 | 1 | 0 | 0 | 1 | 0 | 1 |

= 101 (base 10) = ASCII value of 'e'
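If you want to check this kind of working on a computer, the following Python sketch repeats both calculations. The helper names `code_from_bits` and `bits_from_code` are invented for this example and are not part of any library.

```python
def code_from_bits(bits: str) -> int:
    """Convert a 7-bit pattern such as '011 0101' to its decimal value."""
    return int(bits.replace(" ", ""), 2)


def bits_from_code(code: int) -> str:
    """Convert a decimal ASCII code back to a spaced 7-bit pattern."""
    b = f"{code:07b}"
    return f"{b[:3]} {b[3:]}"


# '8' is three characters after '5'
five = code_from_bits("011 0101")
eight = five + 3
print(bits_from_code(eight), chr(eight))

# 'e' is two characters before 'g'
g = code_from_bits("110 0111")
e = g - 2
print(e, bits_from_code(e), chr(e))
```

Converting to decimal, doing the arithmetic and converting back is the same approach as the second worked example, just automated.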
Exercise: ASCII

Without using the crib table (you won't get it in the exam!) answer the following questions:

The ASCII code for the letter 'Z' is 90 (base 10). What is the letter 'X' stored as?

Answer: 88, as it is 2 characters earlier in the alphabet.

How many ASCII 'characters' does the following piece of text use?

Hello Pete,
ASCII rocks!

Answer: 27 or 26. If you said 23 you'd be wrong, because you must include the non-printing characters at the end of each line. Each end of line needs an EOL command, and a new line needs a carriage return (CR), making the text like so:

Hello Pete,[EOL][CR]
ASCII rocks![EOL]
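The count can be checked with a small Python sketch. It assumes the end-of-line marker is stored as a line feed ("\n") and the carriage return as "\r", written in the conventional CR then LF order. The order does not change the total.

```python
LINES = ["Hello Pete,", "ASCII rocks!"]

# Visible characters only (letters, spaces and punctuation)
visible = sum(len(line) for line in LINES)

# First line ends with CR + LF, last line ends with LF only
last_line_short = "\r\n".join(LINES) + "\n"

# Both lines end with CR + LF
both_terminated = "".join(line + "\r\n" for line in LINES)

print(visible, len(last_line_short), len(both_terminated))
```

The three numbers printed correspond to the wrong answer of 23 and the two accepted answers of 26 and 27.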
For the Latin alphabet ASCII is generally fine, but what if you wanted to write something in Mandarin, or Hindi? We need another coding scheme!
Extension: Coding ASCII

You might have to use ASCII codes when reading from text files. To see which character each ASCII code represents, we can use the following loop:

    ' Print every 7-bit ASCII code alongside its character
    For x = 0 To 127
        Console.WriteLine("ASCII for " & x & " = " & ChrW(x))
    Next
    ' Wait for Enter so the console window stays open
    Console.ReadLine()
Unicode
The problem with ASCII is that it only allows you to represent a small number of characters (128, or 256 for extended ASCII). This might be OK if you are living in an English-speaking country, but what happens if you live in a country that uses a different character set? For example:
- Chinese characters 汉字
- Japanese characters 漢字
- Cyrillic Кири́ллица
- Gujarati ગુજરાતી
- Urdu اردو
You can see that we quickly run into trouble, as ASCII can't possibly store these many thousands of extra characters in just 7 bits. What we use instead is Unicode. There are several encodings of Unicode, each using a different number of bits to store data:
Name | Descriptions |
---|---|
UTF-8 | The most common Unicode encoding, built from 8-bit units. Characters can take as little as 8 bits, maximizing compatibility with ASCII, but the variable-width encoding expands to 16, 24 or 32 bits when dealing with larger sets of characters |
UTF-16 | 16-bit, variable-width encoding, can expand to 32 bits |
UTF-32 | 32-bit, fixed-width encoding. Each character takes exactly 32 bits |
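The widths in the table can be observed directly with Python's built-in `str.encode`. This is a minimal sketch; `encoded_bits` is a made-up helper, and the big-endian codec names ("utf-16-be", "utf-32-be") are chosen so that no byte order mark is added to the count.

```python
def encoded_bits(ch: str) -> dict:
    """Return how many bits one character needs in each encoding."""
    return {
        "UTF-8": len(ch.encode("utf-8")) * 8,
        "UTF-16": len(ch.encode("utf-16-be")) * 8,
        "UTF-32": len(ch.encode("utf-32-be")) * 8,
    }


# z, e-acute, the CJK character for water, and the musical G clef
for ch in ["z", "\u00e9", "\u6c34", "\U0001D11E"]:
    print(f"U+{ord(ch):04X}", encoded_bits(ch))
```

Notice that UTF-32 always reports 32 bits, while UTF-8 and UTF-16 grow only when the character needs it.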
With over a million possible code points, Unicode should be able to store every character from every language on the planet. Take a look at these examples:
code point | glyph* | character | UTF-16 code units (hex) |
---|---|---|---|
U+007A | z | LATIN SMALL LETTER Z | 007A |
U+6C34 | 水 | CJK UNIFIED IDEOGRAPH-6C34 (water) | 6C34 |
U+10000 | 𐀀 | LINEAR B SYLLABLE B008 A | D800, DC00 |
U+1D11E | 𝄞 | MUSICAL SYMBOL G CLEF | D834, DD1E |
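The last column of the table shows that code points above U+FFFF need two 16-bit code units (a surrogate pair) in UTF-16. The Python sketch below shows one way to work these out by hand. The function name `utf16_code_units` is invented for this example, and it does not validate that the input is a legal code point.

```python
def utf16_code_units(code_point: int) -> list[str]:
    """Return the UTF-16 code units for a code point as 4-digit hex strings."""
    if code_point < 0x10000:
        # Fits in a single 16-bit unit
        return [f"{code_point:04X}"]
    offset = code_point - 0x10000          # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits go in the low surrogate
    return [f"{high:04X}", f"{low:04X}"]


for cp in (0x007A, 0x6C34, 0x10000, 0x1D11E):
    print(f"U+{cp:04X}", utf16_code_units(cp))
```

The results can be compared against the table, or against Python's own encoder using `chr(cp).encode("utf-16-be").hex()`.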
You can find out more about Unicode encoding on Wikipedia.
Exercise: ASCII and Unicode

Without using the crib table (you won't get it in the exam!) answer the following questions:

The ASCII code for the letter 'D' is 100 0100. What is the letter 'G' stored as?

Answer: 100 0111, as it is 3 characters further on in the alphabet.

The ASCII code for the letter 's' is 111 0011. What is the letter 'm' stored as?

Answer: 110 1101, as it is 6 characters earlier in the alphabet.

Give a benefit of using ASCII.

Answer: Each character only takes up 8 bits, meaning that storing data in ASCII may take up less memory than Unicode.

Give a benefit of using Unicode over ASCII.

Answer: ASCII stores a much smaller character set than Unicode, meaning that you are limited to the Latin character set and cannot represent characters from other languages.

How many different characters can 7-bit ASCII represent?

Answer: 2^7 = 128

You are designing a computer system for use worldwide. What character encoding scheme should you use and why?

Answer: Unicode, as it would allow you to display non-Latin character sets such as Hindi and Cyrillic.