User:Newhoggy~enwikibooks/UTF8 for Haskell

This exercise describes how to convert a UTF8 string into a Unicode string in Haskell.

UTF8 is a superset of ASCII, so the following ASCII string is also a valid UTF8 string. To convert this to a unicode code point, simply take the ASCII ordinal value as the code point.

String H e l l o
ASCII code x48 x65 x49 x49 x6F
Unicode code point U+0048 U+0065 U+0049 U+0049 U+006F

Non-ascii characters span multiple byte and take more work to convert to a unicode code point.

String
ASCII code xE4:xBD:xA0 xE5:xA5:xBD
Unicode code point U+4f60 U+597d

Every non-ASCII unicode character when encoding in UTF8 must consist of a leading byte and one or more trailing bytes. Trailing bytes are of the form 0x10nnnnnn so they are easy to recognise. Leading bytes are of the form 0x110nnnnn, 0x1110nnnn, or 0x1111nnnn, each signifying that there are 1, 2, or 3 trailing bytes respectively.

The UTF8 to unicode translation can be describe thus:

Bytes UTF8 Unicode
2 0x110aaaaa 0x10bbbbbb 0x0000000000aaaaabbbbbb
3 0x1110aaaa 0x10bbbbbb 0x10cccccc 0x00000aaaabbbbbbcccccc
4 0x11110aaa 0x10bbbbbb 0x10cccccc 0x10dddddd 0xaaabbbbbbccccccdddddd

In the case of 2 bytes, if the resulting code point is less than or equal to 127, then the UTF8 character sequence is illegal because codepoints 0 to 127 are already represented in the ASCII subset. Notice that each trailing byte contributes 6 bits to the code point while the leading byte contributes 5, 4, or 3 bits to the code point depending on the type of leading byte. Keep this in mind because we will use bit shifting and recursion on this property for the translation.

The code

edit

First we need to import the ord function from Data.Char and the bitwise operators .&. (bitwise AND), .|. (bitwise OR), and shiftL (bitwise shift left) from Data.Bits.

> import Data.Bits
> import Data.Char

To help document the code we define UTF8 type synonyms for character and string datatypes. The type synonyms will make it clear which function argument, or return type contains what sort of data.

> type Utf8Char = Char
> type Utf8String = String

Here we define a helper function that will be used later on in the code. The function determines if a value is inclusively between two values.

> within :: Ord a => a -> (a, a) -> Bool
> within value (min, max) = (min <= value) && (value <= max)

> consumeUtf8 :: Int -> Int -> Utf8String -> String
> consumeUtf8 ordinal 0 remainder = (chr ordinal):(fromUtf8 remainder)
> consumeUtf8 ordinal n (next:remainder) =
>   consumeUtf8 newOrdinal (n-1) remainder
>   where
>     nextOrdinal = (ord next) .&. (ord '\x3f')
>     newOrdinal = (ordinal `shiftL` 6) .|. nextOrdinal
> consumeUtf8 _ n [] =
>   error "Incomplete character.  Expecting " ++ (show n) ++ " more bytes."

> fromUtf8 :: Utf8String -> String
> fromUtf8 [] = []
> fromUtf8 (lead:remainder)
>   | lead `within` asciiRange = fromUtf8 remainder
>   | lead `within` plus1Range = consumeUtf8 (ordinal plus1Mask) 1 remainder
>   | lead `within` plus2Range = consumeUtf8 (ordinal plus2Mask) 2 remainder
>   | lead `within` plus3Range = consumeUtf8 (ordinal plus3Mask) 3 remainder
>   | otherwise = error $ "Invalid lead byte: " ++ (show lead)
>   where
>     asciiRange = ('\x00', '\x7f')
>     badRange00 = ('\x80', '\xbf') -- Only tail bytes in this range
>     badRange01 = ('\xc0', '\xc1') -- Prevent duplicating code points <= 127
>     plus1Range = ('\xc2', '\xdf')
>     plus2Range = ('\xe0', '\xef')
>     plus3Range = ('\xf0', '\xf4')
>     badRange02 = ('\xf5', '\xff') -- Illegal in Unicode, or RFC 3629
>     plus1Mask = '\x1f'
>     plus2Mask = '\x0f'
>     plus3Mask = '\x07'
>     ordinal mask = (ord lead) .&. (ord mask)