Ruby Programming/Encoding


Encoding Support

Since Ruby 1.9, Ruby has supported encodings other than US-ASCII for strings, IO, and source code. An installation of MRI Ruby 1.9.3 on Mac OS X Lion provides the following encodings:

  • ASCII-8BIT
  • UTF-8
  • US-ASCII
  • Big5
  • Big5-HKSCS
  • Big5-UAO
  • CP949
  • Emacs-Mule
  • EUC-JP
  • EUC-KR
  • EUC-TW
  • GB18030
  • GBK
  • ISO-8859-1
  • ISO-8859-2
  • ISO-8859-3
  • ISO-8859-4
  • ISO-8859-5
  • ISO-8859-6
  • ISO-8859-7
  • ISO-8859-8
  • ISO-8859-9
  • ISO-8859-10
  • ISO-8859-11
  • ISO-8859-13
  • ISO-8859-14
  • ISO-8859-15
  • ISO-8859-16
  • KOI8-R
  • KOI8-U
  • Shift_JIS
  • UTF-16BE
  • UTF-16LE
  • UTF-32BE
  • UTF-32LE
  • Windows-1251
  • IBM437
  • IBM737
  • IBM775
  • CP850
  • IBM852
  • CP852
  • IBM855
  • CP855
  • IBM857
  • IBM860
  • IBM861
  • IBM862
  • IBM863
  • IBM864
  • IBM865
  • IBM866
  • IBM869
  • Windows-1258
  • GB1988
  • macCentEuro
  • macCroatian
  • macCyrillic
  • macGreek
  • macIceland
  • macRoman
  • macRomania
  • macThai
  • macTurkish
  • macUkraine
  • CP950
  • CP951
  • stateless-ISO-2022-JP
  • eucJP-ms
  • CP51932
  • GB2312
  • GB12345
  • ISO-2022-JP
  • ISO-2022-JP-2
  • CP50220
  • CP50221
  • Windows-1252
  • Windows-1250
  • Windows-1256
  • Windows-1253
  • Windows-1255
  • Windows-1254
  • TIS-620
  • Windows-874
  • Windows-1257
  • Windows-31J
  • MacJapanese
  • UTF-7
  • UTF8-MAC
  • UTF-16
  • UTF-32
  • UTF8-DoCoMo
  • SJIS-DoCoMo
  • UTF8-KDDI
  • SJIS-KDDI
  • ISO-2022-JP-KDDI
  • stateless-ISO-2022-JP-KDDI
  • UTF8-SoftBank
  • SJIS-SoftBank
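
The exact list depends on the Ruby version and on how it was built, so the list above will not match every installation. As a minimal sketch, assuming any Ruby 1.9 or later, the encodings known to the interpreter at hand can be printed with Encoding.list (the variable name encoding_names is only illustrative):

```ruby
# Collect the canonical name of every encoding built into this interpreter.
encoding_names = Encoding.list.map(&:name)

# Print the names alphabetically, followed by a count.
puts encoding_names.sort
puts "#{encoding_names.size} encodings available"
```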

Using Encodings

Since Ruby 2.0, Ruby source files are assumed to be encoded in UTF-8 by default. The usual way to declare a different encoding for a file is a so-called "magic comment". The magic comment must be the first line of the file, or the line directly after a shebang comment. Its syntax requires only one thing: the comment must contain the text coding: followed by the name of the encoding. So the following are all valid magic comments:

#encoding: UTF-8
#coding: UTF-8
#blah blah coding: US-ASCII
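
To check which encoding the interpreter actually applied to a file, the __ENCODING__ keyword can be inspected. The following is a small sketch, assuming it is saved and run as its own file; the variable names are illustrative only:

```ruby
# encoding: UTF-8

# __ENCODING__ is the encoding the parser applied to this source file.
source_encoding = __ENCODING__

# A plain string literal is tagged with that same encoding.
literal_encoding = "hello".encoding

puts "source:  #{source_encoding}"
puts "literal: #{literal_encoding}"
```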

The magic comment tells the interpreter that the source code itself, including all of the string literals in it, is encoded with the given encoding. So, while this is valid code:

#encoding: ISO-8859-1
puts "Olé!"

This is not:

#encoding: US-ASCII
puts "Olé!"

Because the first snippet declares the file's encoding to be ISO-8859-1 (an extension of US-ASCII that adds accented characters for languages such as French and Spanish), the character "é" is valid. In US-ASCII, however, "é" is not a valid character, so the second snippet causes an error when the file is parsed.
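
Whether a piece of text would be legal in a US-ASCII file can be tested at run time with String#ascii_only?. This is only a sketch: fits_us_ascii? is a hypothetical helper, not part of Ruby, and the accented character is written as an escape so that the example itself contains only ASCII:

```ruby
# Hypothetical helper (not part of Ruby): true when every character of the
# text could appear literally in a source file declared as US-ASCII.
def fits_us_ascii?(text)
  text.ascii_only?
end

accented = "Ol\u00E9!"  # "Olé!" written with an escape sequence
plain = "Hello"

puts fits_us_ascii?(accented)  # false
puts fits_us_ascii?(plain)     # true
```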

Encodings and Individual Strings

You can also specify encodings for individual strings in your file (although the characters in literals must still be in the encoding declared by the magic comment, or inserted through escape sequences). This is done by two methods of the String class: encode and force_encoding.

encode is used for transcoding. Given, say, the ISO-8859-1 string "Olé!", you can use encode to convert it to UTF-8, which has all of the same characters. However, you cannot transcode the ISO-8859-1 string to US-ASCII unless it contains only ASCII characters (e.g. "Hello"). encode has many options and can be configured extensively; see its documentation. The catch is that although the visual display and meaning of the characters stay the same, the underlying bytes usually do not: encode is free to change the underlying bytes of a string. Example:

#encoding: ISO-8859-1
"Olé!".encode("UTF-8") # Valid: every character exists in UTF-8
"Olé!".encode("US-ASCII") # Error: raises Encoding::UndefinedConversionError

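The change in the underlying bytes can be made visible with String#bytes. The sketch below assumes a UTF-8 source file and builds the ISO-8859-1 string by transcoding; the variable names and the "?" replacement character are illustrative choices, not requirements:

```ruby
utf8 = "Ol\u00E9!"                  # a \u escape yields a UTF-8 string literal
latin1 = utf8.encode("ISO-8859-1")  # same characters, different bytes

p utf8.bytes    # [79, 108, 195, 169, 33]
p latin1.bytes  # [79, 108, 233, 33]

# US-ASCII has no "é", so transcoding raises unless told how to cope.
begin
  latin1.encode("US-ASCII")
  conversion_error = nil
rescue Encoding::UndefinedConversionError => e
  conversion_error = e
  puts "cannot transcode: #{e.message}"
end

# One of encode's options: substitute characters the target cannot represent.
ascii = latin1.encode("US-ASCII", undef: :replace, replace: "?")
puts ascii  # Ol?!
```
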
force_encoding is used to tell Ruby the encoding of a string that already has the correct bytes for that encoding (e.g. a UTF-8 string read from a file in an ISO-8859-1 program). force_encoding never modifies the underlying bytes of a string; it only changes how Ruby interprets them. Example:

#encoding: ISO-8859-1
"\xE2\x9F\x98".force_encoding("UTF-8") # the same three bytes, now read as the UTF-8 character U+27D8

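That force_encoding only relabels a string can be checked by comparing the bytes before and after the call. This is a sketch assuming Ruby 2.0 or later (for String#b); the variable names are illustrative:

```ruby
raw = "Ol\u00E9!".b         # copy of the UTF-8 bytes, tagged ASCII-8BIT
bytes_before = raw.bytes
length_before = raw.length  # 5: every byte counts as one character

text = raw.force_encoding("UTF-8")  # relabels raw in place and returns it
puts text.bytes == bytes_before     # true: no byte was touched
puts text.length                    # 4: the two bytes of "é" are one character
puts text.valid_encoding?           # true

# Applying a label that does not match the bytes yields an invalid string.
wrong = "Ol\u00E9!".encode("ISO-8859-1").force_encoding("UTF-8")
puts wrong.valid_encoding?          # false
```

When the label is wrong, force_encoding itself does not raise; the problem only surfaces later, which is why valid_encoding? is worth checking on data from outside the program.
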
ASCII-8BIT

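Ruby also includes a fake encoding: ASCII-8BIT, also available under the alias BINARY. BINARY is the encoding used for binary data: it marks a string as a sequence of raw bytes rather than as text in any particular character set.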
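
As a short sketch of how binary strings behave, the following builds a string from raw byte values with Array#pack. The byte values are arbitrary sample data chosen for illustration, and the variable name header is hypothetical:

```ruby
# BINARY and ASCII_8BIT are two names for the same Encoding object.
puts Encoding::BINARY == Encoding::ASCII_8BIT  # true

# Array#pack with "C*" turns byte values into a string tagged ASCII-8BIT.
header = [0xFF, 0xD8, 0xFF, 0xE0].pack("C*")

puts header.encoding == Encoding::BINARY  # true
puts header.bytesize                      # 4
puts header.getbyte(0)                    # 255
```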

#encoding: ISO-8859-1