Perl Programming/Unicode UTF-8
Overview
In the context of application development, Unicode with UTF-8 encoding is the best way to support multiple languages in your application. Multiple languages can even be supported on the same Web page.
Unicode (usually in UTF-8 form) is replacing ASCII and the use of 8-bit "code pages" such as ISO-8859-1 and Windows-1252.
See also Perl Unicode Cookbook - 44 recipes for working with Unicode in Perl 5.
Unicode
Unicode is a standard that specifies all of the characters for most of the world's writing systems. Each character is assigned a unique codepoint, such as U+0030. The first 256 codepoints are the same as ISO-8859-1 to make it trivial to convert existing Western/Latin-1 text.
To view properties for a particular codepoint:
use Unicode::UCD 'charinfo';
use Data::Dumper;
print Dumper(charinfo(0x263a)); # U+263a
If you view the Unicode character reference, you will notice that not every codepoint has an assigned character. Also, because of backward compatibility with legacy encodings, some characters have multiple codepoints.
UTF-8
UTF-8 is a specific encoding of Unicode, and the most popular one. Other encodings include UTF-7, UTF-16, and UTF-32. You will probably want to use UTF-8 if you decide to use Unicode.
An encoding defines how each Unicode codepoint maps to bits and bytes. In UTF-8 encoding, the first 128 Unicode codepoints use one byte. These byte values are the same as US-ASCII, making UTF-8 encoding and ASCII encoding interchangeable if only ASCII characters are used. The next 1,920 codepoints use two-byte encoding in UTF-8. Three or four bytes are needed to encode the remaining codepoints.
Note that although Unicode codepoints 128-255 are the same as ISO-8859-1, UTF-8 encodes each of these codepoints differently. UTF-8 uses two bytes to encode each of these codepoints, whereas ISO-8859-1 only uses one byte for each character in that range. Therefore, ISO-8859-1 and UTF-8 are not interchangeable. (If only ASCII characters are used, then they are all interchangeable, since ASCII, ISO-8859-1, and UTF-8 all share the same encoding for the first 128 Unicode codepoints.)
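As a rough illustration of the difference described above, the following sketch encodes the same character with both encodings and counts the resulting octets. The choice of U+00F1 and the helper name hex_octets() are assumptions made for this example only; it relies on the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# hex_octets() is a hypothetical helper for this sketch: it renders a
# byte string as space-separated hexadecimal values.
sub hex_octets {
    my ($octets) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $octets;
}

# U+00F1 lies in the 128-255 range discussed above.
my $char = "\x{f1}";

my $latin1_octets = encode('ISO-8859-1', $char);
my $utf8_octets   = encode('UTF-8',      $char);

printf "ISO-8859-1: %d octet(s): %s\n", length($latin1_octets), hex_octets($latin1_octets);
printf "UTF-8:      %d octet(s): %s\n", length($utf8_octets),   hex_octets($utf8_octets);
```

The same codepoint yields one octet in ISO-8859-1 and two octets in UTF-8, which is why the two encodings cannot be swapped for text outside the ASCII range.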
So, to reiterate, with UTF-8, not all characters are encoded into a single byte (unlike ASCII and ISO-8859-1). Think about that for a moment: how might that affect editors (like vim or emacs), Web pages and forms, databases, Perl itself, Perl IO, your Perl source code (if you want to include a character with a multi-byte encoding)? How might that affect passing strings around, if the strings contain characters with multi-byte encodings? Do regular expressions still work?
Character Encoding | # characters | 128 US-ASCII characters | Next 128 characters | Remaining Characters |
---|---|---|---|---|
US-ASCII | 128 | 1 byte | N/A | N/A |
ISO-8859-1 | 256 | 1 byte | 1 byte | N/A |
UTF-8 | > 100,000 | 1 byte | 2 bytes | 2 - 4 bytes |
As you can see from the table above, codepoints 128-255 (0x80-0xff) are where you need to be careful. Later, you will find out that codepoints 128-159 (0x80-0x9F) are even trickier, due to the fact that the popular Windows-1252 character set (another one-byte-per-character encoding) is incompatible with ISO-8859-1 in this range.
\x{c3}\x{ae}
How much does UTF-8 "cost"?
- some functions are slower with UTF-8 encoded strings in Perl
- you have to write some additional Perl code to ensure that data coming into Perl is decoded properly, and that data going out of Perl is encoded properly — but you have to do this anytime you use a character set other than the native 8-bit character set of the platform (that we'll now refer to as N8CS[1]), which is often ISO-8859-1/Latin-1
- you have to interact with your database appropriately -- is it using UTF-8?
- you have to ensure your Web pages specify that pages are encoded in UTF-8
- you may need to make a Web server adjustment (if it is configured to always serve some particular character set, which is not UTF-8)
How do I use UTF-8?
The "best practice" approach is to use UTF-8 everywhere, if possible. This includes Web pages and hence Web forms, databases, HTML templates, and strings stored internally in Perl. One exception might be your Perl source code itself. If N8CS is sufficient (i.e., if you don't need any UTF-8 characters or strings in your source code), your source code does not have to be encoded as UTF-8. (Okay, another exception might be your HTML templates. If your templates only require/contain N8CS, they do not have to be encoded as UTF-8 either.)
To properly use UTF-8 in a Perl Web application, here is a summary of what must be done:
- All text (non-binary) data/octets coming into Perl (hence form data, database data, file reading, HTML templates, etc.) must be properly decoded. If the incoming text/octets are UTF-8 encoded, they must be UTF-8 decoded. If they are N8CS (usually ISO-8859-1) encoded, they should be N8CS decoded. If they are encoded with some other character set, they must be decoded with that character set.
- All text data going out of Perl (hence to the browser, database, files, etc.) must be properly encoded (into an octet stream). STDOUT (that goes to the browser) must be UTF-8 encoded.
- the browser needs to be told that Web pages are UTF-8 encoded via an HTTP header and a <meta> tag
Do not use Perl versions prior to 5.8.1. Although support for UTF-8 began with v5.6.0, regular expressions do not work even in the next release, v5.6.1. v5.8.1 added some speed improvements. By Perl 5.14, Unicode support is for the most part clean and smooth.
Before we start getting into the finer details about how to use UTF-8, we need to first define some terms, and then talk a bit about Perl's dual personality when it comes to internally storing text.
Terminology
A character is a logical entity. Characters must be encoded (using a character set) in order to be used, stored, written, exchanged between programs, etc. Encoding turns a logical character into something we can use in a program. Depending on which character set is used for encoding, a single character may require one or more bytes to represent it.
We'll use the term octets when referring to data passing into or out of a Perl program. An octet is a byte, 8 bits. Encoded characters make up an octet stream. When an octet stream comes into Perl, the bytes should be decoded (using the correct character set -- the character set they were encoded with) so that Perl can determine which logical characters are contained in the encoded octet stream. Perl can then store these as strings -- a sequence of characters.
Binary data also comes in as an octet stream. It should not be decoded using a character set, because it likely either doesn't contain any characters, or it contains information in addition to characters, and hence cannot be decoded with a character set.
Perl strings/text
Internally, Perl stores each string in one of the following encodings:
- native encoding (byte encoding). It uses N8CS[2]. This is a one-byte-per-character encoding, and hence a maximum of only 256 characters can be encoded. This is the default encoding for all incoming text/octets if Perl is not instructed to decode (bad idea). Strings using this encoding are called byte strings or binary strings. Unless you tell it otherwise, Perl will consider these bytes to be in ISO-8859-1, not in your platform encoding. This is a common bug.
- UTF-8 encoding (character encoding). It uses (obviously) UTF-8. Strings using this encoding are called character strings or text strings or Unicode strings.
When creating your own strings, Perl uses N8CS when possible (for backwards compatibility and efficiency reasons). However, if a character cannot be represented in N8CS, UTF-8 is used. In other words, if all codepoints in a string are <= 0xFF, N8CS is used, otherwise UTF-8 is used.
$native_string = "\xf1";
$native_string = "\x{00f1}"; # still N8CS, since <= 0xff
$native_string = chr(0xf1); # still N8CS, since <= 0xff
$utf8_string = "\x{0100}";
You can convert an N8CS string to a UTF-8 string using utf8::upgrade():
$my_string = "\xf1"; # N8CS byte string (one byte is used internally to encode)
utf8::upgrade($my_string); # UTF-8 character string now (two bytes are used internally to encode)
Your program can have a mix of strings in both of Perl's internal formats. Perl uses a "UTF8 flag" to keep track of which encoding a string is internally using. Thankfully, the format/flag follows the string. Perl keeps a string in N8CS as long as possible. However, when a N8CS/native string is used together with a UTF-8 string, the native string is silently implicitly decoded using N8CS, and upgraded (encoded) to UTF-8. In other words, the native byte string gets decoded with the native character set, and then it gets internally encoded into UTF-8. The resulting character string will have the UTF8 flag set.
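The flag behaviour described above can be observed with utf8::is_utf8(). The following is a small sketch for inspection only (the variable names are illustrative); application code should rarely need to test the flag.

```perl
use strict;
use warnings;

my $native = "\xf1";       # codepoint <= 0xFF, stored in N8CS
my $wide   = "\x{0100}";   # codepoint > 0xFF, stored as UTF-8

printf "native flagged: %s\n", utf8::is_utf8($native) ? 'yes' : 'no';
printf "wide flagged:   %s\n", utf8::is_utf8($wide)   ? 'yes' : 'no';

# Using the two together implicitly upgrades the native part to UTF-8.
my $joined = $native . $wide;

printf "joined flagged: %s\n", utf8::is_utf8($joined) ? 'yes' : 'no';
printf "joined length:  %d characters\n", length $joined;
```

Note that the joined string still holds two characters, and its first character is still U+00F1; only the internal storage changed.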
UTF-8 flow
Any Perl IO needs to correctly handle decoding and encoding of strings/text. Since there are multiple character encodings in use in the world, Perl can't correctly guess which character encoding was used to encode some particular incoming text/octets, nor can it know which character encoding you want to use for outgoing text/octets. An incoming stream of UTF-8 octets is not the same as, say, an incoming stream of Windows-1252 octets. For example, Unicode character U+201C (left double quotation mark) is encoded in one byte in Windows-1252 (0x93), but UTF-8 encodes it using three octets (0xE2 0x80 0x9C). If you want Perl to interpret your incoming text/octets correctly, you must tell Perl which character set was used to encode them, so they can be decoded properly.
The typical flow of UTF-8 text/octets in to and out of a Perl program is as follows:
- Receive an external UTF-8 encoded text/octet stream and correctly decode it — i.e., tell Perl which character set the octets are encoded in (in this case, the encoding is UTF-8). Perl may check for malformed data (bad encoding) while decoding, depending on which decoding method you select. Perl stores the string internally as N8CS or UTF-8, depending on which decoding method you select, and what characters are found to be in the octet stream. (Normally, the string will be internally stored as UTF-8.)
- Process the string as you normally would.
- Encode the string into a UTF-8 encoded octet stream and output it.
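The three steps above can be sketched with the core Encode module. This is a minimal illustration, not a complete program: the input octets are hard-coded here, whereas a real application would receive them from a file, form, or database.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Step 1: octets arrive from outside. These five octets are the UTF-8
# encoding of "cafe" with an acute accent on the final letter.
my $octets_in = "caf\xc3\xa9";

# Decode strictly. LEAVE_SRC keeps $octets_in intact, since passing a
# CHECK value otherwise lets decode() modify its source argument.
my $text = decode('UTF-8', $octets_in, Encode::FB_CROAK() | Encode::LEAVE_SRC());

# Step 2: process characters, not octets.
my $shout = uc $text;

# Step 3: encode on the way out.
my $octets_out = encode('UTF-8', $shout);

printf "%d octets in, %d characters, %d octets out\n",
    length($octets_in), length($text), length($octets_out);
```

Because the string is decoded before uc() is applied, the accented letter is upper-cased as a single character rather than being treated as two unrelated bytes.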
1. Decoding text input
External input includes submitted HTML form data, database data (e.g., from SQL SELECT statements), HTML templates, text files, sockets, other programs, etc. If any of these might contain UTF-8 encoded data/text, you must decode it. UTF-8 decoding in Perl involves two steps:
- Decoding the text according to UTF-8 format rules. This may generate decoding errors, depending on which decoding method you select. Using decode() always results in the string being internally stored as UTF-8, with the UTF8 flag set (despite what the documentation for Encode says). Using utf8::decode() may result in N8CS or UTF-8 internal encoding. If the incoming text only contains ASCII characters, N8CS is used, otherwise UTF-8 is used.
- Encoding the text (this might be a no-op) and storing it internally as N8CS or UTF-8. If it is stored as UTF-8, the UTF8 flag is set.
If you are certain that the incoming data/octets only contains N8CS (that Perl will interpret as ISO-8859-1) text, you do not need to explicitly decode it (because Perl's default internal encoding is N8CS, which is a one-byte-per-character encoding). However, "best practice" suggests that all incoming data/octets should be explicitly decoded — you can explicitly decode ISO-8859-1, ASCII, and a number of other character encodings.
If you don't decode, Perl assumes input text/octets are N8CS encoded, hence each octet is treated as a separate character — clearly, this is not what you want if you have a multi-byte UTF-8 encoded octet stream/text coming in. Improper decoding can lead to double encoding, and this can be difficult to locate due to implicit decoding (discussed above).
Another important point to make here: you need to know which encoding was used for each input text. Do not guess, do not assume.
Input - files, file handles
Perl can automatically decode data as it comes into Perl using PerlIO layers:
open(my $in_fh, '<:encoding(UTF-8)', $filename) or die "Cannot open $filename: $!"; # auto UTF-8 decoding on read
If you already have an open filehandle:
binmode $in2_fh, ':encoding(UTF-8)';
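Do not use :utf8, since it does not check that your incoming text is valid UTF-8; it simply marks it as UTF-8. Also prefer :encoding(UTF-8) over :encoding(utf8), because the latter applies Perl's laxer UTF-8 rules and accepts some sequences that strict UTF-8 rejects. See Perlmonks.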
If your text file contains a Byte Order Mark, see Perlmonks.
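To see the decoding layer at work without creating a file, the sketch below opens an in-memory filehandle over a byte string. The in-memory handle and the sample text are assumptions made for this example; the layer behaves the same way on a real file.

```perl
use strict;
use warnings;

# Six UTF-8 octets plus a newline: "naive" with a diaeresis on the i.
my $octets = "na\xc3\xafve\n";

# Reading through the layer yields characters instead of raw octets.
open(my $in_fh, '<:encoding(UTF-8)', \$octets)
    or die "Cannot open in-memory file: $!";
my $line = <$in_fh>;
close $in_fh;
chomp $line;

printf "%d octets in, %d characters out\n", length($octets) - 1, length($line);
```

The two-octet sequence C3 AF arrives in the program as the single character U+00EF.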
Input - HTML templates
If you are using a CGI framework or template engine to pull in UTF-8 encoded HTML template files, you may need to inform it about the UTF-8 encoding, so that it can "UTF-8 decode" the template files as they are read in. Basically, the framework or template engine needs to do what we talked about in the previous section.
For Template::Toolkit, if you use an appropriate Byte Order Mark (BOM) in your template files to indicate the encoding, the toolkit will decode them appropriately, automatically. If the templates do not use BOMs, use the ENCODING option:
my $template = Template->new({ ENCODING => 'utf8' });
HTML::Template currently does not support decoding of UTF-8 encoded HTML template files. This is a known limitation/bug. There are a few workarounds:
- A patch is available.
- You can use TMPL_VARs to insert UTF-8 content into an N8CS (or even ASCII) encoded template file. UTF-8 decode your parameters/content before inserting them into an HTML template using TMPL_VARs, and implicit decoding should upgrade the resulting text (i.e., the template and the filled-in variables) to UTF-8 internally. For many applications, this is often sufficient.
Input - Web forms
By default, CGI.pm does not decode your form parameters. You can use the -utf8 pragma, which will treat (and decode) all parameters as UTF-8 strings, but this will fail if you have any binary file upload fields. A better solution involves overriding the param method:
package CGI::as_utf8;
BEGIN {
    use strict;
    use warnings;
    use CGI 3.47;  # earlier versions have a UTF-8 double-decoding bug
    {
        no warnings 'redefine';
        my $param_org    = \&CGI::param;
        my $might_decode = sub {
            my $p = shift;
            # make sure upload() filehandles are not modified
            return $p if !$p || ( ref $p && fileno($p) );
            utf8::decode($p);  # returns false on invalid UTF-8 and leaves $p as-is
            $p;
        };
        *CGI::param = sub {
            # setting a param goes through the original interface
            goto &$param_org if scalar @_ != 2;
            my ($q, $p) = @_;  # assume object calls always
            return wantarray
                ? map { $might_decode->($_) } $q->$param_org($p)
                : $might_decode->( $q->$param_org($p) );
        };
    }
}
1;
---
use CGI::as_utf8; # put this line in your app, e.g., in your CGI::Application module(s)
The above is rhesa's solution with a slight modification — utf8::decode() is used instead of Encode's decode_utf8(), as it is more efficient when only ASCII characters are involved (since the UTF8 flag is not set). Note that the module assumes that Web pages and forms are always UTF-8 encoded, and that the OO interface of CGI.pm is always used.
Note, browsers should encode form data in the same character encoding that was used to display the form. So, if you are sending UTF-8 forms, you should get UTF-8 encoded data back for text fields. You should not have to use accept-charset in your HTML markup.
Input - STDIN
When a Web form is POSTed, form data comes into Perl via STDIN. If you are using CGI.pm, text form data is available via CGI.pm's param() method, and the previous section describes how to properly handle UTF-8 encoded text form data.
If you don't have any file uploads (i.e., all of your data is text), then instead of the CGI::as_utf8 module, you could add the following line of code to the beginning of your script to cause all data received on STDIN (i.e., all POSTed form data) to be automatically decoded as UTF-8:
binmode STDIN, ":encoding(UTF-8)";
Do not use
binmode STDIN, ":utf8"; # do NOT use this!
since it does not check that your incoming text is valid UTF-8, it simply marks it as UTF-8 — see Perl 5 Wiki.
The approach in the previous section is preferred, since it will "do the right thing" if there is any binary form data (file uploads).
If you are writing some other (non-CGI) program that receives data on STDIN, decode appropriately:
use Encode qw(decode);
my $utf8_text = decode('UTF-8', scalar readline(STDIN));
my $iso8859_text = decode('ISO-8859-1', scalar readline(STDIN));
my $binary_data = read(...); # don't decode
Note that decode() always sets Perl's internal UTF8 flag.
Input - database
In the "use UTF-8 everywhere" model, configure your database to store values in UTF-8.
When reading data from a UTF-8 database, ensure incoming UTF-8 encoded string field data is UTF-8 decoded, but do not decode incoming binary field data.
Input - MySQL
With MySQL, UTF-8 decoding (and encoding) of string field data is automatic if you use the mysql_enable_utf8 database handle attribute:
use DBI();
my $dbh = DBI->connect('dbi:mysql:test_db', $username, $password,
{mysql_enable_utf8 => 1}
);
This means you should not call utf8::decode() (or any other UTF-8 decode function) on incoming string field data — the driver will do that for you. If the incoming data for a field only contains ASCII octets, the UTF8 flag is not set for that field (so it appears to be using utf8::decode()). The driver is also smart enough to not decode binary data.
Version 4.004 or higher of DBD::mysql is required. UTF-8 was first available in MySQL v4.1. As of v5.0, it is the system default.
Input - PostgreSQL
With PostgreSQL, as of DBD::Pg version 3.0.0, UTF-8 decoding (and encoding) of string field data is automatic if the database is also set to UTF-8.
For previous versions, you must use the pg_enable_utf8 database handle attribute which will set all non-binary data as UTF-8 regardless of the client_encoding value.
use DBI();
my $dbh = DBI->connect('dbi:Pg:dbname=test_db', $username, $password,
{pg_enable_utf8 => 1}
);
This means you should not call utf8::decode() (or any other UTF-8 decode function) on incoming string field data — the DBD::Pg driver will do that for you. The driver is also smart enough to not decode binary data.
The default client_encoding is to use the database encoding, so if your database is UTF-8 then it will be set by default. In other cases you may need to tell PostgreSQL to use UTF-8 when sending data out of the database:
SET CLIENT_ENCODING TO 'UTF8';
or
SET NAMES 'UTF8';
For example, with Rose::DB:
__PACKAGE__->register_db(
domain => 'development',
...
connect_options => {
pg_server_prepare => 0,
pg_enable_utf8 => 1,
},
post_connect_sql => "SET CLIENT_ENCODING TO 'UTF8';",
);
See Automatic Character Set Conversion Between Server and Client
2. Processing strings
Once all incoming strings have been decoded into UTF-8 internally, you can process your text as normal. Regular expressions will work (if using Perl v5.8 or higher).
If you create any strings in your source code that contain non-ASCII characters (characters above 0x7f), ensure you upgrade them to internal UTF-8 encoding:
my $text = "\xE0"; # 0xE0 = à in ISO-8859-1
utf8::upgrade($text);
my $unicode_char = "\x{00f1}"; # U+00F1 = ñ
utf8::upgrade($unicode_char);
Perl 5 "Unicode bug"
(2011-05-03 update: v5.14 is now available, finally banishing the Unicode bug.)
Without a locale specified, if you have native/N8CS strings with characters in the 0x80-0xFF (128-255) range, then \d, \s, \w, \D, \S, \W (hence regular expressions), and lc(), uc(), etc. may not work as expected, since the non-ASCII part (0x80-0xFF) of the character set is ignored for those operations. (This is another reason to try and use UTF-8 everywhere.) Without a locale, Perl can't properly interpret characters in this range, since different encodings use different characters in this range, so it ignores them -- this is called ASCII semantics.
There are three ways to avoid this "Unicode bug". The best is to upgrade to Perl 5.14 and put a use 5.014; at the top of your file. The other two involve getting the natively encoded string to switch to UTF-8 encoding — because when the internal encoding is UTF-8, Unicode semantics are used, which always work as expected.
1. Follow "best practice" and always properly decode all external input text/octets. During decoding, any text/octets found to contain non-ASCII characters will be converted to UTF-8 internal encoding. For example
use Encode;
# suppose $windows1252_octets contains text from an external input, and it contains the character
# "\xE0" (0xE0 = à). String $windows1252_octets will exhibit the Unicode bug -- it won't match /\w/
my $utf8_string = decode('cp1252',$windows1252_octets); # no Unicode bug, $utf8_string matches /\w/
2. Use utf8::upgrade($native_string) to force $native_string to switch to UTF-8 internal encoding. (Even if the string only contains ASCII characters, it is still "upgraded" to UTF-8.)
my $text = "\xE0"; # will exhibit Unicode bug, won't match /\w/
utf8::upgrade($text); # no Unicode bug, matches /\w/
Note that with internal UTF-8 encoding, \w represents a much, much larger set of characters, so regex operations will be slower (vs. native encoding). TBD: what is the actual performance degradation? What is the character set for \w with Unicode semantics?
See also Unicode::Semantics.
2010-04-19 update: v5.12 is now available, and the "case changing component" has been fixed: "Perl 5.12 now bundles Unicode 5.2. The “feature” pragma now supports the new “unicode_strings” feature:
use feature "unicode_strings";
This will turn on Unicode semantics for all case changing operations on strings, regardless of how they are currently encoded internally." Read more.
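The effect of the feature on case changing can be sketched as follows. The example assumes Perl 5.12 or later, no locale in effect, and a script that does not otherwise enable unicode_strings (for instance through use 5.012 or later); the variable names are illustrative.

```perl
use strict;
use warnings;

my $native = "\xe0";   # a with grave accent, stored in N8CS

# Default (ASCII) semantics: characters in 0x80-0xFF are left alone.
my $default_uc = uc $native;

# Unicode semantics inside the lexical scope of the feature.
my $unicode_uc;
{
    use feature 'unicode_strings';
    $unicode_uc = uc $native;
}

printf "without the feature: 0x%02X\n", ord($default_uc);
printf "with the feature:    0x%02X\n", ord($unicode_uc);
```

Without the feature the character is returned unchanged (0xE0); with it, uc() produces the capital letter at 0xC0 even though the string is still natively encoded.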
3. Encoding and output
Output from a Web program includes STDOUT (that is sent to your browser for a CGI program), STDERR (that usually goes to the Web server's error log), database writes, log file output, etc.
If outgoing text is not encoded, the text will be sent using the bytes in Perl's internal format, which could be a mixture of native/N8CS and UTF-8. This may work, but don't take a chance — "best practice" calls for explicitly encoding all output appropriately.
Perl will warn you if you print a string with a character that has an ordinal value greater than 255:
$ perl -e 'print "\x{0100}\n"'
Wide character in print at -e line 1.
Ā
To avoid this warning, explicitly encode output (as described below).
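As a small sketch of explicit output encoding, the following prints the same wide character through an encoding layer. An in-memory filehandle stands in for STDOUT or a file here (an assumption made so the written octets can be inspected); no "Wide character" warning is raised because the layer encodes the text on its way out.

```perl
use strict;
use warnings;

my $buffer = '';

# The layer encodes every character written to the handle as UTF-8.
open(my $out_fh, '>:encoding(UTF-8)', \$buffer)
    or die "Cannot open in-memory file: $!";
print {$out_fh} "\x{0100}\n";
close $out_fh or die "Cannot close in-memory file: $!";

printf "%d octets written: %s\n",
    length($buffer),
    join ' ', map { sprintf '%02X', ord($_) } split //, $buffer;
```

The single character U+0100 is written as the two octets C4 80, followed by the newline.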
Output - STDOUT
To ensure all output going back to the Web browser (i.e., STDOUT) is UTF-8 encoded, add the following near the top of your Perl script:
binmode STDOUT, ":encoding(UTF-8)";
If you want to be a little more efficient (but not follow "best practice"), you can opt to only encode the outgoing page if it is flagged as UTF-8:
if (utf8::is_utf8($page)) {
    utf8::encode($page);
}
# else, $page is natively encoded, so skip encoding for output
Here is a snippet that can be used with the CGI::Application framework:
__PACKAGE__->add_callback('postrun', sub {
    my $self = shift;
    # Make sure the output is utf8 encoded if it needs it
    if ($_[0] && ${$_[0]} && utf8::is_utf8(${$_[0]})) {
        utf8::encode(${$_[0]});
        # ${$_[0]} .= 'utf8::encode() called'; # useful for debugging
    }
});
The above code should be put into CGI::Application base class(es). Optionally, the code can be added to cgiapp_postrun().
Note that all of the above encoding techniques will only work properly if all of the input UTF-8 octets were properly decoded.
Output - database
As mentioned above, in the "use UTF-8 everywhere" model, configure your database to store values in UTF-8.
When writing data to a UTF-8 database (INSERT, UPDATE, etc.), ensure your UTF-8 strings get UTF-8 encoded before being written to the database. Do not encode binary field data.
Output - MySQL
As mentioned above, UTF-8 encoding (and decoding) of string field data is automatic if you use the mysql_enable_utf8 database handle attribute. This means you should not call utf8::encode() (or any other UTF-8 encode function) on your strings when using this attribute, because the driver will do that for you. The driver is also smart enough to not encode binary data.
Version 4.004 or higher of DBD::mysql is required. UTF-8 was first available in MySQL v4.1. As of v5.0, it is the system default.
Output - PostgreSQL
As mentioned above, UTF-8 encoding (and decoding) of string field data is automatic if you use the pg_enable_utf8 database handle attribute. This means you should not call utf8::encode() (or any other UTF-8 encode function) on your strings when using this attribute, because the DBD::Pg driver will do that for you. The driver is also smart enough to not encode binary data.
You may (TBD: when?) also need to tell PostgreSQL to expect UTF-8 coming into the database:
SET CLIENT_ENCODING TO 'UTF8';
or
SET NAMES 'UTF8';
See Automatic Character Set Conversion Between Server and Client
Output - files, file handles
If you need to write to files, Perl can automatically encode data as it is written using PerlIO layers:
open(my $out_fh, '>:encoding(UTF-8)', $filename) or die "Cannot open $filename: $!"; # auto UTF-8 encoding on write
If you already have an open filehandle:
binmode $out2_fh, ':encoding(UTF-8)';
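A round trip through a real file can be sketched as below. The temporary file from the core File::Temp module and the sample text are assumptions made for this example; the point is that text written through an encoding layer and read back through a decoding layer compares equal, while the file on disk holds more octets than the string has characters.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file that is removed when the program exits.
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
close $tmp_fh;

my $text = "\x{263a} smile\n";   # begins with a character above 0xFF

# Encode on write.
open(my $out_fh, '>:encoding(UTF-8)', $filename)
    or die "Cannot open $filename for writing: $!";
print {$out_fh} $text;
close $out_fh or die "Cannot close $filename: $!";

my $size_on_disk = -s $filename;

# Decode on read.
open(my $in_fh, '<:encoding(UTF-8)', $filename)
    or die "Cannot open $filename for reading: $!";
my $read_back = <$in_fh>;
close $in_fh;

printf "%d characters in memory, %d octets on disk, round trip %s\n",
    length($text), $size_on_disk, $read_back eq $text ? 'ok' : 'FAILED';
```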
Tell the browser to use UTF-8
To serve a UTF-8 encoded page to a browser, "best practice" is to specify the UTF-8 charset in an HTTP Content-Type header and inside the HTML file in a content-type <meta> tag. CGI.pm defaults to sending the following Content-Type header:
Content-Type: text/html; charset=ISO-8859-1
Add the following to cause UTF-8 to be used instead of ISO-8859-1, where $q is your CGI object:
$q->charset('UTF-8');
If you are using the CGI::Application framework, put the above line in cgiapp_init().
If you are not using CGI.pm to generate your HTML markup, put the following meta tag as the first meta tag in the <head> section of your HTML markup:
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />
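For a script that does not use CGI.pm at all, the header, the meta tag, and the output encoding can be combined roughly as follows. This is only a sketch: build_response() is a hypothetical helper name, and the page body is a placeholder.

```perl
use strict;
use warnings;
use Encode qw(encode);

# build_response() is a hypothetical helper for this sketch. It returns
# the complete CGI response as octets: header, blank line, encoded page.
sub build_response {
    my ($body_text) = @_;
    my $html = join '',
        qq{<html><head>\n},
        qq{<meta http-equiv="content-type" content="text/html; charset=UTF-8" />\n},
        qq{</head><body>$body_text</body></html>\n};
    my $octets = encode('UTF-8', $html);
    return "Content-Type: text/html; charset=UTF-8\r\n\r\n" . $octets;
}

my $response = build_response("\x{263a}");

# The response is already encoded, so write it without any encoding layer.
binmode STDOUT;
print $response;
```

Encoding the page explicitly and writing raw octets is one option; the alternative is to leave the page as text and let an :encoding(UTF-8) layer on STDOUT do the encoding, but the two approaches should not be combined, or the page will be encoded twice.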
Perl source code
If you only need to embed a few Unicode characters in a few strings in your source code, you do not need to save your source code/file in UTF-8. Instead, use \x{...} or chr() in your code:
my $smiley = "\x{263a}";
or
my $smiley = chr(0x263a);
If you have a lot of Unicode characters, or you prefer to save your source code in UTF-8, then you need to tell Perl that your source code is UTF-8 encoded. Do this by adding the following line to your source code:
use utf8; # this script is in UTF-8
This is the only reason your program should ever have the above line -- see utf8.
If your source code is UTF-8 encoded, make sure your editor supports reading, editing, and writing in UTF-8!
Gotchas
Often you may not notice Unicode issues until characters with codepoints above 127 are used. This is because ASCII, ISO-8859-1, Windows-1252, and UTF-8 are all encoded with the same one-byte values for the first 128 Unicode codepoints. To give your application a good Unicode test, try a character in the 0x80 - 0x9F (128-159) range, and a character above 0xFF (255).
Wide character in print at ...
Perl will warn you if you print a string that has a character with an ordinal value greater than 255 (hence it is a "wide" character that requires more than one byte of storage):
Wide character in print at ... line ...
Explicitly encode your output to avoid this warning.
Cannot decode string with wide characters at ...
If you receive this error, your code is probably trying to decode the same string a second time, which will fail.
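The failure can be reproduced with a short sketch such as the one below, which decodes a string and then mistakenly decodes the result again. The sample octets are illustrative, and the exact wording of the error may differ between versions of the Encode module, so the sketch prints whatever message it receives.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $octets = "\xe2\x98\xba";            # UTF-8 octets for U+263A

my $text = decode('UTF-8', $octets);     # first decode: correct

# Second decode: $text now holds a character above 0xFF, which cannot be
# an octet, so decode() dies. eval traps the error for display.
my $again = eval { decode('UTF-8', $text) };
my $error = $@;

print defined $again
    ? "second decode unexpectedly succeeded\n"
    : "second decode failed: $error";
```

Note that the error only appears when the already-decoded string contains a character above 0xFF. A doubly decoded string whose characters all fit in one byte fails silently and produces wrong text instead, which is harder to spot.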
Web server always sends an ISO-8859-1 header
If you followed the steps above, but your pages are not being displayed properly, it could be that your Web server is configured to always send a particular character encoding in a header, such as ISO-8859-1. To determine if a content-type header is being sent by the Web server:
$ lwp-request -de www.bing.com | grep Content
Apache may be configured with the following:
AddDefaultCharset ISO-8859-1
If you can, remove that line, or change it to
AddDefaultCharset UTF-8
if all of the pages served by the server use UTF-8. See also When Apache and UTF-8 Fight.
ISO-8859-1 vs Windows-1252
Since you are learning about character encodings, you need to be aware of the difference between the international ISO-8859-1 and the Microsoft proprietary Windows-1252. From Windows-1252:
[Windows-1252] is a superset of ISO 8859-1 in terms of printable characters, but differs from the IANA's ISO-8859-1 by using displayable characters rather than control characters in the 80 to 9F (hex) range. […] It is very common to mislabel Windows-1252 text with the charset label ISO-8859-1. […] Most modern web browsers and e-mail clients treat the media type charset ISO-8859-1 as Windows-1252 to accommodate such mislabeling. This is now standard behavior in the HTML5 specification, which requires that documents advertised as ISO-8859-1 actually be parsed with the Windows-1252 encoding.
Here's a fun program to try:
my @undefined_chars_in_windows_1252 = (0x81, 0x8d, 0x8f, 0x90, 0x9d);
my %h = map { $_ => undef } @undefined_chars_in_windows_1252;
foreach my $i (0x80 .. 0x9f) {
next if exists $h{$i};
printf "%02x:%c ", $i,$i;
}
What do you see? Do you see the Windows-1252 characters, no characters, square boxes? If you are using PuTTY, Change Settings... Window, Translation and try selecting ISO-8859-1 or Windows-1252 and run the program again.
Microsoft "smart" quotes
Microsoft Word uses those nice left and right fancy/smart quotes. If you copy-paste those characters into a Web form that was served with a Windows-1252 charset (or possibly even an ISO-8859-1 charset), the characters may be submitted to the Web server using the nebulous 0x80-0x9F (128-159) range. (Recall that Unicode defines control characters in this range, not printable characters like smart quotes.) If your Perl script does not decode the submitted form properly (i.e., according to the same character encoding that the Web form used), you will get gibberish.
Decode and encode correctly and you will not have any problems with Microsoft smart quotes or any of the other characters in the nebulous range. Better yet, if you serve all Web pages as UTF-8, submitted forms should never contain these nebulous values, since the "paste" operation should automagically convert these characters to valid Unicode characters. Your Perl script will then only receive valid UTF-8 encoded characters.
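As an illustration of decoding according to the form's encoding, the sketch below takes smart quotes as they would arrive from a Windows-1252 form and re-encodes them as UTF-8. The sample input is made up for this example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Windows-1252 octets: 0x93 and 0x94 are the left and right double quotes.
my $form_octets = "\x93Hello\x94";

# Decode with the encoding the form actually used.
my $text = decode('cp1252', $form_octets);

# Encode as UTF-8 for output.
my $utf8_octets = encode('UTF-8', $text);

printf "first character: U+%04X\n", ord(substr($text, 0, 1));
printf "%d octets in, %d characters, %d octets out\n",
    length($form_octets), length($text), length($utf8_octets);
```

Each quote is one octet in Windows-1252 and three octets in UTF-8, so the seven incoming octets become eleven on output while the character count stays at seven.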
Strange characters in my browser
Strange character: �
This is Unicode's "replacement character" (codepoint U+FFFD), which is used to indicate when a Unicode parser (such as a browser) was not able to decode a stream of Unicode encoded data. The problem is likely an encode/decode problem somewhere in the chain. (U+FFFD encodes to EF BF BD in UTF-8. If you save the Web page and then open it in bvi, you may see EF BF BD.) IE displays the replacement character as the empty square box. Firefox uses the black diamond with the question mark.
Usually, these replacement characters appear because the HTML data is Windows-1252 encoded, but the browser was instructed to use UTF-8 encoding. In your browser, select View->Character Encoding and see if it is set to UTF-8. If so, try selecting Windows-1252 or Western European (Windows) and see if that resolves the problem. If it does, then you know that the Web server is serving up the wrong character encoding: there is a mismatch between what is being sent (i.e., how the data is encoded), and what character set the browser is being told to use (i.e., HTTP header and/or meta tag). If it doesn't resolve the problem, it might be that you don't have a Unicode font installed on your computer, or the Unicode font does not have a glyph for that particular character.
Strange characters: ‘ ’ “ †• – —
These are the individual characters that correspond to the multi-byte UTF-8 encodings for the following Windows-1252 characters:
‘ ’ “ ” • – —
that are in the nebulous 0x80-0x9F (128-159) range. Usually, these characters appear because the HTML data is UTF-8 encoded, but the browser was instructed to use ISO-8859-1 or Windows-1252. In your browser, try changing the encoding to UTF-8 and see if that resolves the problem. If that doesn't resolve the problem, or if the encoding is already set to UTF-8, there may be a double encoding problem somewhere.
Strange characters: â â â â ⢠â â
These also correspond to some of the characters in the nebulous 0x80-0x9F (128-159) range. If you see the above sequences, it is likely that you forgot to decode incoming UTF-8 data (such as form data submitted from a UTF-8 encoded HTML form) in your Perl program and then you UTF-8 encoded it for output. In other words, a natively encoded string was UTF-8 encoded (not good). Fix the problem by calling utf8::decode() on the incoming UTF-8 encoded data.
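A minimal sketch of the mistake and the fix, using only the core utf8:: functions. The sample octets stand in for submitted form data and are an assumption for illustration.

```perl
use strict;
use warnings;

# Octets as they might arrive from a UTF-8 form: U+201C encoded as E2 80 9C.
my $incoming = "\xE2\x80\x9C";

# Mistake: encode without decoding first. Each octet is treated as a
# character and encoded again, giving six octets instead of three.
my $double = $incoming;
utf8::encode($double);

# Fix: decode on input, work with characters, encode on output.
my $text = $incoming;
utf8::decode($text) or die "input is not valid UTF-8";
my $output = $text;
utf8::encode($output);

my @bad  = map { sprintf('%02X', ord($_)) } split(//, $double);
my @good = map { sprintf('%02X', ord($_)) } split(//, $output);
print "double-encoded: @bad\n";
print "round trip:     @good\n";
```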
Strange characters in my editor
- ensure your editor supports reading, editing, and writing in UTF-8
- ensure you set your editor to use a Unicode font
- ensure you have a Unicode font installed
Install a Unicode font on Windows
If you have one of the Microsoft products listed on this page, you should have the Arial Unicode MS font. If it is not installed, follow these steps to install it: Add/Remove Programs, select MS-Office, Add or Remove Features, click "Choose advanced", Office Shared Features, International Support, Universal Font. Apply the changes and restart your Web browser.
I asked for UTF-8 but I got something else!?
If you specifically asked for UTF-8 text, but the octet stream you receive is not valid UTF-8 encoding, in many cases you can probably assume that the incoming text/octets are ISO-8859-1/Latin-1 or Windows-1252. Decode with Windows-1252, since it is a superset of ISO-8859-1.
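One way to apply this fallback is sketched below, assuming the core Encode module. The helper name decode_utf8_or_cp1252 is hypothetical, and falling back to Windows-1252 is a guess about the data, not a guarantee.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: try strict UTF-8 first; if that fails, assume
# Windows-1252 (a guess, as discussed above, not a certainty).
sub decode_utf8_or_cp1252 {
    my ($octets) = @_;
    # FB_CROAK makes decode() die on invalid UTF-8; LEAVE_SRC keeps
    # $octets intact so it can be decoded again below.
    my $text = eval { decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC) };
    return $text if defined $text;
    return decode('cp1252', $octets);
}

my $from_utf8   = decode_utf8_or_cp1252("\xE2\x80\x9C");   # valid UTF-8
my $from_cp1252 = decode_utf8_or_cp1252("\x93");           # not valid UTF-8

printf "U+%04X U+%04X\n", ord($from_utf8), ord($from_cp1252);
```

Both calls yield the same character, U+201C, because the helper picks the decoder that fits the octets.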
Double encoding
If you don't decode UTF-8 text/octets, Perl will assume they are encoded with N8CS (often ISO-8859-1/Latin-1). This means that the individual octets of a multi-byte UTF-8 character are seen as separate characters (not good). If these separate characters are later encoded to UTF-8 for output, a "double encoding" results. This is similar to HTML double encoding, where the reader sees &gt; instead of >.
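The effect on character counts is easy to observe. In this small sketch, which uses only core functions, the same three octets are counted before and after decoding.

```perl
use strict;
use warnings;

# U+263A WHITE SMILING FACE as UTF-8 octets (E2 98 BA), not yet decoded.
my $octets = "\xE2\x98\xBA";

# Undecoded, Perl sees three separate characters.
my $len_before = length $octets;

# After decoding, the same data is a single character.
my $text = $octets;
utf8::decode($text);
my $len_after = length $text;

print "before decode: $len_before characters\n";
print "after decode:  $len_after character\n";
```

Anything that works on characters, such as length(), substr(), or a regex, gives misleading results on the undecoded string.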
Automatic font substitution
Most modern browsers and word processors perform font substitution, which means that if a character is not in the current font, the application will search through all of your fonts until it finds one containing that character and it will then display that character using the glyph in that font.
Sometimes IE7 and IE8 do not seem to perform font substitution correctly. One workaround is to specify a Unicode font as the first font in the CSS font-family property. IE6 is not considered a modern browser, and it does not perform font substitution.
Misc
Create Unicode characters
On Windows, you can always use the Character Map application to select, copy, and (switch to your application then) paste a Unicode character. Ensure the "Character set" drop-down box is set to "Unicode". You can also use the application to view fonts, characters, and Unicode codepoint values for each character.
In Perl
my $utf8_char = "\x{263a}"; # for codepoints above 0xFF
$utf8_char =~ /\x{263a}/; # same syntax for regex
my $cloud_char = chr(0x2601); # run-time, ord() does the reverse
If your Perl source code file is in UTF-8 format, you can enter the Unicode characters directly:
use utf8; # tells Perl this file is UTF-8 encoded
my $utf8_char = "☺"; # U+263a, "White Smiling Face"
In Web forms
On Windows:
- To insert a character from the Windows-1252 codepage: set the Num Lock key on, hold down Alt, then using the numeric keypad, type 0 followed by the decimal value of the character you want.
- To insert a character from the current DOS code page (usually CP-437): follow the same steps as above, but without the initial 0.
But wait, we wanted to insert a Unicode character, not a Windows-1252 or CP-437 character! Well, Windows will convert those characters to Unicode/UTF-8 for us if the application expects UTF-8.
In a Web form (textbox or textarea) type Alt-0147 to generate one of those pesky smart quotes from the Windows-1252 character set. If the Web page's character encoding is set for UTF-8, Windows should translate the 147 character into the corresponding UTF-8 encoding. (Internally, Windows probably translates the 0147 to UTF-16, which is then translated into the character set in use by the application. In this scenario, the character set is Unicode, and Windows-1252 character 147 is translated to its Unicode codepoint equivalent, U+201C.) When the form is submitted, the character should be sent to the Web server UTF-8 encoded as three octets: E2 80 9C — this is what U+201C looks like when encoded with UTF-8.
If the Web page's character encoding is instead set to Windows-1252, the character should be sent as a single octet: 0x93 (that is 147 decimal). If the Web page's character encoding is instead set to ISO-8859-1, the character will also be sent as a single octet, but the value may be either 0x93 or 0x22 (0x22 is the ASCII and ISO-8859-1 quote character). If the browser uses the superset Windows-1252 encoding when ISO-8859-1 is specified, 0x93 is sent. Otherwise, the character will be translated to the only quote character officially defined in ISO-8859-1, 0x22.
Hopefully you see why it is imperative to know which encoding was used for the incoming form/text, so that it can be decoded properly (as UTF-8 or Windows-1252) in your Perl program.
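The octet values mentioned above can be checked with the core Encode module. This sketch shows what Perl's Encode produces for U+201C under each encoding. It is not a model of browser behavior, which, as noted, may differ for ISO-8859-1.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $left_quote = "\x{201C}";   # the character Alt-0147 produces

# Octets for the same character under two page encodings.
my $as_utf8   = encode('UTF-8',  $left_quote);   # E2 80 9C
my $as_cp1252 = encode('cp1252', $left_quote);   # 93

# ISO-8859-1 has no such character. With the default CHECK value,
# Encode substitutes "?" rather than choosing a look-alike such as 0x22.
my $as_latin1 = encode('iso-8859-1', $left_quote);

for my $pair (['UTF-8', $as_utf8], ['cp1252', $as_cp1252], ['iso-8859-1', $as_latin1]) {
    my @hex = map { sprintf('%02X', ord($_)) } split(//, $pair->[1]);
    printf "%-10s %s\n", $pair->[0], "@hex";
}
```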
See also How do I enter ... - Yahoo Answers.
UTF-8 vs utf8
As of Perl 5.8.7, UTF-8 is the strict, official UTF-8. The Encode module will complain if you try to encode or decode invalid UTF-8, e.g.,
encode("UTF-8", "\x{FFFF_FFFF}", 1); # croaks
In contrast, utf8 is the liberal, lax, version, allowing just about any 4-byte values:
encode("utf8", "\x{FFFF_FFFF}", 1); # okay
encode_utf8("\x{FFFF_FFFF}"); # okay (encode_utf8 takes no CHECK argument)
Encode as of version 2.10 knows the difference.
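utf8::encode() and utf8::decode() operate on Perl's internal, extended form of UTF-8, so they behave like the lax utf8 encoding rather than the strict UTF-8 one.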
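The difference between the two names can be observed directly. The sketch below assumes the core Encode module and uses 0x110000, the first value past the Unicode range, instead of the larger value shown above. It passes the string in a variable with LEAVE_SRC so that Encode does not try to modify a literal.

```perl
use strict;
use warnings;
no warnings 'utf8';   # 0x110000 is deliberately not a Unicode codepoint
use Encode qw(encode find_encoding FB_CROAK LEAVE_SRC);

# "UTF-8" resolves to the strict encoder, "utf8" to the lax one.
my $strict_name = find_encoding('UTF-8')->name;
my $lax_name    = find_encoding('utf8')->name;

# One past the last Unicode codepoint (U+10FFFF).
my $beyond_unicode = chr(0x110000);

# Strict encoding refuses the value; eval traps the exception.
my $strict = eval { encode('UTF-8', $beyond_unicode, FB_CROAK | LEAVE_SRC) };
my $strict_error = $@;

# Lax encoding accepts it and produces four octets.
my $lax = eval { encode('utf8', $beyond_unicode, FB_CROAK | LEAVE_SRC) };

print "UTF-8 is '$strict_name', utf8 is '$lax_name'\n";
print "strict: ", (defined $strict ? "encoded" : "croaked"), "\n";
print "lax:    ", (defined $lax    ? "encoded" : "croaked"), "\n";
```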
Encode module vs built-in/core utf8::
To decode and encode UTF-8, you can use the Encode module or the functions defined in the utf8:: package by the Perl core. The Encode module is more flexible, allowing different ways of handling malformed data. However, the utf8:: functions work in place, and utf8::decode() can leave the UTF8 flag off for ASCII-only data.
You should be aware of a bug in the Encode module: whenever text is decoded using the Encode module, the UTF8 flag is always turned on. The documentation would lead you to believe that the UTF8 flag is off if the text only contains ASCII characters and you are decoding UTF-8. This is not what happens — the flag is always turned on, as the table below depicts.
There are performance gains to be had if the UTF8 flag can be kept off after decoding (and this is fine if the text only contains ASCII octets). Use utf8::decode() to obtain this efficiency, since it does not turn the flag on if the octet sequence only contains ASCII octets. (This is the decode function I normally use.)
Below, see Encode's documentation for CHECK options, which relate to how the module handles malformed data.
Function | UTF8 flag | Description / Notes |
---|---|---|
$flag = utf8::is_utf8($string); | N/A | Tests whether $string is internally encoded as UTF-8. Returns false if not; otherwise returns true. |
$flag = utf8::decode($utf8_octets); | depends | Attempts to convert in-place the UTF-8 octet sequence into the corresponding N8CS or UTF-8 string, as appropriate. If $utf8_octets contains non-ASCII octets (i.e., multi-byte UTF-8 encoded characters), the UTF8 flag is turned on, and the resulting string is UTF-8. Otherwise, the UTF8 flag remains off, and the resulting string is N8CS. This is the only decode function that may result in an N8CS byte string. Returns false if $utf8_octets is not UTF-8 encoded properly; otherwise returns true. |
$utf8_string = decode('UTF-8', $utf8_octets [, CHECK]) | turned on | Decodes the UTF-8 octet sequence into a UTF-8 character string. Strict, official UTF-8 decoding rules (see previous section for discussion) are followed. |
$utf8_string = decode('utf8', $utf8_octets [, CHECK]) | turned on | Decodes the UTF-8 octet sequence into a UTF-8 character string. Lax, liberal decoding rules (see previous section for discussion) are followed. |
$utf8_string = decode_utf8($utf8_octets [, CHECK]) | turned on | Decodes the UTF-8 octet sequence into a UTF-8 character string. Equivalent to decode("utf8", $utf8_octets), hence lax decoding is employed. |
$octet_count = utf8::upgrade($n8cs_string); | turned on | Converts in-place the N8CS byte string into the corresponding UTF-8 character string. Returns the number of octets now used to represent the string internally as UTF-8. This function should be used to convert N8CS byte strings with characters in the 0x80-0xFF range to UTF-8, thereby avoiding the Perl 5 "Unicode Bug". |
utf8::encode($string) | turned off | Converts in-place the N8CS or UTF-8 $string into a UTF-8 octet sequence. |
$utf8_octets = encode('UTF-8', $string [, CHECK]) | turned off | Encodes the N8CS or UTF-8 $string into a UTF-8 octet sequence. Strict, official UTF-8 encoding rules (see previous section for discussion) are followed. |
$utf8_octets = encode('utf8', $string) | turned off | Encodes the N8CS or UTF-8 $string into a UTF-8 octet sequence. Lax, liberal UTF-8 encoding rules (see previous section for discussion) are followed. Since all possible characters have a lax utf8 representation, this function cannot fail. |
$utf8_octets = encode_utf8($string) | turned off | Encodes the N8CS or UTF-8 $string into a UTF-8 octet sequence. Equivalent to encode("utf8", $string), hence lax encoding is employed. Since all possible characters have a lax utf8 representation, this function cannot fail. |
$flag = utf8::downgrade($utf8_string [, FAIL_OK]); | turned off | Converts in-place the UTF-8 character string to the equivalent N8CS byte string. Fails if $utf8_string cannot be represented in N8CS encoding. On failure, dies unless FAIL_OK is true, in which case it returns false. Returns true on success. |
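The flag behavior in the table can be inspected with utf8::is_utf8(). This is a small sketch using sample strings chosen for illustration. The Encode::decode() result is printed for comparison only, since it may depend on the installed Encode version.

```perl
use strict;
use warnings;
use Encode ();

# ASCII-only octets: utf8::decode() succeeds and leaves the UTF8 flag off.
my $ascii = "plain";
utf8::decode($ascii);
my $ascii_flag = utf8::is_utf8($ascii) ? 1 : 0;

# Octets with a multi-byte sequence: the flag is turned on.
my $quote = "\xE2\x80\x9C";
utf8::decode($quote);
my $quote_flag = utf8::is_utf8($quote) ? 1 : 0;

# Encode::decode() on the same ASCII octets, for comparison. The text
# above reports that the flag ends up on; this may vary by Encode version.
my $via_encode  = Encode::decode('UTF-8', "plain");
my $encode_flag = utf8::is_utf8($via_encode) ? 1 : 0;

# utf8::upgrade() changes only the internal representation, not the value.
my $native = "caf\xE9";
utf8::upgrade($native);
my $upgraded_flag = utf8::is_utf8($native) ? 1 : 0;
my $same_value    = ($native eq "caf\xE9") ? 1 : 0;

print "utf8::decode, ASCII input:     flag=$ascii_flag\n";
print "utf8::decode, non-ASCII input: flag=$quote_flag\n";
print "Encode::decode, ASCII input:   flag=$encode_flag\n";
print "utf8::upgrade:                 flag=$upgraded_flag, same value=$same_value\n";
```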
Perl character encodings
To determine which character encodings your Perl supports:
perl -MEncode -le "print for Encode->encodings(':all')"
It is important to remember that Perl only uses two character encodings internally: native/byte and UTF-8/character. Any characters encoded with something other than N8CS, the platform's native 8-bit character set (often ISO-8859-1/Latin-1), must be decoded as they enter Perl.
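One common way to decode at the boundary is an :encoding layer on the filehandle. The sketch below reads from and writes to in-memory scalars instead of real files so that it is self-contained. The sample data is an assumption for illustration.

```perl
use strict;
use warnings;

# Octets in Windows-1252 (sample data, held in memory instead of a file).
my $cp1252_octets = "\x93quoted\x94\n";

# Decode on the way in: the :encoding layer turns octets into characters.
open my $in, '<:encoding(cp1252)', \$cp1252_octets or die "open: $!";
my $line = <$in>;
close $in;

# Encode on the way out: characters become UTF-8 octets.
my $utf8_octets = '';
open my $out, '>:encoding(UTF-8)', \$utf8_octets or die "open: $!";
print {$out} $line;
close $out;

printf "first character read: U+%04X\n", ord($line);
printf "octets written: %d\n", length($utf8_octets);
```

With a real file, the scalar reference would be replaced by a filename; the layers work the same way.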
What does Website "x" use?
View a page, then in your browser, select View->Character Encoding to see which encoding was selected. Also look at the HTML source and see if the meta tag is present:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
You can also see what Content-Type header is being returned using:
$ lwp-request -de www.bing.com | grep Content
This wiki uses UTF-8.
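Once you have the Content-Type value, whether from the HTTP header or from the meta tag, the charset can be extracted with a regex. The helper name charset_of below is hypothetical, and the pattern is a simple sketch rather than a full header parser.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the charset parameter out of a Content-Type
# value, whether it came from an HTTP header or a meta tag.
sub charset_of {
    my ($content_type) = @_;
    if ($content_type =~ /\bcharset\s*=\s*["']?([^\s"';]+)/i) {
        return lc $1;   # normalize to lowercase for comparison
    }
    return undef;       # no charset parameter present
}

my $from_header = charset_of('text/html; charset=UTF-8');
my $no_charset  = charset_of('text/html');

print "charset: ", (defined $from_header ? $from_header : '(none)'), "\n";
print "charset: ", (defined $no_charset  ? $no_charset  : '(none)'), "\n";
```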
HTML character entities
In your UTF-8 travels, you may come across HTML character entities. Starting with HTML 4.0, 252 character entities are supported. Each of these has a Unicode codepoint and an entity name. Either can be used in HTML markup. For example, the registered sign can be represented in HTML as either &reg; (entity name) or &#174; (numeric codepoint).
Many fonts support this set of characters, and if the set is sufficient for your application, UTF-8 may not be required, but your application will need to use the HTML encoding wherever a special character is needed.
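If you take this route, non-ASCII characters can be converted to numeric references with a substitution. The helper name to_numeric_entities is hypothetical, and this sketch does not escape &, <, or >, which real HTML output would also need.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every non-ASCII character with a numeric
# character reference, so the output is pure ASCII.
sub to_numeric_entities {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/sprintf('&#%d;', ord($1))/ge;
    return $text;
}

# Sample text: a registered sign (U+00AE) and a smiley (U+263A).
my $html = to_numeric_entities("Acme\x{AE} \x{263A}");
print "$html\n";
```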
Operating systems and Unicode
It is interesting to note which Unicode encoding popular operating systems use. From Wikipedia: "Windows NT (and its descendants, Windows 2000, Windows XP, Windows Vista and Windows 7), uses UTF-16 as the sole internal character encoding. The Java and .NET bytecode environments, Mac OS X, and KDE also use it for internal representation. UTF-8 has become the main storage encoding on most Unix-like operating systems (though others are also used by some libraries) because it is a relatively easy replacement for traditional extended ASCII character sets."
References
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) - by Joel Spolsky
- FMTYEWTK about Characters vs Bytes - Perlmonks
- CGI::Application and UTF-8 Form Processing example - by Mark Rajcok
- Perl Unicode tutorial
- Perl Unicode FAQ
- Perl utf8 pragma
- Perl Encode module - handles all character encoding and decoding
- Unicode - Wikipedia
- Perl Unicode introduction
- Unicode support in Perl
- Unicode::Semantics - work around the Perl 5 Unicode bug
- there are many Unicode::xxx modules on CPAN
- UTF-8 round trip with MySQL - Perlmonks
- CGI::Application - Which is the proper way of handling and outputting utf8 - Perlmonks
- Understanding CGI.pm and UTF-8 handling - Perlmonks
- UTF-8 and Unicode FAQ for Unix/Linux
- Perl Unicode Mailing List <perl-unicode@perl.org>
Footnotes
^ - N8CS is a term that was coined for this document. Do not expect to see this term used elsewhere.