Zoph/Upgrading/Changing your database to UTF-8

As of Zoph v0.8.1, Zoph is using UTF-8 character encoding. In order to fully benefit from this change, you need to convert your database to UTF-8 encoding as well; especially if you are using any non-Latin characters in your database.

Important!

This could be a riskful operation, make sure you make a backup of your database and do not delete it until you are very sure everything has converted correctly.

Converting the database to UTF-8

edit

The safest way to perform this operation is probably to create a second Zoph database and test against that. This document will describe that method. If you wish to go back to the old database name later or can only use a single database, because of hosting restrictions, you will have to drop the "old" database and re-use the same database. Double-check your backup in that case!

Create a backup

mysqldump -u root -p zoph | gzip > zoph.sql.gz

Create a new database, with UTF-8 encoding

mysql -u root -p
mysql> CREATE DATABASE zophutf8 DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
mysql> exit
Important!

You can not use ALTER TABLE to do this, because MySQL will then set the character encoding of the database to UTF-8, but will leave it on whatever it is (usually Latin-1) for the tables

Find out what your current encoding is:

zcat zoph.sql.gz | grep CHARSET

Usually you will see several lines that look like this

) ENGINE=MyISAM DEFAULT CHARSET=latin1;

In this case we have been using latin1 encodig so far. You will also see that some lines contain things like AUTO_INCREMENT=161, those are not important for us now.

We will now change the encoding in this file to UTF-8. In the tests I have done, it was not needed to really convert the character sets, it was sufficient to just change the name of the encoding. If you do run into issues with texts that were correctly displaying before the change and no longer after the change, you may need to use iconv to do a conversion. If you do need that, let me know, or add the information to this document.

zcat zoph.sql.gz | sed "s/CHARSET=latin1/CHARSET=utf8/g" | gzip > zophutf8.sql.gz

Import the database into the newly create database:

zcat zophutf8.sql.gz | mysql -u root -p zophutf8

You need to update zoph.ini and give rights to the user you use to connect to MySQL.

zoph.ini:
db_name = "zophutf8"
db_user = "zoph_rw"
db_pass = "PASSWORD"
mysql -u root -p
mysql> grant select, insert, update, delete on zophutf8.* to zoph_rw@localhost identified by 'PASSWORD';
mysql> grant all on zophutf8.* to zoph_admin identified by 'PASSWORD';

Now, perform the rest of the update and test!

Changing the database name back to zoph

edit

If you have tested and confirmed that the new database is working for you, and you want to change the name of the database back to zoph, you need to perform these steps. These steps are not really necessary, but do keep in mind that future Zoph documentation will refer to the database by the name of zoph and not zophutf8.

mysqldump -u root -p zophutf8 | gzip > zophutf8.sql.gz
mysqladmin -u root -p drop zoph
mysql -u root -p
mysql> CREATE DATABASE zoph DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
mysql> exit
zcat zophutf8.sql.gz | mysql -u root -p zoph
zoph.ini:
db_name = "zophutf8"
db_user = "zoph_rw"
db_pass = "PASSWORD"
mysql -u root -p
mysql> grant select, insert, update, delete on zoph.* to zoph_rw@localhost identified by 'PASSWORD';
mysql> grant all on zoph.* to zoph_admin identified by 'PASSWORD';

Rolling back to the old database

edit

If you find during testing that the database conversion has not worked for you, you can easily roll back to the old database by changing config.inc.php:

define('DB_NAME', 'zoph')

Please report a bug or post on the forum if you do run into problems.

Background

edit

If you don't know what UTF-8 is, here's some background. If you wish to find out more, follow the links to Wikipedia; the Wikipedia page on character encoding is a good starting point as well.

What is UTF-8?

edit

UTF-8 is a way of encoding characters.

Computers have been using ASCII-encoding, which is a 7-bit encoding that has room for 128 different characters; 33 of those are used by control characters (such as escape and backspace), so there are 96 left for 26 lowercase letters, 26 uppercase letters, 10 digits and a handful symbols such as punctuation marks; leaving no room for "international" characters, such as é, ñ or ä. This is often referred to as US-ASCII.

However, when 8-bit character sets (often wrongfully referred to as "8 bit ASCII") started to get used, there was additional space for about 127 characters. That seems like a lot, but it was in no way sufficient to encode all the different characters used in different languages. A whole range of character sets were created to accommodate all of them. Eventually, this more or less standardized in a few dozen character sets, of which Latin-1 or ISO-8859-1, which contains all the characters needed by most West-European languages, is probably one of the most common.

Another issue arises with languages that have more than 26 (or 255 for that matter) characters, such as Chinese or Japanese. For these languages several double-byte encoded character sets were defined, such as Big5 and ISO-2022-JP.

Converting between different character sets is often not possible: besides the first 127 characters, containing the 'standard' Latin letters, numbers and symbols, there is little overlap between the character sets. Also, a computer needs to be told which character set it is reading, for example, code 232 in Latin-1 is è, while in Latin-2, which is meant for Central and Eastern European languages, it means č. Without a hint, it does not know how to decode it.

Eventually, a solution to that problem was created, with UTF-16, a double-byte encoding system with sufficient space to encode all writing systems currently in use and by the possibility of even using 32 bits, the number of characters that can be encoded is virtually endless. However, this system also had its issues: the limited backward compatibility with ASCII- or 8-bit ASCII-based encodings and that any text in Latin characters would require a double amount of storage (16 bits per character, instead of 8).

Eventually, UTF-8 was created to resolve this issue, where a character could be encoded in 8, 16 or 32 bits, where the 8-bit encoding would correspond to the original 7-bit ASCII set, plus a leading zero - just like in the 8-bit ASCII-based character sets.

How did Zoph handle this?

edit

Zoph has been a multilingual program for a long time, only, the multi-language implementation of Zoph contained a design-error: Zoph would change its encoding scheme based on the language the user was using Zoph in. So, this would mean that a user using the Turkish language setting would create an album, Zoph would be using ISO-8859-9 to store data in the database; when another user, using the French translation and therefore using ISO-8859-1 would retrieve it, Zoph would be telling the browser "this is ISO-8859-1 encoded text" and then send ISO-8859-9 encoded text. This would usually not be a problem, except when using characters outside the US-ASCII set, then you'd end up with names like "Stato della Citt? del Vaticano", "Česká republika" and "G?teborg".

The solution

edit

The solution to this problem, is to convert all the translations to UTF-8, always communicate with the browser with UTF-8 encoding and store all data in the database with UTF-8 encoding. The first 2, I have taken care of, however, you will need to convert your database to UTF-8.

Will this solve all the issues?

edit

No, if you currently have texts like the ones mentionned above, with question marks or unreadable characters in your Zoph installation, chances are that you will not automatically get rid of those. Unfortunately, since there is no way of knowing in which encoding this was written to the database, it is not always possible to recover them.

But... any future additions to the database, should no longer suffer from this problem. (Tell me if you do have issues!)