A basic walkthrough to avoid having problems with "funny characters" on your website.
Why think about multilingualism? Most developers ask this question when the topic comes up for their websites. The common problem is that you do not care about that stuff until someone fills in your comment form and their name suddenly shows up with funny characters (question marks, squares or other odd symbols) in between. Now you have to think about it!
The biggest problem is that most developers lack knowledge about internationalization, localization, character encodings, Unicode and all the other terms connected with multilingualism. This article should give you a basic understanding and show you how to avoid those funny characters.
First, some short definitions:
Internationalization (I18N) is the process of designing a software application so that it can be adapted to various languages and regions without engineering changes. Localization (L10N) is the process of adapting software for a specific region or language. (from Wikipedia)
A character set describes the repertoire of characters for a language or region, for example the Latin or Greek alphabet. A character encoding describes how the characters of a given character set are represented as codes. Examples are ASCII, the ISO-8859 family and UTF-8.
ASCII is a simple 7-bit character encoding. First published in 1963 and substantially revised in 1967, it can represent the Latin lower- and uppercase letters, the ten digits, some punctuation marks and some control characters. This is enough for displaying text in English. The ISO-8859 standard describes 15 different 8-bit character sets. Each encoding has room for 256 different characters, and the first 128 are always identical to ASCII.
The ISO-8859 family and ASCII are single byte character encodings because every character fits into a single byte. Character sets with more than 256 characters need an encoding which can use more than one byte per character (Multibyte character encodings).
Unicode (whose character repertoire is kept in sync with the Universal Character Set, UCS) aims to contain all characters ever used in every culture on this planet. Each character has a unique number (its code point) or is produced by combining several other characters. The standard has existed since 1991 and is under constant development. Unicode version 5.0 (from July 2006) contains about 99,000 characters. The characters are organized into planes and blocks.
UTF-8 is one possible character encoding for Unicode. With UTF-8 it is possible to encode every character contained in Unicode. Encoded characters take from one to four bytes (the original design allowed up to six, but RFC 3629 limits UTF-8 to the Unicode range, which never needs more than four). The advantage is that the first 128 characters are encoded exactly as in ASCII.
UTF-8 is therefore the answer to most of our encoding problems on websites. All current operating systems, browsers, editors and other software should be able to handle the UTF-8 character encoding. With this encoding we can happily mix Greek and Japanese characters on one single page.
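To make the variable width a bit more tangible, here is a small sketch that prints how many bytes a few characters occupy in UTF-8. The sample characters and the variable names are my own choice for illustration, and the "\u{...}" escapes assume PHP 7 or later.

```php
<?php
// Sample characters written as code point escapes, so the result does not
// depend on the encoding in which this source file is saved.
$samples = [
    'Latin capital A'          => "\u{41}",    // ASCII range
    'Latin small n with tilde' => "\u{F1}",
    'Euro sign'                => "\u{20AC}",
    'Grinning face emoji'      => "\u{1F600}", // outside the Basic Multilingual Plane
];

$byteLengths = [];
foreach ($samples as $name => $char) {
    // strlen() counts bytes, not characters, which is exactly what we want here.
    $byteLengths[$name] = strlen($char);
    printf("%-26s %d byte(s), hex %s\n", $name, strlen($char), bin2hex($char));
}
```

The ASCII letter needs one byte, and the other three need two, three and four bytes respectively.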
Preparations for UTF-8 output:
There are several places where you can specify the character encoding for your pages: the XML declaration, the HTML meta tag and the HTTP Content-Type header. It is important to specify the same encoding in all of them. You should not leave out a definition, because this can result in the browser trying to guess the encoding from the information available. You can, for example, send your document with the UTF-8 definition only in the HTTP header. But if you save that document to disk and display the saved file, the browser is missing the encoding and has to guess, which does not always work well.
Example: Specifying UTF-8 in XML declaration and HTML meta tag:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
Setting the right header from within PHP:
header('Content-Type: text/html; charset=utf-8');
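Since the header and the meta tag have to agree, one way to keep them in sync is to derive both from a single constant. The following is only a sketch: PAGE_CHARSET, contentTypeHeader() and contentTypeMetaTag() are hypothetical names invented for this example, not part of PHP.

```php
<?php
// Hypothetical helpers: one constant feeds both the HTTP header and the
// meta tag, so the two declarations cannot drift apart.
const PAGE_CHARSET = 'utf-8';

function contentTypeHeader(string $charset = PAGE_CHARSET): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

function contentTypeMetaTag(string $charset = PAGE_CHARSET): string
{
    return '<meta http-equiv="Content-Type" content="text/html; charset='
        . $charset . '"/>';
}

// Headers can only be sent before any output has started.
if (!headers_sent()) {
    header(contentTypeHeader());
}
```

In a template you would then echo contentTypeMetaTag() inside the head element instead of typing the charset a second time.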
Setting UTF-8 as default in Apache:
AddDefaultCharset utf-8
in httpd.conf or .htaccess
You can check which encoding the browser is using. In Firefox, for example, right-click on a document and select "View Page Info".
Receiving data from the browser
What happens to data that is transferred from a browser to PHP? Usually browsers send data back in the encoding used to display the page itself. Alternatively you can specify the encoding by hand using the accept-charset attribute of the form tag: <form accept-charset="utf-8">...</form>. So if you deliver your pages in UTF-8 you can generally assume that the data you receive is also UTF-8.
Be careful when using data from other sources like web services or RSS feeds. Always respect the encoding that is specified in the header and/or content. The best practice is to convert all incoming data to UTF-8 immediately. Then you don't have to remember the character encoding in which it was originally received.
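A minimal sketch of such a "convert on arrival" step could look like this. The function name toUtf8() is hypothetical, and the sketch assumes that the mbstring extension is loaded and that the caller already knows the declared encoding (from the HTTP header, the XML declaration or the feed metadata).

```php
<?php
// Hypothetical helper: normalise incoming text to UTF-8 as early as possible.
function toUtf8(string $data, string $declaredEncoding): string
{
    if (strcasecmp($declaredEncoding, 'UTF-8') === 0) {
        // Declared as UTF-8 already: verify the bytes instead of converting.
        if (!mb_check_encoding($data, 'UTF-8')) {
            throw new InvalidArgumentException('Data is not valid UTF-8');
        }
        return $data;
    }

    // mb_convert_encoding(string, target encoding, source encoding)
    return mb_convert_encoding($data, 'UTF-8', $declaredEncoding);
}

// Usage: the byte 0xF6 is "ö" in ISO-8859-1.
$name = toUtf8("J\xF6rg", 'ISO-8859-1');
```

The same check on the UTF-8 branch is also handy for form data, where you expect UTF-8 but cannot be sure the client really sent it.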
UTF-8 strings in PHP:
The bad thing is that PHP does not know about multibyte strings. The good thing is that it does not try to convert your data automatically. Strings in PHP are really only a sequence of bytes. So the call to strlen('Iñtërnâtiônàlizætiøn') returns 27, the number of bytes in UTF-8, although the string has only 20 characters.
Fortunately there is the mbstring extension, which brings multibyte character support to PHP. The extension defines multibyte-safe versions of the string functions (e.g. mb_strlen()) which work fine with UTF-8 encoded strings. So mb_strlen('Iñtërnâtiônàlizætiøn', 'UTF-8') will give you the correct number of characters. With mbstring you could even make PHP always use the mbstring functions instead of the normal string functions (the mbstring.func_overload setting), but this feature was deprecated in PHP 7.2 and removed in PHP 8.0, so it is better to call the mb_* functions explicitly.
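Here is a short sketch of the difference between the byte-oriented and the multibyte-aware functions, using the example word from the text. It assumes the mbstring extension is loaded; the word is written with "\u{...}" escapes (PHP 7+) so the result does not depend on how the file is saved.

```php
<?php
// "Iñtërnâtiônàlizætiøn" as code point escapes: 20 characters, 7 of them non-ASCII.
$word = "I\u{F1}t\u{EB}rn\u{E2}ti\u{F4}n\u{E0}liz\u{E6}ti\u{F8}n";

$bytes = strlen($word);             // counts bytes
$chars = mb_strlen($word, 'UTF-8'); // counts characters

// substr() cuts after two bytes, which is in the middle of the "ñ".
$brokenPrefix = substr($word, 0, 2);
// mb_substr() cuts after two characters and keeps the "ñ" intact.
$safePrefix = mb_substr($word, 0, 2, 'UTF-8');

printf("bytes: %d, characters: %d\n", $bytes, $chars);
printf("substr() result is valid UTF-8: %s\n",
    mb_check_encoding($brokenPrefix, 'UTF-8') ? 'yes' : 'no');
printf("mb_substr() result is valid UTF-8: %s\n",
    mb_check_encoding($safePrefix, 'UTF-8') ? 'yes' : 'no');
```

The broken prefix is exactly the kind of string that later shows up as a question mark or a square in the browser.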
Another useful extension is iconv, which enables you to convert between a huge number of different character encodings. It is especially helpful if you have data sources in different encodings and you want them all to be handled internally as UTF-8.
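As a small illustration of why the source encoding matters, the following sketch converts the same single byte from two different ISO-8859 encodings to UTF-8 with iconv(). The variable names are made up for this example, and it assumes the iconv extension and an iconv implementation that knows both encodings.

```php
<?php
// One and the same byte stands for different characters,
// depending on which ISO-8859 part it was written in.
$byte = "\xE4";

$fromLatin1 = iconv('ISO-8859-1', 'UTF-8', $byte); // U+00E4, a with diaeresis
$fromGreek  = iconv('ISO-8859-7', 'UTF-8', $byte); // U+03B4, Greek small delta

// Converting back to the original encoding yields the original byte again.
$roundTrip = iconv('UTF-8', 'ISO-8859-7', $fromGreek);

// iconv() returns false if the conversion fails, so check before using the result.
if ($fromLatin1 === false || $fromGreek === false || $roundTrip === false) {
    throw new RuntimeException('Conversion failed');
}

printf("%s / %s\n", bin2hex($fromLatin1), bin2hex($fromGreek));
```

Once everything is UTF-8, both characters can live in the same string without any further bookkeeping.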
UTF-8 and MySQL
MySQL has had good UTF-8 support since version 4.1. Just be careful because UTF-8 is called "utf8" in MySQL, and that character set stores at most three bytes per character; characters that need four bytes require utf8mb4 (available since MySQL 5.5.3). The character encoding used for your stored data can be defined per server, database, table or even column.
(CREATE | ALTER) TABLE ... DEFAULT CHARACTER SET utf8
(CREATE | ALTER) DATABASE ... DEFAULT CHARACTER SET utf8
Additionally you can specify a collation (sorting function) for each server, database, table or column.
The connection between PHP and MySQL needs to know that you are sending UTF-8 data. It is best to set this right after you connect to the database:
mysql_query("SET NAMES 'utf8'");
After that it is perfectly safe to use your UTF-8 strings in your MySQL database.
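Note that the mysql_* functions were removed in PHP 7. With PDO the connection charset can be given directly in the DSN, so no separate SET NAMES query is needed. The sketch below only builds that DSN string; mysqlDsn() is a hypothetical helper, and the host, database name and credentials in the usage comment are placeholders. The default of utf8mb4 is my assumption for a current MySQL server; pass 'utf8' to match the statements above.

```php
<?php
// Hypothetical helper: build a PDO DSN for MySQL that includes the
// connection charset (the charset option is honoured since PHP 5.3.6).
function mysqlDsn(string $host, string $database, string $charset = 'utf8mb4'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $database, $charset);
}

// Usage (placeholders, not real credentials):
// $pdo = new PDO(mysqlDsn('localhost', 'blog'), 'user', 'secret');
//
// With mysqli the equivalent call after connecting is:
// $mysqli->set_charset('utf8mb4');
```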
With a few simple measures it is easy to make your own web application or website multilingual. With consistent use of UTF-8 it is possible to have characters from different languages appear on one single site. Now your Greek friend can leave comments on your blog in Greek characters. I bet you will not understand them, but at least they will be displayed correctly :)
There is a very good FAQ about this topic by Kore Nordmann: http://kore-nordmann.de/blog/articles/php_charset_encoding_FAQ.html
Once again, using a framework can be very helpful.
Symfony handles internationalization and localization very well.
Even if you want to build your own system, studying how they handle it for the interface, databases, … can be interesting.
Multilingual Websites with PHP – ThinkPHP /dev/blog
This article posted by Florian Eibeck on setting up php, Apache, and MySQL for dealing with international character sets is a good overview of the groundwork required to support a UTF-8 compliant web app. As Florian rightly points out, there are non-U…
Thanks for a nice article about this.
Also thanks to Kore for the FAQ-link.
P.S. Just for fun, some Danish characters below:
If I had read this 1 year ago I would have saved a lot of time!
It took me a long time to discover the three main steps:
the HTML header / PHP header;
the database collation;
the SQL selection using utf8.
I always use utf8. It's a good idea.
A suggestion for a follow-up to this article would be to cover the NLS (Native Language Support) API through the PHP gettext tools: http://php.net/manual/en/book.gettext.php