day macha
Unicode and UTF-8 Explained

I spent some time reading this excellent but lengthy guide to Unicode and UTF-8 and playing with Unicode a little. Here’s a hopefully clearer and gently paced guide telling you what you need to know, especially if you’re working with a programming language that does not automatically handle Unicode (it’s a good overview for non-programmers, too).


Before Unicode

Everything in a computer, as you know, is stored as naughts and ones. These naughts and ones can represent numbers: 01000001 represents 65. They are also used to represent letters: 65 (01000001) is the uppercase letter A, for example.
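
The mapping from bits to a number to a letter can be tried out directly. This is a small Python sketch; the variable names are only for illustration.

```python
# Interpret a string of bits as a number, then look up the letter it maps to.
bits = "01000001"
number = int(bits, 2)   # binary string -> integer
letter = chr(number)    # integer -> character
print(number, letter)   # 65 A

# And back again: letter -> number -> eight-bit binary string.
back_to_bits = format(ord("A"), "08b")
print(back_to_bits)     # 01000001
```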


The mapping of numbers to letters started off in a mapping called ASCII (where 65 represents uppercase A). But ASCII only dealt with English letters. So new mappings were made for Chinese, for Tibetan, for a whole number of European languages, for Russian, etc, etc.

But this was not an ideal solution. Every application that wanted to deal with not only English but many other languages had to deal with all these mappings (called encodings or character sets). Eventually a single mapping was invented to deal with all possible languages: Unicode.

The problem of backwards compatibility

But there was a problem. The many applications which only dealt with ASCII thought one letter was one eight-bit binary number (a 0 or 1 is a bit) like 01000001 (65). But Unicode uses more than eight bits to represent many letters.

For example, in Unicode the Cyrillic letter “ҩ” is represented in binary as two sets of eight bits (eight bits is called a byte from now on): 00000100 10101001. But ASCII-only applications see that as two letters, two bytes. They would take the first byte (00000100) as a special command meaning end of transmission, and the second (10101001) as a byte that is not an ASCII letter at all. Not Cyrillic at all!
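
Here is a minimal Python sketch of what an ASCII-only reader would make of the two raw bytes for “ҩ” (code point U+04A9). Writing the code point out as two plain bytes is an assumption made for illustration, and the variable names are hypothetical.

```python
letter = "\u04a9"                     # the Cyrillic letter ҩ
code_point = ord(letter)              # 1193, i.e. hexadecimal 04A9
raw = code_point.to_bytes(2, "big")   # the code point written as two plain bytes
bit_strings = [format(b, "08b") for b in raw]
print(bit_strings)                    # ['00000100', '10101001']

# An ASCII-only reader looks at each byte on its own.
first, second = raw
print(first)          # 4, the ASCII control code "end of transmission"
print(second < 128)   # False: 169 is outside the ASCII range

# Asking Python to read the pair as ASCII fails on the second byte.
try:
    raw.decode("ascii")
    decoded_as_ascii = True
except UnicodeDecodeError:
    decoded_as_ascii = False
print(decoded_as_ascii)   # False
```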

Enter UTF-8

So, to make Unicode backwards compatible with ASCII-only applications, Ken Thompson and Rob Pike created a Unicode Transformation Format: UTF-8. Its aim was to make all the old ASCII letters like 01000001 (65, A) remain one byte, and to ensure letters represented by more than one byte could not be confused with multiple ASCII letters.

It does this, in brief, in two ways. It keeps the old ASCII characters as only one byte, each with a 0 in the left-most bit. And it translates each Unicode letter which needs more than one byte by placing a 1 in the left-most bit of each of its bytes. A fuller explanation is here.

For example, the Cyrillic letter “ҩ”, in UTF-8, is represented by 11010010 10101001. Because there is a 1 at the front of each byte, an ASCII-only application will never confuse the bytes for ASCII letters. And any application that does support UTF-8 (many, many applications) will understand it for what it is: the Cyrillic letter “ҩ”.
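
The following Python sketch checks those bytes and rebuilds them by hand. The hand-built version assumes the standard two-byte UTF-8 layout (110xxxxx 10xxxxxx), which only covers code points from 128 to 2047; the variable names are illustrative.

```python
letter = "\u04a9"   # the Cyrillic letter ҩ
utf8 = letter.encode("utf-8")
utf8_bits = [format(b, "08b") for b in utf8]
print(utf8_bits)    # ['11010010', '10101001']

# Every byte of a multi-byte character has its left-most bit set to 1.
all_high = all(b & 0b10000000 for b in utf8)
print(all_high)     # True

# Old ASCII letters are unchanged: still one byte with a leading 0.
ascii_a = "A".encode("utf-8")
print(format(ascii_a[0], "08b"))   # 01000001

# Build the two-byte form by hand: 110xxxxx 10xxxxxx.
cp = ord(letter)
by_hand = bytes([
    0b11000000 | (cp >> 6),          # top five bits of the code point
    0b10000000 | (cp & 0b00111111),  # bottom six bits of the code point
])
print(by_hand == utf8)   # True
```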

What this means for casual users

From the users’ point of view, not much. The only thing to do is use UTF-8 applications to read UTF-8 text. For example, many websites (that is, the text files that make up a website) are encoded in UTF-8. As long as your web browser supports UTF-8, and all the ones released since 2000 or so will, you can view that website perfectly well. The same is true for viewing documents in Open Office or Microsoft Word, for example.

95% of the time, detecting UTF-8 text, converting it to pure Unicode, and then displaying the Unicode all happens transparently to the user. The only thing the user needs is a font that can display all the letters required for all the different languages. If you’re using OS X, Windows or a modern Linux/Unix desktop, the stock fonts will support them fine.

What this means for developers (should your language not deal with it automatically) or command line Unix/Linux users

If your program only deals with bytes (i.e. not the multiple bytes needed to display Unicode characters) you may be, surprisingly, okay.

If you’re comparing two strings, byte by byte, then your comparison will still work, even though one byte no longer represents one letter. If you’re storing letters, byte by byte, to a file then you’re still storing all the information. If you’re outputting bytes to somewhere, as long as that somewhere understands UTF-8, your output will be fine. Etc.
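
As a rough illustration in Python: byte-by-byte comparison and storage never need to know where one character ends and the next begins. The sample strings and the file name are made up for this sketch.

```python
import os
import tempfile

# Two copies of the same UTF-8 text, and one different text, all as raw bytes.
a = "\u04a9 is Cyrillic".encode("utf-8")
b = "\u04a9 is Cyrillic".encode("utf-8")
c = "A is Latin".encode("utf-8")

same = a == b        # byte-by-byte comparison still works
different = a != c
print(same, different)   # True True

# Storing and reloading the bytes loses nothing.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.txt")
    with open(path, "wb") as f:
        f.write(a)
    with open(path, "rb") as f:
        restored = f.read()

print(restored.decode("utf-8") == "\u04a9 is Cyrillic")   # True
```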

If you’re counting the bytes in a string, however, you must realise each byte no longer represents one letter, one character. Problems only arise if your program needs to understand that multiple bytes may represent only one character, one letter. Generally, I’ve found this is not the case. But your mileage may vary.
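
A short Python sketch of the difference between counting bytes and counting characters. The last step assumes the input is valid UTF-8, where every continuation byte looks like 10xxxxxx; the variable names are illustrative.

```python
text = "\u04a9A"             # two characters: ҩ and A
data = text.encode("utf-8")

byte_count = len(data)       # ҩ takes two bytes, A takes one
char_count = len(text)
print(byte_count, char_count)   # 3 2

# Counting characters straight from the bytes: skip continuation bytes,
# which always look like 10xxxxxx in valid UTF-8.
counted = sum(1 for b in data if (b & 0b11000000) != 0b10000000)
print(counted)   # 2
```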

In the UNIX/Linux command line, set your locale. This allows you to work with UTF-8 characters. “locale -a” lists all the possible locales: en_HK.utf8, for example. Then, in bash, issue “export LANG=en_HK.UTF8”. Now “locale” will show you your new locale. Remember to shut down X, reset the locale and start X again. Putting your new locale in /etc/profile, or similar, is a good idea.
