WHAT IS UTF-8
This document contains a minimal collection of information to help you understand UTF-8 (which is the encoding used by this homepage). For more information, please check out the Unicode Consortium website.

UTF-8 is a popular encoding form of the Unicode/ISO-10646 standard. Don't worry if that doesn't make much sense to you yet; read on and things will become clear.

In the early days, there were two independent attempts to create a unified character set. One was the ISO 10646 project of the International Organization for Standardization (ISO); the other was the Unicode Project, organized by a consortium later known as the Unicode Consortium. ISO came up with a standard called ISO 10646, which defines a huge 31-bit Universal Character Set (UCS). A 16-bit subset of UCS (which contains 65534 characters) is called the Basic Multilingual Plane (BMP) and is the part that was populated first. The Unicode Consortium, on the other hand, was working on its own standard, called the Unicode standard.

Having two independent standards is certainly not something people would call "unified". Both ISO and the Unicode Consortium realized that and decided to join their efforts in 1991. Since then, new versions of the Unicode standard have been kept fully compatible and synchronized with the corresponding versions of ISO 10646. All characters are located at the same positions and have the same names in both standards.

Theoretically, the 31-bit UCS can contain about two billion characters. The number of characters that are actually defined, however, is much smaller (but has been growing over time). Version 3.2 of the Unicode standard, for instance, provided codes for 95221 characters (which already goes beyond the BMP). Unicode is stable: its growth is strictly additive, meaning that only new characters will be added and no existing characters will be removed or renamed in the future.

Unicode and ISO 10646 are first of all code tables that assign integer numbers to characters.
Hexadecimal numbers for those integer values are commonly preceded by "U+". For instance, U+0041 is the character "Latin capital letter A".

Given the integer values, it is up to the character encoding standards (encoding forms) to define how these values are represented as a byte sequence. The Unicode standard defines three encoding forms that allow the same character data to be transmitted in a byte-, word-, or double-word-oriented format (i.e. with 8, 16, or 32 bits per code unit). These three encoding forms are called UTF-8, UTF-16 and UTF-32 respectively. There are other encoding forms defined by ISO. The abbreviation UTF stands for Unicode (or UCS) Transformation Format.

UTF-8 transforms all Unicode characters into variable-length byte sequences. It has the following properties:
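To fully encode all the 2^31 characters in UCS, a UTF-8 encoded character can be up to six bytes long, but the 16-bit BMP characters are only up to three bytes long. (The current Unicode standard limits code points to U+10FFFF, so UTF-8 as used today never needs more than four bytes.) The following formats of byte sequence are used to represent a character in UTF-8:

Code number range (hex)    UTF-8 byte sequence (binary)
0000 0000 - 0000 007F      0xxxxxxx
0000 0080 - 0000 07FF      110xxxxx 10xxxxxx
0000 0800 - 0000 FFFF      1110xxxx 10xxxxxx 10xxxxxx
0001 0000 - 001F FFFF      11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0020 0000 - 03FF FFFF      111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0400 0000 - 7FFF FFFF      1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
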
The "xxx" bits are filled with the bits of the character code number as assigned by the Unicode standard. For instance, the Unicode character U+2260 = 0010 0010 0110 0000 (the symbol "not equal to", ≠) belongs to the 3rd category in the table (1110xxxx 10xxxxxx 10xxxxxx). Filling these 16 bits into the 16 "x" positions of that format, one obtains the UTF-8 encoding of the character:

11100010 10001001 10100000

Notice that the bits following the prefixes 1110, 10 and 10 (that is, 0010, 001001 and 100000) are exactly the 16 bits assigned by the Unicode standard.

As mentioned before, UTF-8 is a popular encoding form for Unicode. Why is that? The reason is that all ASCII characters are encoded as a single byte in UTF-8, which makes it not only fully backward compatible with ASCII but also space efficient for US and many European users. Compared with the traditional encodings, UTF-8 costs no extra space for US ASCII, only a few percent more for ISO-8859-1 (aka Latin-1, which covers most West European languages), 50% more for Chinese/Japanese/Korean, and 100% more for Greek and Cyrillic. By comparison, UTF-16 costs no extra space for Chinese/Japanese/Korean, but 100% more for US ASCII, ISO-8859-1, Greek and Cyrillic. UTF-32 is a fixed-width encoding that takes the most space. Since US and West European users account for most internet users, and English accounts for most of the information distributed on the web (at the time of this writing), UTF-8 has quickly become the most popular Unicode encoding form for the web.

Finally, a note about a universal way to enter Unicode characters in a web document. For instance, to input U+2014 (the em dash "—"), one can use either "&#x2014;" (the "x" means that what follows is in hexadecimal form) or "&#8212;" (8212 is hexadecimal 2014 in decimal form). Any Unicode character can be entered in this form (not a convenient way, but helpful to know if you don't have any language-specific input software).
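
To make the bit-filling procedure described above concrete, here is a minimal sketch in Python. It is only an illustration, not a reference implementation: the names utf8_encode and as_bits are made up for this example, and the function follows the original scheme of up to six bytes without rejecting values that the current standard excludes (surrogates and anything above U+10FFFF).

```python
# Each entry: (exclusive upper limit, sequence length, prefix bits of the first byte).
# These mirror the six byte-sequence formats of the original UTF-8 scheme.
_FORMATS = (
    (0x80, 1, 0x00),        # 0xxxxxxx
    (0x800, 2, 0xC0),       # 110xxxxx 10xxxxxx
    (0x10000, 3, 0xE0),     # 1110xxxx 10xxxxxx 10xxxxxx
    (0x200000, 4, 0xF0),    # 11110xxx + 3 continuation bytes
    (0x4000000, 5, 0xF8),   # 111110xx + 4 continuation bytes
    (0x80000000, 6, 0xFC),  # 1111110x + 5 continuation bytes
)


def utf8_encode(code_point: int) -> bytes:
    """Encode one code number (hypothetical helper, up to six bytes)."""
    if not 0 <= code_point < 2**31:
        raise ValueError("code number outside the 31-bit UCS range")
    # Pick the first format whose range contains the code number.
    length, prefix = next(
        (n, p) for limit, n, p in _FORMATS if code_point < limit
    )
    # Peel off six bits at a time for the continuation bytes (10xxxxxx),
    # starting with the lowest bits, which belong to the last byte.
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (code_point & 0x3F))
        code_point >>= 6
    # Whatever bits remain go into the first byte after its prefix.
    return bytes([prefix | code_point] + tail[::-1])


def as_bits(data: bytes) -> str:
    """Show a byte sequence as space-separated groups of eight bits."""
    return " ".join(f"{byte:08b}" for byte in data)


if __name__ == "__main__":
    print(as_bits(utf8_encode(0x2260)))  # 11100010 10001001 10100000
    # Cross-check against Python's built-in codec.
    assert utf8_encode(0x2260) == "\u2260".encode("utf-8")
```

For code numbers that today's UTF-8 allows, the result should agree with Python's own str.encode("utf-8"), which is a convenient way to check the sketch.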