A Unicode (UTF-32) character of up to 31 bits is encoded as follows:

Unicode (hex)     Byte 1   Byte 2   Byte 3   Byte 4   Byte 5   Byte 6
00000000-0000007f 0xxxxxxx
00000080-000007ff 110xxxxx 10xxxxxx
00000800-0000ffff 1110xxxx 10xxxxxx 10xxxxxx
00010000-001fffff 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
00200000-03ffffff 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
04000000-7fffffff 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

Each 'x' is a single bit of the character's Unicode (UTF-32) value.
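
As a concrete sketch of how the table translates to code, here is a minimal C encoder (the function name utf8_encode is my own invention, not from any particular library):

    #include <stdint.h>

    /* Encode one character value (up to 31 bits, per the table above)
       into buf, which must hold at least 6 bytes.  Returns the number
       of bytes written, or 0 if the value does not fit in 31 bits.
       Each branch is taken only when the value does not fit the
       previous one, so this always emits the shortest form.  A stricter
       encoder would also reject UTF-16 surrogates (see below). */
    static int utf8_encode(uint32_t c, unsigned char *buf) {
        if (c < 0x80) {                       /* 0xxxxxxx */
            buf[0] = c;
            return 1;
        } else if (c < 0x800) {               /* 110xxxxx 10xxxxxx */
            buf[0] = 0xC0 | (c >> 6);
            buf[1] = 0x80 | (c & 0x3F);
            return 2;
        } else if (c < 0x10000) {             /* 1110xxxx 10xxxxxx 10xxxxxx */
            buf[0] = 0xE0 | (c >> 12);
            buf[1] = 0x80 | ((c >> 6) & 0x3F);
            buf[2] = 0x80 | (c & 0x3F);
            return 3;
        } else if (c < 0x200000) {            /* 4-byte form */
            buf[0] = 0xF0 | (c >> 18);
            buf[1] = 0x80 | ((c >> 12) & 0x3F);
            buf[2] = 0x80 | ((c >> 6) & 0x3F);
            buf[3] = 0x80 | (c & 0x3F);
            return 4;
        } else if (c < 0x4000000) {           /* 5-byte form */
            buf[0] = 0xF8 | (c >> 24);
            buf[1] = 0x80 | ((c >> 18) & 0x3F);
            buf[2] = 0x80 | ((c >> 12) & 0x3F);
            buf[3] = 0x80 | ((c >> 6) & 0x3F);
            buf[4] = 0x80 | (c & 0x3F);
            return 5;
        } else if (c < 0x80000000) {          /* 6-byte form */
            buf[0] = 0xFC | (c >> 30);
            buf[1] = 0x80 | ((c >> 24) & 0x3F);
            buf[2] = 0x80 | ((c >> 18) & 0x3F);
            buf[3] = 0x80 | ((c >> 12) & 0x3F);
            buf[4] = 0x80 | ((c >> 6) & 0x3F);
            buf[5] = 0x80 | (c & 0x3F);
            return 6;
        }
        return 0;
    }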

You must use the shortest possible encoding for each character, so that a given character can appear with only one encoding; otherwise unwanted characters can sneak past security filters. UTF-16 "surrogate characters" are not allowed (though they appear everywhere due to stupid translation software). I recommend that illegal sequences of bytes be treated as though each byte were a character in the range 0x80-0xff. This would allow most ISO-8859-1 text to be passed as UTF-8 without any changes. Not everybody agrees with this idea, as it can reintroduce security problems, but I recommend that no software assign special meaning (i.e., as separators) to any byte value of 0x80 or greater.
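
Here is a decode-side sketch of these rules in C; the helper name utf8_decode is hypothetical, and only the 1-3 byte forms are handled (the longer forms follow the same pattern). It rejects overlong encodings and surrogates, and falls back to treating an illegal byte as a character in the range 0x80-0xff:

    #include <stdint.h>

    /* Read one character starting at p (with end marking the end of the
       buffer), store its value in *out, and return the number of bytes
       consumed.  Any illegal sequence -- an overlong encoding, a UTF-16
       surrogate, a stray continuation byte, or a truncated sequence --
       is treated as a single character in the range 0x80-0xff. */
    static int utf8_decode(const unsigned char *p, const unsigned char *end,
                           uint32_t *out) {
        unsigned char c = p[0];
        if (c < 0x80) { *out = c; return 1; }        /* plain ASCII */
        if (c >= 0xC2 && c <= 0xDF && end - p >= 2 &&
            (p[1] & 0xC0) == 0x80) {
            /* 2-byte form; lead bytes 0xC0-0xC1 would be overlong */
            *out = ((uint32_t)(c & 0x1F) << 6) | (p[1] & 0x3F);
            return 2;
        }
        if (c >= 0xE0 && c <= 0xEF && end - p >= 3 &&
            (p[1] & 0xC0) == 0x80 && (p[2] & 0xC0) == 0x80) {
            uint32_t v = ((uint32_t)(c & 0x0F) << 12) |
                         ((uint32_t)(p[1] & 0x3F) << 6) | (p[2] & 0x3F);
            /* accept only shortest-form, non-surrogate values */
            if (v >= 0x800 && (v < 0xD800 || v > 0xDFFF)) {
                *out = v;
                return 3;
            }
        }
        *out = c;   /* illegal: pass the byte through as a character */
        return 1;
    }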

Among the interesting properties of UTF-8:

- Sorting UTF-8 strings bytewise produces the same ordering that sorting the corresponding UTF-32 strings would produce.
- Every continuation byte starts with the bits 10, making it easy to find the character divisions.
- No ASCII character, no control character, and neither 0xFE nor 0xFF can be confused with any part of a multi-byte character, so UTF-8 strings can be passed through strcpy() and other byte-oriented string manipulation software.
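
A tiny self-contained demo of the first and last of these properties, using only ordinary byte-oriented string functions:

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* "é" is U+00E9 (UTF-8 bytes C3 A9); "€" is U+20AC (E2 82 AC). */
        const char *a = "\xC3\xA9";
        const char *b = "\xE2\x82\xAC";
        char copy[8];
        strcpy(copy, b);                     /* byte-oriented copy just works */
        /* bytewise comparison orders the strings the same way their
           UTF-32 values (0x00E9 < 0x20AC) would: */
        printf("%d\n", strcmp(a, copy) < 0); /* prints 1 */
        return 0;
    }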

I believe the only reason UTF-8 is not used everywhere is political correctness: some people think it is unfair that English gets the shorter characters and that we should all have equal-sized characters to demonstrate world equality. In reality, due to the use of spaces, numbers, and embedded English words, text in almost every language in the world is shorter in UTF-8 than in UTF-16!

UTF-8 will probably see wider use, since it is the default character encoding for XML.

Lots of information at http://www.cl.cam.ac.uk/~mgk25/unicode.html