A Unicode (UTF-32) character of up to 31 bits is encoded as follows:

Unicode (hex)     Byte 1   Byte 2   Byte 3   Byte 4   Byte 5   Byte 6
00000000-0000007f 0xxxxxxx
00000080-000007ff 110xxxxx 10xxxxxx
00000800-0000ffff 1110xxxx 10xxxxxx 10xxxxxx
00010000-001fffff 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
00200000-03ffffff 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
04000000-7fffffff 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

Each 'x' is a single bit of the character's Unicode (UTF-32) value.
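
As a concrete sketch of how the table translates to code, here is a minimal C encoder (the function name utf8_encode is my own invention, not from any particular library):

    #include <stdint.h>

    /* Encode one character value (up to 31 bits, per the table above)
       into buf, which must hold at least 6 bytes.  Returns the number
       of bytes written, or 0 if the value does not fit in 31 bits.
       Each branch is taken only when the value does not fit the
       previous one, so this always emits the shortest form.  A stricter
       encoder would also reject UTF-16 surrogates (see below). */
    static int utf8_encode(uint32_t c, unsigned char *buf) {
        if (c < 0x80) {                       /* 0xxxxxxx */
            buf[0] = c;
            return 1;
        } else if (c < 0x800) {               /* 110xxxxx 10xxxxxx */
            buf[0] = 0xC0 | (c >> 6);
            buf[1] = 0x80 | (c & 0x3F);
            return 2;
        } else if (c < 0x10000) {             /* 1110xxxx 10xxxxxx 10xxxxxx */
            buf[0] = 0xE0 | (c >> 12);
            buf[1] = 0x80 | ((c >> 6) & 0x3F);
            buf[2] = 0x80 | (c & 0x3F);
            return 3;
        } else if (c < 0x200000) {            /* 4-byte form */
            buf[0] = 0xF0 | (c >> 18);
            buf[1] = 0x80 | ((c >> 12) & 0x3F);
            buf[2] = 0x80 | ((c >> 6) & 0x3F);
            buf[3] = 0x80 | (c & 0x3F);
            return 4;
        } else if (c < 0x4000000) {           /* 5-byte form */
            buf[0] = 0xF8 | (c >> 24);
            buf[1] = 0x80 | ((c >> 18) & 0x3F);
            buf[2] = 0x80 | ((c >> 12) & 0x3F);
            buf[3] = 0x80 | ((c >> 6) & 0x3F);
            buf[4] = 0x80 | (c & 0x3F);
            return 5;
        } else if (c < 0x80000000) {          /* 6-byte form */
            buf[0] = 0xFC | (c >> 30);
            buf[1] = 0x80 | ((c >> 24) & 0x3F);
            buf[2] = 0x80 | ((c >> 18) & 0x3F);
            buf[3] = 0x80 | ((c >> 12) & 0x3F);
            buf[4] = 0x80 | ((c >> 6) & 0x3F);
            buf[5] = 0x80 | (c & 0x3F);
            return 6;
        }
        return 0;
    }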

You must use the shortest possible encoding for each character, so that a given character can appear with only one encoding; otherwise unwanted characters can sneak past security filters. UTF-16 "surrogate characters" are not allowed (though they appear everywhere due to stupid translation software). I recommend that illegal sequences of bytes be treated as though each byte were a character in the range 0x80-0xff. This would allow most ISO-8859-1 text to be passed as UTF-8 without any changes. Not everybody agrees with this idea, as it can reintroduce security problems, but I recommend that no software assign special meaning (i.e., as separators) to any byte value of 0x80 or greater.
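
Here is a decode-side sketch of these rules in C; the helper name utf8_decode is hypothetical, and only the 1-3 byte forms are handled (the longer forms follow the same pattern). It rejects overlong encodings and surrogates, and falls back to treating an illegal byte as a character in the range 0x80-0xff:

    #include <stdint.h>

    /* Read one character starting at p (with end marking the end of the
       buffer), store its value in *out, and return the number of bytes
       consumed.  Any illegal sequence -- an overlong encoding, a UTF-16
       surrogate, a stray continuation byte, or a truncated sequence --
       is treated as a single character in the range 0x80-0xff. */
    static int utf8_decode(const unsigned char *p, const unsigned char *end,
                           uint32_t *out) {
        unsigned char c = p[0];
        if (c < 0x80) { *out = c; return 1; }        /* plain ASCII */
        if (c >= 0xC2 && c <= 0xDF && end - p >= 2 &&
            (p[1] & 0xC0) == 0x80) {
            /* 2-byte form; lead bytes 0xC0-0xC1 would be overlong */
            *out = ((uint32_t)(c & 0x1F) << 6) | (p[1] & 0x3F);
            return 2;
        }
        if (c >= 0xE0 && c <= 0xEF && end - p >= 3 &&
            (p[1] & 0xC0) == 0x80 && (p[2] & 0xC0) == 0x80) {
            uint32_t v = ((uint32_t)(c & 0x0F) << 12) |
                         ((uint32_t)(p[1] & 0x3F) << 6) | (p[2] & 0x3F);
            /* accept only shortest-form, non-surrogate values */
            if (v >= 0x800 && (v < 0xD800 || v > 0xDFFF)) {
                *out = v;
                return 3;
            }
        }
        *out = c;   /* illegal: pass the byte through as a character */
        return 1;
    }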

Among the interesting properties of UTF-8:

- Sorting UTF-8 strings bytewise produces the same ordering that sorting the corresponding UTF-32 strings would produce.
- Every continuation byte starts with the bits 10, making it easy to find the character divisions.
- No ASCII character, no control character, and neither 0xFE nor 0xFF can be confused with any part of a multi-byte character, so UTF-8 strings can be passed through strcpy() and other byte-oriented string manipulation software.
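
A tiny self-contained demo of the first and last of these properties, using only ordinary byte-oriented string functions:

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* "é" is U+00E9 (UTF-8 bytes C3 A9); "€" is U+20AC (E2 82 AC). */
        const char *a = "\xC3\xA9";
        const char *b = "\xE2\x82\xAC";
        char copy[8];
        strcpy(copy, b);                     /* byte-oriented copy just works */
        /* bytewise comparison orders the strings the same way their
           UTF-32 values (0x00E9 < 0x20AC) would: */
        printf("%d\n", strcmp(a, copy) < 0); /* prints 1 */
        return 0;
    }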

I believe the only reason UTF-8 is not used everywhere is political correctness: some people think it is unfair that English gets the shorter characters and that we should all have equal-sized characters to demonstrate world equality. In reality, due to the use of spaces, numbers, and embedded English words, text in almost every language in the world is shorter in UTF-8 than in UTF-16!

UTF-8 will probably see wider use, since it is the default character encoding for XML.

Lots of information at http://www.cl.cam.ac.uk/~mgk25/unicode.html