Everything2

UTF-16

created by tres equis

(idea) by spitzak (3.8 y) Sat Dec 01 2001 at 0:11:29

Unicode encoded as two bytes per character. The obvious way to do this is to put the bottom 16 bits into the two bytes (high byte first so sorting order is preserved), and this is called UCS-2. When people realized that (due to Chinese, mostly) more than 65,536 characters were needed, they came up with this bastard encoding, rather than using UTF-8, which is a sensible encoding. Microsoft uses this encoding in their stuff, sigh.

UTF-16 can encode Unicode up to 0x10ffff. All codes up to 0xffff that are not in the range 0xd800-0xdfff are encoded as two bytes, high byte first, low byte second.

The "characters" 0xd800-0xdfff are called "surrogate characters" and must appear in pairs: one from 0xd800-0xdbff followed by one from 0xdc00-0xdfff. Each pair is combined in a complex way to produce one character in the range 0x10000 through 0x10ffff. They also defeat the only plausible advantage of UTF-16, which is that all the characters are the same size!
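To make the rules above concrete, here is a minimal Python sketch of the big-endian encoding described in this writeup. The function name encode_utf16_be is made up for illustration; Python's built-in "utf-16-be" codec already does the same job, and the sketch only exists to show where the surrogates come from.

```python
def encode_utf16_be(code_point: int) -> bytes:
    """Encode one Unicode code point as big-endian UTF-16 (illustrative helper)."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded on their own")
    if code_point <= 0xFFFF:
        # Plain case: two bytes, high byte first.
        return code_point.to_bytes(2, "big")
    # Above 0xffff: subtract 0x10000, then split the remaining 20 bits
    # into two 10-bit halves carried by a high and a low surrogate.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return high.to_bytes(2, "big") + low.to_bytes(2, "big")


print(encode_utf16_be(0x41).hex())     # 0041
print(encode_utf16_be(0x1F600).hex())  # d83dde00
```

The result agrees with chr(0x1F600).encode("utf-16-be"), which is what you should use in real code.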

Don't use this, it is just proof that the standards people have their heads up their asses. Use UTF-8 instead.


(thing) by tongpoo (3.1 mon) Thu Mar 27 2003 at 23:07:29

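This Unicode Transformation Format serializes each Unicode value as two bytes or, for values above U+FFFF, four bytes (a surrogate pair). UTF-16 can be in either little-endian or big-endian format. An initial byte sequence called the byte order mark (BOM) may be placed at the start of the data to signal which one is in use; it is optional, and data without a BOM is assumed to be big-endian unless a higher-level protocol says otherwise. The BOM is U+FEFF ZERO WIDTH NO-BREAK SPACE (so it does nothing when displayed), and its byte sequence depends on the byte order: FE FF in big-endian UTF-16 and FF FE in little-endian UTF-16. To prevent ambiguity, U+FFFE is a noncharacter that will never be assigned, so a reader that decodes the first two bytes as U+FFFE knows it has the byte order backwards.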
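As a rough illustration of how a reader might use the BOM, here is a small Python sketch. The helper names detect_utf16_byte_order and decode_utf16 are hypothetical, and the fallback to big-endian when no BOM is present is an assumption that matches the usual default, not something every data source honors.

```python
from typing import Optional

BOM_BIG_ENDIAN = b"\xfe\xff"     # U+FEFF stored high byte first
BOM_LITTLE_ENDIAN = b"\xff\xfe"  # U+FEFF stored low byte first


def detect_utf16_byte_order(data: bytes) -> Optional[str]:
    """Return "big", "little", or None if the data has no UTF-16 BOM."""
    if data.startswith(BOM_BIG_ENDIAN):
        return "big"
    if data.startswith(BOM_LITTLE_ENDIAN):
        # Caveat: FF FE 00 00 is also the UTF-32 little-endian BOM;
        # this sketch assumes the data is known to be UTF-16.
        return "little"
    return None


def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 bytes, stripping the BOM if there is one."""
    order = detect_utf16_byte_order(data)
    if order == "little":
        return data[2:].decode("utf-16-le")
    if order == "big":
        return data[2:].decode("utf-16-be")
    # No BOM: assume big-endian (an assumption, see the note above).
    return data.decode("utf-16-be")
```

Python's own "utf-16" codec does BOM sniffing too, but when the BOM is missing it falls back to the machine's native byte order, which is why the fallback is spelled out explicitly here.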

The Unicode codespace is allocated into several areas, one being the Surrogate Area, which consists of 1,024 high surrogates (U+D800 - U+DBFF) and 1,024 low surrogates (U+DC00 - U+DFFF). A high surrogate, followed by a low surrogate, forms a surrogate pair that represents a single Unicode scalar value. Exactly 1,024 × 1,024 = 1,048,576 surrogate pairs are possible, and their values can be derived from this formula:
65536 + ((highSurrogate & 1023) << 10) + (lowSurrogate & 1023)
In plain English, it takes the last ten binary digits from each surrogate, concatenates them, and adds 2^16 (65,536) to that number. As of Version 3.0, none of the surrogate pairs have been assigned.
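The formula above translates almost word for word into Python. The function name combine_surrogates is made up for this sketch, and the range checks are an addition that the formula itself does not include.

```python
def combine_surrogates(high_surrogate: int, low_surrogate: int) -> int:
    """Turn a surrogate pair into the Unicode value it represents."""
    if not 0xD800 <= high_surrogate <= 0xDBFF:
        raise ValueError("first value is not a high surrogate")
    if not 0xDC00 <= low_surrogate <= 0xDFFF:
        raise ValueError("second value is not a low surrogate")
    # Same arithmetic as the formula: keep the low ten bits of each
    # surrogate, join them into a 20-bit number, then add 65536.
    return 65536 + ((high_surrogate & 1023) << 10) + (low_surrogate & 1023)


print(hex(combine_surrogates(0xD800, 0xDC00)))  # 0x10000, the first pair
print(hex(combine_surrogates(0xDBFF, 0xDFFF)))  # 0x10ffff, the last pair
```

The first and last pairs land exactly on U+10000 and U+10FFFF, which is where the upper limit of Unicode comes from.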

UTF-16 can save about a byte per character over UTF-8 when encoding East Asian text, because most Chinese, Japanese, and Korean characters take three bytes in UTF-8 but only two in UTF-16.
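The size difference is easy to check with Python's built-in codecs. The helper byte_counts and the sample strings are only illustrative; "utf-16-be" is used so that no BOM is counted.

```python
def byte_counts(text: str) -> tuple:
    """Return (UTF-8 size, UTF-16 size) in bytes for the given text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))


print(byte_counts("日本語"))  # (9, 6): three bytes each in UTF-8, two in UTF-16
print(byte_counts("hello"))   # (5, 10): for ASCII text the saving is reversed
```

Real documents usually mix East Asian text with ASCII markup, digits, and spaces, so the actual saving depends on the content.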

Sources (PDF and PowerPoint files):
  • "The Unicode Standard, Version 3.0" Section 2.3, Encoding Forms.
    http://www.Unicode.org/book/ch02.pdf

  • "The Unicode Standard, Version 3.0" Section 3.7, Surrogates.
    http://www.Unicode.org/book/ch03.pdf

  • "The Unicode Standard, Version 3.0" Section 5.4, Handling Surrogate Pairs.
    http://www.Unicode.org/book/ch05.pdf

  • "Surrogate Support in Microsoft Products."
    http://www.Unicode.org/iuc/iuc18/papers/a8.ppt
