April 27, 2003

Unicode

Tim Bray has a rather wonderful exposition of Unicode and its UTF encodings: Characters vs. Bytes:

Processing UTF-8 characters sequentially is about as efficient, for practical purposes, as any other encoding.
There is one exception: you can't easily index into a buffer. If you need the 27th character, you're going to have to run through the previous twenty-six characters to figure out where it starts... UTF-8 also has the advantage that null-termination, and all the old routines like strcpy, strncpy and their friends, which in practice are insanely efficient in terms of space and time, work just fine.
Looking forward to the next instalment, wherein it sounds like he'll have some harsh words for the Unicode-is-16-bit-chars brigade.
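The indexing point in the quote is easy to see for yourself. A quick sketch (in modern Python, with a made-up `char_offset` helper for illustration): byte offsets and character offsets diverge as soon as non-ASCII appears, so finding the Nth character means walking the buffer from the start, letting each lead byte tell you how long its sequence is.

```python
text = "naïve café"
data = text.encode("utf-8")

print(len(text))   # 10 characters (code points)
print(len(data))   # 12 bytes: ï and é each take two bytes in UTF-8

def char_offset(buf: bytes, n: int) -> int:
    """Byte offset where the (n+1)th character of a UTF-8 buffer starts."""
    offset = 0
    for _ in range(n):
        lead = buf[offset]
        if lead < 0x80:       # ASCII: one byte
            offset += 1
        elif lead < 0xE0:     # two-byte sequence
            offset += 2
        elif lead < 0xF0:     # three-byte sequence
            offset += 3
        else:                 # four-byte sequence
            offset += 4
    return offset

print(char_offset(data, 3))  # → 4: 'n' and 'a' are one byte, 'ï' is two
```

No table of offsets, no scanning backwards: just sequential, self-synchronising bytes, which is exactly why the old C string routines keep working.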

In practice, character encoding will only give you problems if you ignore it. Don't ignore it: at each point in your application, know which encoding you're working with (especially when you're looking at data on the wire or on disk). Is this a UTF-8 byte array? An ISO-8859-1 byte array? A string of BMP Unicode code points? Something more exotic? It doesn't matter which, as long as you know which. If there's any ambiguity (about the codepoint encoding, or about higher-level encodings: e.g. not knowing whether a string is XML entity-encoded or not, as I saw recently), your application is broken.
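The "know what it is" rule boils down to decoding exactly once at the boundary, with a named encoding, and never guessing. A minimal sketch in modern Python (the Latin-1 example data is invented for illustration):

```python
# Pretend these bytes arrived off the wire, and the producer's spec
# says they are ISO-8859-1.
raw = "São Paulo".encode("iso-8859-1")

# Wrong: guessing. These bytes happen not to be valid UTF-8, so a
# strict decoder refuses; a lax one would silently give you mojibake.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print("not UTF-8:", e)

# Right: decode with the encoding you were told, once, at the edge.
city = raw.decode("iso-8859-1")
assert city == "São Paulo"

# Inside the program everything is a Unicode string; on the way out,
# encode with whatever the consumer declared it expects.
out = city.encode("utf-8")
```

The failure mode here is the lucky case: many wrong guesses decode without error and the corruption only shows up later, far from the boundary where the mistake was made.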

Now, someone please, give fonts the same treatment. I had such problems with Windows fonts last time I needed to deal with Asian characters. Has it got better?