
CharacterSet

As stated on WhatIsGlobalization, historically computers have mostly been used in the United States, and so most software has assumed English. The most popular character sets, ASCII and EBCDIC, are based on the English alphabet, which is actually the Latin alphabet. A CharacterSet is an array of glyphs, or pictures, keyed by a character code; for instance, character code 65 in ASCII maps to the glyph 'A'. In the case of ASCII (ISO 646), there are only 128 positions (7 bits) in the array. (EBCDIC, though it uses a full byte, is similarly tied to the Latin alphabet.) This is barely enough to accommodate the upper- and lower-case alphabets, the variety of punctuation we use, and a small set of control characters needed for electronic transmission. ISO 646 did allow 12 character positions to be adapted to local usage; however, there are vastly more glyphs in the world than that.
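
For instance, in Perl the built-in ord and chr functions expose this mapping directly:

  print ord('A'), "\n";   # the character 'A' has code 65
  print chr(65), "\n";    # code 65 maps back to the glyph 'A'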

The solution taken by engineers on the IBM PC architecture was "code pages". A code page is also an array of glyphs, just like ASCII, but 256 characters (8 bits, one byte) long. The additional 128 characters map to localized characters, such as the following variations on the letter 'A' used in various European languages: ÀÁÂÃÄÅ. However, each code page maps to one particular locale's requirements: code page 437 (CP437) is US English, CP860 is Portuguese, and CP863 is Canadian French. This system works well for many scripts throughout Europe, North America and South America.
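
As a rough illustration, Perl's standard Encode module (which carries tables for these code pages) can decode the same byte value under two different code pages, yielding two different glyphs:

  use Encode qw(decode);
  binmode STDOUT, ':encoding(UTF-8)';
  my $byte = "\x{8E}";                     # one byte value above the ASCII range
  print decode('cp437', $byte), "\n";      # one accented letter under the US English code page
  print decode('cp860', $byte), "\n";      # a different letter under the Portuguese code page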

However, the single-byte character set (SBCS) system of code pages is wholly inadequate for scripts with a large number of glyphs, such as Japanese, which has thousands of characters. The solution chosen was a multi-byte character set (MBCS), such as CP932 for Japanese. In this model, a number of byte values in the base page are marked as "shift" codes; a shift byte indicates that the next byte should index into an extension page for the actual glyph.
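
A small Perl sketch of the lead-byte mechanism, assuming the standard Encode module's CP932 tables:

  use Encode qw(decode);
  # In CP932, 0x82 is a lead ("shift") byte: together with the following
  # byte it selects a single character from an extension page.
  my $bytes = "\x{82}\x{A0}";               # the hiragana character U+3042
  my $char  = decode('cp932', $bytes);
  printf "%d bytes, %d character\n", length($bytes), length($char);   # 2 bytes, 1 character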

MBCS is very difficult to program for, as it is not clear whether a character is one or two bytes long. It is no longer trivial to do string operations such as finding the fourth character in a string; long and error-prone manipulation routines must be used every time. Since it is second nature for many native English-speaking programmers to write code as if they only had to deal with an SBCS, much code has to be adapted, dramatically increasing internationalization costs.
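
For example, naively indexing the bytes of a CP932 string lands in the middle of a character; a rough Perl sketch:

  use Encode qw(decode);
  my $bytes = "ab\x{82}\x{A0}cd";                      # "ab", one two-byte CP932 character, "cd"
  my $wrong = substr($bytes, 3, 1);                    # byte-wise: the trail byte 0xA0, not a character
  my $right = substr(decode('cp932', $bytes), 3, 1);   # character-wise: "c", the real fourth character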

Unicode

To address this problem, as well as the growing need for a universal character set as economic globalization became prevalent, The Unicode Standard [ISBN0201616335] was developed. Unicode provides a unique code point for each character in most modern and ancient scripts. (A code point identifies the underlying character rather than any particular glyph.) Unicode can assign most values up to hexadecimal 10FFFD, the main exception being the surrogate range D800-DFFF, giving a potential 1,112,062 code points in the Unicode space. How these code points are stored is left to a number of character encodings.
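
A single code point such as U+00D6 ('Ö') is therefore stored differently by each encoding; a minimal Perl sketch, assuming the standard Encode module:

  use Encode qw(encode);
  my $char = chr(0xD6);                                    # code point U+00D6, "Ö"
  printf "UTF-8:    %vX\n", encode('UTF-8',    $char);     # C3.96    (two bytes)
  printf "UTF-16BE: %vX\n", encode('UTF-16BE', $char);     # 0.D6     (one 16-bit unit)
  printf "UTF-32BE: %vX\n", encode('UTF-32BE', $char);     # 0.0.0.D6 (one 32-bit unit)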

UTF-8, a byte-oriented encoding, is probably the main Unicode character encoding used on the web, and has the following properties:

  • ASCII text can be treated as UTF-8 without a problem, as the 128 ASCII glyphs are stored as single bytes. This vastly simplifies the conversion of existing sites to UTF-8, and also allows legacy ASCII-only browsers to read English-language sites that are emitting UTF-8.
  • Non-ASCII glyphs are encoded as multiple bytes of variable length. Most alphabetic scripts (accented Latin, Greek, Cyrillic, Hebrew, Arabic) fit into two bytes; scripts with large repertoires such as Chinese take three, and some rarer characters take four.
  • Pattern matching is simplified by the fact that the larger encodings cannot contain the smaller characters within them: a multi-byte character will never contain a valid ASCII byte, for instance. Thus, searching for a string can be done byte-wise without worrying about a match starting or ending in the middle of an encoded character; a short sketch follows this list.
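
A short sketch of that no-false-match property, assuming Perl's standard Encode module:

  use Encode qw(encode);
  # "Ä" is encoded as the two bytes C3 84; neither byte is a valid ASCII
  # value, so a byte-wise search for "A" cannot match inside it.
  my $haystack = encode('UTF-8', "\x{C4}pfel");            # "Äpfel"
  print index($haystack, 'A') == -1 ? "no false match\n" : "false match\n";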

Many modern languages and platforms, such as Windows NT, Windows 2000, and Java, use UTF-16 internally.

  • UTF-16 uses multiples of two bytes to encode each character, and as such is not backwards-compatible with ASCII.
  • UTF-16 allows simple pattern-matching, like UTF-8, as the 16-bit units of a four-byte character can never be mistaken for valid two-byte characters.
  • In early versions of Unicode, UTF-16 did not need to be a variable-length encoding, as there were only 65,536 code points; however, this is no longer the case, and the reserved range D800-DFFF is used to encode surrogate pairs representing the higher-valued code points (the arithmetic is sketched after this list).
  • Since these code paths are highly specialized, there is a risk of UTF-16 code harboring subtle bugs that will only manifest when surrogate pairs are used; as such, UTF-8 is probably the better bet for code with fewer users (like web software), since surrogate-pair bugs are unlikely to be shaken out before users run into them.
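
The surrogate-pair arithmetic itself is straightforward; a short sketch for one code point above U+FFFF (U+1D11E, MUSICAL SYMBOL G CLEF):

  my $cp   = 0x1D11E;
  my $v    = $cp - 0x10000;                 # 20-bit value
  my $high = 0xD800 + ($v >> 10);           # high surrogate: D834
  my $low  = 0xDC00 + ($v & 0x3FF);         # low surrogate:  DD1E
  printf "U+%X encodes as %04X %04X\n", $cp, $high, $low;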

For more about Unicode, see WikiPedia:Unicode, WikiPedia:UTF-8 and WikiPedia:UTF-16.


There are some errors in the UTF-16 part:

  • insofar as UTF-16 is preferred within systems, no 4-byte encoding is used; otherwise it would have neither a logical nor a performance advantage.
  • there are no 4-byte characters in UTF-16.
  • pattern matching is simpler in UTF-8 because existing algorithms can be used unchanged (e.g. Perl doesn't need to know whether strings are UTF-8 strings or single-byte strings). UTF-8 multi-byte characters do not contain ASCII characters, while UTF-16 may contain any kind of byte, ASCII and control characters included.

-- HelmutLeitner

I'll address these in order, Helmut:

  • That used to be the case, until Unicode 3.1. Then the number of code points was increased 17-fold, and UTF-16 became a variable-width encoding. It is precisely because many don't realize this that UTF-16 systems may harbor bugs. UTF-32 is now the only fixed-width Unicode encoding. Feel free to read the Wikipedia pages linked above to confirm this for yourself.

The main benefit of UTF-16 over UTF-8 is that it stores the more complex languages in two bytes rather than UTF-8's three. This is a score for perceived globalization, even if the space saving itself is pretty irrelevant in this day and age.

  • Every character in the range U+010000 to U+10FFFD is stored as four bytes in UTF-16. UCS-2 may be what you're thinking of, as it expressly forbids encoding these code points and is thus a fixed-width encoding. It also cannot express all of Unicode. Again, please feel free to check this from other sources.
  • Perl certainly does need to know whether strings are UTF-8 or not, and that's the reason it's a pain to use. Perl's pattern matcher appears to treat all strings as single-byte encodings, which means matching a range of variable-width characters is highly non-trivial. If you want to only match against ASCII code points, then sure, UTF-8 is easier, but I think that point is already made implicitly above.

The claim of "simple pattern-matching" is made against the specifics of the encoding, namely that one can tell by looking at a single byte (or byte pair, for UTF-16) whether one is at the start of a new code point or not, and hence one does not need to worry about explicitly finding code point start and end points when searching for patterns. This is not the case for other encoding schemes previously proposed, where for instance the last two bytes of a three-byte character could be a valid encoding for another code point.
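
In code, that test amounts to a single comparison per unit; a rough Perl sketch:

  # One look at the current byte (UTF-8) or 16-bit unit (UTF-16) says
  # whether a code point starts here.
  sub utf8_starts_code_point  { (ord($_[0]) & 0xC0) != 0x80 }       # not a 10xxxxxx continuation byte
  sub utf16_starts_code_point { $_[0] < 0xDC00 || $_[0] > 0xDFFF }  # not a low (trailing) surrogate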

-- ChrisPurcell

I have to agree on the issue of bugs with UTF-16 surrogate pairs; last year when I was writing a page-dump processing program in C# I had a whole series of problems with surrogate pairs in Mono's class library. They just plain didn't work all through the whole XML stack: the pairs were combined in the wrong way, detected in the wrong way, broke at buffer boundaries, etc. I'm not sure anything was actually right. :) Apparently those functions had just never actually gotten exercised before, so we got to be the guinea pigs finding and fixing the bugs.

Unfortunately UTF-8 isn't a magic bullet there either; MySQL's Unicode support for instance only allows a 3-byte subset of UTF-8, so you can only store 4-byte characters in a field with raw binary collation...

-- BrionVibber

Chris, I hope you will not be offended by what I say now. It is nothing personal, but a deep frustration over the insensitivity and silliness of the majority of English speakers and programmers towards foreign languages and their characters. Take it as part of a rant. "Why should I discuss this here with you, in a system that you maintain and that supports neither any form of Unicode nor foreign characters properly? As you can see from ÖsterreichSeite (http://www.usemod.com/cgi-bin/mb.pl?ÖsterreichSeite): it doesn't link, it doesn't sort, and a search for Österreich finds only one page instead of two." -- HelmutLeitner

I may be missing the thrust of your argument, but Chris is working to make MeatballWiki as Unicode as possible given the crappy state of the art specifically because we want to fully include Meatball's audience of non-English speakers. -- SunirShah

Thank you for bringing this problem to my attention, Helmut. I can tell you immediately why it sorts where it does: Perl is sorting the strings as if they were encoded with ISO-8859, and the first byte of the two-byte UTF-8-encoded Ö just happens to be ISO-8859-encoded Ã, which of course sorts between A and B. I can't tell you when I'll fix that, but I will.
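
A minimal sketch of that effect, using Perl's standard Encode module:

  use Encode qw(encode decode);
  binmode STDOUT, ':encoding(UTF-8)';
  my $bytes = encode('UTF-8', "\x{D6}");                    # "Ö" becomes the two bytes C3 96
  print decode('ISO-8859-1', substr($bytes, 0, 1)), "\n";   # 0xC3 read as ISO-8859-1 is "Ã"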

I can't tell you why it doesn't link, unfortunately. As far as I understood the UM engine, if it allowed the page to be created, it would form links. All I can think of off the top of my head is that Perl isn't seeing a word boundary before the Ö. Again, I can't tell you when I'll fix that, but I will.

I'm not sure which pages you were expecting Österreich to match. If this one, that's a bug in MySQL's text matching that can't cope with nested single quotes around a single word, and it affects all-ASCII search strings too. I can't promise you that I'll fix MySQL, and I hope you don't expect it of me.

Finally, I think the answer to the question, why should you discuss this here with me, in a system that I maintain and that supports neither any form of Unicode nor foreign characters properly, is precisely the same reason why I discuss this here with you, in a language that you persistently use incorrect grammar in: because we AssumeGoodFaith; despite the mistakes on both sides, we are trying to communicate, learn, and improve. To build something together. I hope we continue to do that. -- ChrisPurcell

Chris, my rant and anger are over, so I don't feel a need to go into details. Just a few points. :-) (1) I'm fully aware that my English is faulty. If I were to participate in a grammar discussion at all, I would assume that I am the learner and you are the expert, and I would hesitate to insist on my opinions. (2) My ProWiki engine has fully supported foreign characters since 2001 (and Unicode since 2004), and I can assure you that Perl doesn't need to know which encoding is used. (3) Of course, in searching for "Österreich" I would expect to find this CharacterSet page. If searching (one of the 8 fundamental wiki features) broke in the transition to MySQL, you probably should not have made the transition (from the perspective of a foreign user). Of course, an English programmer will see this as a minor issue of no importance. *But* at the moment this engine is unusable for foreign-language projects! BTW, WikiPedia seemingly handles foreign characters, MySQL and Unicode without problems. -- HelmutLeitner

Excellent! I wasn't aware you were an expert in wikis and Unicode. My apologies if I was asserting untruths about Perl: I couldn't find anything to contradict them online. Perhaps you could help me here? I wish to know what regular expressions and/or character set methodology you use to get Perl to pattern-match non-ASCII UTF-8 page titles. The best I could achieve was <tt>(?:[A-Z]|\xc3[\x80-\x9e])</tt> for upper-case letters like Ö, and <tt>(?:[a-z]|\xc3[\x9f-\xbf])</tt> for lower-case. This is hardly what I would call "not needing to know what encoding is used", so I assume you have a better alternative? (I'm ignoring WikiPedia in this case, because it doesn't use CamelCase.) -- ChrisPurcell

Well, we've put a lot of work into that; we finally finished transitioning some of our largest sites (especially en.wikipedia.org) to Unicode only in mid-2005. MySQL didn't support Unicode at all when we started, and we had to have it treat everything as binary data to use UTF-8. This required a lot of fudging and transformation to get the MySQL-based search to work; incorrect case-folding and word-boundary detection had to be worked around by rolling things into ASCII-friendly garbage. (We now use a custom search backend based on Apache Lucene.) MySQL today supports Unicode, but only a subset, so we still can't transition to "native" Unicode support in MySQL... But it's probably enough for most people (who don't have a wiki in ancient Gothic script or one with dozens of pages on obscure rare Chinese characters...)

PHP unfortunately is still a pretty Unicode-hostile environment, and we had to write a bunch of annoying special-case code. Maybe Perl's better there, maybe not, but you probably do have to be careful about code that assumes locale encoding or other 8-bit-isms. -- BrionVibber

Brion, that's interesting and surprising! I have always admired MediaWiki for its Unicode handling and had assumed that the switches to MySQL and PHP were made because of the better Unicode support these environments offer. I did not imagine that you had to work around such obstacles and rough edges. I wouldn't argue now that a Perl/filesystem approach is better. It is just simpler and lower level, so when you have problems it is pretty clear where they come from and how to work around them. -- HelmutLeitner

Chris, I'll publish the ProWiki source in two weeks (hopefully on Mar15, on sourceforge, under the GPL), so you can draw from it. The foreign-language parts are just a few lines of code. I also wouldn't say that I'm an expert on these issues, because I'm not generally interested in the theory or in all the issues. I'm just pragmatic in my approach to supporting communities, and I aim for sufficiently working code in the simplest way possible. -- HelmutLeitner

I've fixed the "doesn't link" problem for ÖsterreichSeite by changing <tt>\b</tt> (Perl's word-boundary match) to <tt>(?<![A-Za-z])(?<!\xc3[\x80-\xbf])</tt>, i.e. I've had to hard-code what a word-boundary is in UTF-8. Not sure about the sorting problem, will need actual Perl support for that somehow.

Update: I've solved the problems in a "better" way in an experimental version of the script, using the advice on [Unicode-processing issues in Perl]. I've actually hit upon that page several times, but it was only this time around that I had sufficient knowledge of all the issues (e.g. Perl's <tt>\x{1f}</tt> actually emits illegal UTF-8 even though <tt>\x{100}</tt> and upwards don't, "for backwards compatibility") to be able to get it all working. You're exceptionally lucky if you only need a few lines of code to make Perl unicode-safe, as far as I can see. Using a database, or using CGI.pm, will hurt you.
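
In rough terms, that approach looks like the following sketch (the variable names and the sample regex are illustrative, not the actual script):

  use Encode qw(decode encode);
  # Illustrative input: the UTF-8 bytes for "ÖsterreichSeite" as they might
  # arrive from CGI or the page store.
  my $raw_bytes = "\x{C3}\x{96}sterreichSeite";
  my $text = decode('UTF-8', $raw_bytes);          # bytes in -> character string
  # On character strings, Unicode-aware classes replace hard-coded byte ranges:
  if ($text =~ /(?<!\p{Alphabetic})\p{Uppercase_Letter}\p{Lowercase_Letter}+/) {
      print "word starts with an upper-case letter, Ö included\n";
  }
  print encode('UTF-8', $text), "\n";              # character string -> bytes out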

I've also fixed the sorting problem. However, I'm not willing to use the experimental site until I've checked all the bugs have been worked out. For example, thanks to a CGI/Unicode compatibility bug (CGI does not treat input as UTF-8), the Username field was silently corrupting names like TëstÜser — but only when one did a preview. -- ChrisPurcell


CategoryGlobalization

