Let there be Unicode

So a few weeks ago, I was reading this essay by Ramsey Nasser. I’ve debated back and forth several times as to whether Smile should keep its strings and identifiers as 8-bit characters, or whether they should be upgraded to full Unicode. It’s a tough question.

On the one hand, I’m trying to build a language that will be able to grow well with the needs of the future, and the future argues for Unicode. There are a lot of people out there, and not all of them speak English. Or read or write English. As Nasser notes, Arabic is poorly supported by, well, everything, and there are only a few bajillion people out there speaking Arabic. (It so happens that Arabic is possibly a pathologically bad worst case for programming language support, too, since it’s a proportional cursive writing system and not fixed-width, with concepts like initial, medial, and final forms, instead of a single letterform per phoneme, and for the icing on the cake, it writes in the dead opposite direction of most other languages on planet Earth.) And beyond Arabic there’s Chinese and Russian and Devanagari and Japanese and a thousand others, and a really good future-proof programming language ought to be able to support all that in a very natural, native kind of way.

And on the other hand, there’s also this nasty problem that the vast majority of the text files you’ll bump into in developed countries are, at best, ASCII or ISO Latin-1. There are a handful of locale-specific encodings that are used, and UTF-8 is growing, but the Intarwebz have done a pretty good job of dragging everybody else’s symbologies down to fit in 26 alphabetic letters, for better or for worse. Français is losing its cédilles, español is losing its tilde, and really scary things are happening to languages like Russian and Hebrew and Greek to squeeze into the 8-bit world. Because so much is in 8-bit, it’s a bit disingenuous to present a File object that can read lines of text and then claim that a Unicode in-memory representation is the original data, because it isn’t; it’s necessarily mapped through some kind of encoding rules. A lot of coders are used to the notion that a character is an 8-bit value, and there’s a certain kind of trust in knowing that the 8-bit values you see in memory match the 8-bit values on disk.
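
To make that mismatch concrete, here is a small sketch in Python (not Smile, and not anything from Smile’s File object). It assumes a file saved as ISO Latin-1; the `try_decode` helper is a made-up name for illustration. The same word is 8 bytes on disk, 8 characters in memory, and 9 bytes once re-encoded as UTF-8, and guessing the wrong encoding simply fails:

```python
# Bytes as they might sit on disk in an ISO Latin-1 file (assumed input).
raw = b"fran\xe7ais"

# Decoding maps the bytes through encoding rules into Unicode text.
text = raw.decode("latin-1")

# Re-encoding the same text as UTF-8 produces different bytes.
utf8 = text.encode("utf-8")


def try_decode(data, encoding):
    """Hypothetical helper: return the decoded text, or None if the bytes
    are not valid in the given encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


print(text, len(raw), len(text), len(utf8))
print(try_decode(raw, "utf-8"))
```

So the in-memory string is an interpretation of the file, not a copy of it, which is exactly the loss of trust described above.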

I debated all that back and forth over the last few weeks, and I finally concluded that I have to build for the future. Smile is a language that needs to be designed to last the fifty years its predecessors lasted, and I can’t make it last if I don’t design it for globalized needs. Nasser’s article was the final thing that tipped the scales: Smile is now and ever will be Unicode, and will be pretty darned friendly to non-English speakers.

So on-and-off, I’ve been building the Unicode tables. Even though the current interpreter is written in C#, Smile is not a .NET language, and the interpreter will be ported to C++ at some point in the not-too-distant future. To support that, I need Unicode data tables, and lots of them: Case-folding tables, case-conversion tables, character-type tables, and special symbol-grouping tables that are really critical for Smile’s needs.
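
For a feel of what those different tables answer, here is a rough Python sketch that leans on Python’s built-in Unicode data rather than on anything from Smile. The `describe` function is a hypothetical name, and Smile’s own tables may well be organized differently:

```python
import unicodedata


def describe(ch):
    """Look up one character in three kinds of Unicode data (illustrative only)."""
    return {
        "upper": ch.upper(),          # case conversion
        "lower": ch.lower(),          # case conversion
        "folded": ch.casefold(),      # case folding, for caseless comparison
        "category": unicodedata.category(ch),  # character type
    }


for ch in ("ß", "ς", "é"):
    print(ch, describe(ch))
```

The German “ß” uppercases to two letters, and the Greek final sigma “ς” is already lowercase but folds to “σ”, which is why case folding and case conversion can’t share a single table.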

So I can now legitimately write this:

    出力 = Stdout
    出力.印刷 = Stdout.print

    挨拶 = new {
        English: "Hello!"
        français: "Bonjour!"
        español: "¡Hola!"
        日本語: "こんにちは!"
    }

    if ユーザー == 59 then
        出力 印刷 挨拶.日本語

That’s not a perfect language conversion, since “if” and “then” and “new” are still English keywords (and a handful of base keywords like that have to remain reserved), but a Japanese person should be pretty comfortable reading that nonetheless. It’s far more non-English support than you get from most programming languages.

There’s an interesting problem, though, in a programming language that is designed around symbolic computation: When are two symbols equivalent? If all you have is ASCII, it’s pretty straightforward, but it gets a lot harder when you’re trying to decide if this “é” is equivalent to that “é”, or worse, if “é” is equivalent to “e”. So in Smile, there are certain rules:

  • Most letter characters from most languages are allowed in symbols. That means “e” and “ñ” and “ה” and “出” can all be used for variables, function names, properties, methods, quoted symbols — you name it, you can use it.
  • Accented characters are compared via their decomposed forms. So that means “é” is equivalent to “é”, no matter how you encode it, but it is not equivalent to “e” or “è” or “ë”.
  • You can’t mix language groups within a single symbol. This is because some scripts contain letters that look just like letters from other scripts (Cyrillic “а” and Latin “a”, for example), and we need to make sure that when you write “a” it’s actually “a” and not something else. So “出力”, “印刷”, and “ユーザー” are all perfectly acceptable symbol names, but “ユーザーID” is not allowed. Smile contains an internal table that says that “e” comes from the Latin family and “出” comes from the CJK family, and you can’t use them both in the same name: You have to pick a language and stick with it.
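
The rules above can be sketched in a few lines of Python. This is only an approximation under loud assumptions, not Smile’s actual implementation: it uses NFD normalization as a stand-in for “compared via their decomposed forms”, it guesses each character’s family from the prefix of its Unicode name instead of using a hand-built table, and it lumps Hiragana and Katakana into one family. All of the function names and the `FAMILY_PREFIXES` table are hypothetical.

```python
import unicodedata

# Hypothetical family table, keyed on Unicode character-name prefixes.
# Smile's real table is built by hand and is surely more precise.
FAMILY_PREFIXES = (
    ("LATIN", "Latin"),
    ("GREEK", "Greek"),
    ("COPTIC", "Greek"),
    ("CYRILLIC", "Cyrillic"),
    ("HEBREW", "Hebrew"),
    ("HIRAGANA", "Kana"),   # assumption: both kana scripts share a family
    ("KATAKANA", "Kana"),
    ("CJK", "CJK"),
)


def canonical(symbol):
    """Decompose accented characters so every encoding compares the same."""
    return unicodedata.normalize("NFD", symbol)


def same_symbol(a, b):
    """Two symbols are equivalent if their decomposed forms match."""
    return canonical(a) == canonical(b)


def family_of(ch):
    """Guess a character's family from its Unicode name, or None if unknown."""
    name = unicodedata.name(ch, "")
    for prefix, family in FAMILY_PREFIXES:
        if name.startswith(prefix):
            return family
    return None


def families(symbol):
    """Collect the families used by the letters of a symbol."""
    found = set()
    for ch in canonical(symbol):
        if unicodedata.category(ch).startswith("M"):
            continue  # combining marks belong to the letter before them
        found.add(family_of(ch))
    return found


def is_valid_symbol(symbol):
    """A symbol must use exactly one known family (letters only in this sketch)."""
    fams = families(symbol)
    return len(fams) == 1 and None not in fams


print(same_symbol("\u00e9", "e\u0301"))   # precomposed vs. e + combining accent
print(same_symbol("\u00e9", "e"))
print(is_valid_symbol("ユーザー"), is_valid_symbol("ユーザーID"))
```

The sketch ignores digits, underscores, and everything else a real identifier grammar has to deal with; it only shows how decomposition and a family lookup combine to answer “are these the same symbol?” and “is this name allowed?”.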

But so what? You can still almost exclusively code in your native tongue, and that’s a good thing.

So far, I’ve implemented proper support for European languages (anything using Latin-based orthographies, including English, French, Spanish, German, and so on), as well as Greek/Coptic, Cyrillic (Russian and friends), Hebrew, and Japanese (so far, only Katakana/Hiragana: No Kanji yet). This means, unfortunately, that Mr. Nasser’s Arabic is still left out: I don’t know Arabic well enough to implement support for it. But hey, it’s at least possible, and crowd-sourcing for the win: There’s room at Smile’s table for everybody to sit down, even if you have to bring your own chair.

It’ll probably take a while to finish this port. The data tables are being built by hand, all couple-hundred-thousand lines or so, and that takes a while. But a good modern language needs proper Unicode support, and I would be negligent in my duty to make a truly future-proof language if I didn’t make Smile able to do that.