Logo Pending

Rules for UTF-8 in SCI11+

  1. It won’t work in ScummVM yet, as nothing uses it yet so I see no reason to add it. Most of the rest of SCI11+’s gimmicks do though.
  2. The internal “draw a string” function, used to write literally anything on to the screen, is where the magic happens: if the current port’s current font has more than 256 glyphs in it, the input string is interpreted as UTF-8. If it does not, things work exactly as usual.
  3. Because combining characters and glyph substitution are not supported and general punctuation like “” and are all the way in the 20002044 range, the General Punctuation block’s glyphs take the place of Combining Diacritical Marks as 03000344.
  4. Similarly, CJK Symbols, Hiragana, and Katakana are moved from 300030FF to 020002FF, where some Latin Extended-B, IPA Extensions, and Spacing Modifiers should go.
  5. Those last two points apply to the font data, not the actual text.
  6. The new kernel functions UTF8to16 and UTF16to8 will always consider their inputs to be in Unicode, no matter what the current port’s current font says. Unless you built an SCI11+ with UTF-8 support disabled, in which case none of the above applies and all these two functions do is turn 8-bit values into 16-bit.
  7. The kernel function to turn a string lower or upper case, StrCase, unlike the two I just described, will check the current font and act like it used to same as the “draw a string” function.
  8. The functions to get the lower or upper case version of a character that StrCase ends up using, tolower and toupper, have been extended to cover the full 256-character range. Several maps are included and one can be chosen at build time. We have maps for code page 437, Win-1252, ISO 8859-1, and a fair bit of Unicode.
  9. In general, SCI11+ can be considered to use Unicode 1.1 on account of SCI 1.001.100 dating from 1993, going by the version numbers and release dates for Freddy Pharkas (1.001.095) nd Leisure Suit Larry 6 (1.001.115).
[ , , ] Leave a Comment

Noses are a waste of bandwidth

<Maxane> But then again, rather not ;-p

So yeah. That’s me referencing some old IRC thing moments before I decided to write this post, but it got me thinking. There’s the joke that adding a nose to your classic :) smileys is a waste of bandwidth, but how does it stack up against modern emoji?

This is just a short observation, mind you. Done in minutes.

Display Characters UTF8 bytes
:) 2 3A 29
:-) 3 3A 2D 29
🙂 1* F0 9F 99 82

There’s really not much to it, innit?

* depending on how you define ’em.

[ ] Leave a Comment

On fonts

SCI, being rooted in MS-DOS and from a time before Unicode (fun fact, the first draft proposal dates back to 1988, when King’s Quest 4 came out as the first SCI game), SCI uses an 8-bit string format. That is, each character in a string is one byte, and that’s all it can be. Making strings one of very few standard data types in SCI that aren’t 16-bits and requiring a dedicated kernel call to manipulate (as seen in KQ4 Copy Protection) but that’s not the point here.

American releases of SCI games would normally have font resources ranging only up to 128 characters, with the Sierra logo at 0x01 and ~ at 0x7E. Only caring about newlines, all other characters are considered printable. European releases would include usually not 256 but 226 characters, up to ß at 0xE1, basically copying code page 437 but leaving out the graphical elements among others. This means, of course, that a Russian translation of such a game would require another custom font copying code page 866 instead.

And then there’s the whole thing where SCI Companion uses the Win-1252 code page (it’s not exactly a Unicode application) which makes translated games look pretty wild:

Ich glaube Dir gern, daá Du das tun m”chtest!

That doesn’t look quite right. That’s supposed to be “Ich glaube Dir gern, daß Du das tun möchtest!” And indeed, comparing things between DOS-437 and Win-1252, we see that á and ß are both encoded in the same byte value.

That’s the kind of bullshit Unicode was made for, isn’t it?

So what I did for my SCI11+ project, of which one version is used in The Dating Pool, is to add optional basic Unicode support, so you can write text data in UTF-8 and not have to worry about things all that much. There are however two major problems with this idea. One of them is that SCI Companion is not a Unicode-aware program, so you can’t use that to write the text data. That’s easily solved with external resource editors that are. The second is more insidious — the fonts.

What I found out about SCI font resources is this: their header fields are way too wide.

typedef struct
  word lowChar;
  word highChar;
  word pointSize;
  word charRecs[0];
} Font;

lowChar is always zero, but the interpreter does acknowledge it. highChar is, as discussed above, always 128, 226, or 256. The fact that it’s exclusive is basically the only reason to have it not be a char-type value.

.if bx < es:[si].highChar && bx >= es:[si].lowChar

See? Exclusive. But the takeaway here is obvious. SCI font files can contain up to 65,535 characters. That’s enough to cover the Basic Multilingual Plane. As such, I’ve added handling of double- and triple-byte UTF-8 sequences to SCI11+. I’ve tested it, too:

Switching to another, non-extended font, I expected to see "Tend ", and that’s exactly what I got. The routine I linked to would decode an ō and dutifully pass it on down to StdChar, which would see that 0x14D is way higher than 226 and simply draw a blank.

(Now, between the first draft of this post and its publication, I’ve further enhanced this system to not decode anything if the font has fewer than 256 characters, falling back to code page 437 or whatever, just not doing anything special.)

That leaves one last issue, which is mostly a matter of wasted space. I like my quotation marks to be proper curly, and in Win-1252 as The Dating Pool uses (because why shouldn’t it?) this is easy — just draw a  and in the font at 0x93 and 0x94,  and be done with it. But in Unicode, these two characters are part of the General Punctuation block, which starts all the way at U+2000. That would mean defining up to that many dummy characters. A two-byte pointer, two size bytes, and a single byte with at best one bit set per character.

That’s bullshit.

As such, I’d propose to cheat like hell and move the General Punctuation block so it covers the much earlier Combining Diacritical Marks block. It’d be way too much of a nightmare to support those. So while measuring and drawing, detect if you’re in the 0x2000 to 0x206F range, subtract 0x2000, add 0x300, and use that character instead. Or have the custom resource tools  that we’d need anyway do it.

(Again, just before publication, I came up with an idea to have a new font generation tool that takes a bitmap of the font and converts it. The trick for space-saving is that it would recognize graphics it’d already processed and simply place a pointer to the first one. Instead of five bytes per dummy, it’d use only two for all but the first. Savings!)

(Update: I made the thing.)

At any rate, your input is appreciated.

Except for yours, Covarr.

[ , ] Leave a Comment