Logo Pending

Noses are a waste of bandwidth

<Maxane> But then again, rather not ;-p

So yeah. That’s me referencing some old IRC thing moments before I decided to write this post, but it got me thinking. There’s the joke that adding a nose to your classic :) smileys is a waste of bandwidth, but how does it stack up against modern emoji?

This is just a short observation, mind you. Done in minutes.

Display Characters UTF8 bytes
:) 2 3A 29
:-) 3 3A 2D 29
🙂 1* F0 9F 99 82

There’s really not much to it, innit?

* depending on how you define ’em.

[ ] Leave a Comment

On fonts

SCI, being rooted in MS-DOS and from a time before Unicode (fun fact, the first draft proposal dates back to 1988, when King’s Quest 4 came out as the first SCI game), SCI uses an 8-bit string format. That is, each character in a string is one byte, and that’s all it can be. Making strings one of very few standard data types in SCI that aren’t 16-bits and requiring a dedicated kernel call to manipulate (as seen in KQ4 Copy Protection) but that’s not the point here.

American releases of SCI games would normally have font resources ranging only up to 128 characters, with the Sierra logo at 0x01 and ~ at 0x7E. Only caring about newlines, all other characters are considered printable. European releases would include usually not 256 but 226 characters, up to ß at 0xE1, basically copying code page 437 but leaving out the graphical elements among others. This means, of course, that a Russian translation of such a game would require another custom font copying code page 866 instead.

And then there’s the whole thing where SCI Companion uses the Win-1252 code page (it’s not exactly a Unicode application) which makes translated games look pretty wild:

Ich glaube Dir gern, daá Du das tun m”chtest!

That doesn’t look quite right. That’s supposed to be “Ich glaube Dir gern, daß Du das tun möchtest!” And indeed, comparing things between DOS-437 and Win-1252, we see that á and ß are both encoded in the same byte value.

That’s the kind of bullshit Unicode was made for, isn’t it?

So what I did for my SCI11+ project, of which one version is used in The Dating Pool, is to add optional basic Unicode support, so you can write text data in UTF-8 and not have to worry about things all that much. There are however two major problems with this idea. One of them is that SCI Companion is not a Unicode-aware program, so you can’t use that to write the text data. That’s easily solved with external resource editors that are. The second is more insidious — the fonts.

What I found out about SCI font resources is this: their header fields are way too wide.

typedef struct
  word lowChar;
  word highChar;
  word pointSize;
  word charRecs[0];
} Font;

lowChar is always zero, but the interpreter does acknowledge it. highChar is, as discussed above, always 128, 226, or 256. The fact that it’s exclusive is basically the only reason to have it not be a char-type value.

.if bx < es:[si].highChar && bx >= es:[si].lowChar

See? Exclusive. But the takeaway here is obvious. SCI font files can contain up to 65,535 characters. That’s enough to cover the Basic Multilingual Plane. As such, I’ve added handling of double- and triple-byte UTF-8 sequences to SCI11+. I’ve tested it, too:

Switching to another, non-extended font, I expected to see "Tend ", and that’s exactly what I got. The routine I linked to would decode an ō and dutifully pass it on down to StdChar, which would see that 0x14D is way higher than 226 and simply draw a blank.

(Now, between the first draft of this post and its publication, I’ve further enhanced this system to not decode anything if the font has fewer than 256 characters, falling back to code page 437 or whatever, just not doing anything special.)

That leaves one last issue, which is mostly a matter of wasted space. I like my quotation marks to be proper curly, and in Win-1252 as The Dating Pool uses (because why shouldn’t it?) this is easy — just draw a  and in the font at 0x93 and 0x94,  and be done with it. But in Unicode, these two characters are part of the General Punctuation block, which starts all the way at U+2000. That would mean defining up to that many dummy characters. A two-byte pointer, two size bytes, and a single byte with at best one bit set per character.

That’s bullshit.

As such, I’d propose to cheat like hell and move the General Punctuation block so it covers the much earlier Combining Diacritical Marks block. It’d be way too much of a nightmare to support those. So while measuring and drawing, detect if you’re in the 0x2000 to 0x206F range, subtract 0x2000, add 0x300, and use that character instead. Or have the custom resource tools  that we’d need anyway do it.

(Again, just before publication, I came up with an idea to have a new font generation tool that takes a bitmap of the font and converts it. The trick for space-saving is that it would recognize graphics it’d already processed and simply place a pointer to the first one. Instead of five bytes per dummy, it’d use only two for all but the first. Savings!)

(Update: I made the thing.)

At any rate, your input is appreciated.

Except for yours, Covarr.

[ , ] Leave a Comment