Why is ROM translation so (technologically) hard?

Duck Penis

Posted on 19-06-19, 00:31

Oftentimes, when enquiring about (Japanese -> English) translation, you hear that a certain game can't be translated because of the pointers. As in, you need to physically relocate the string, because the current buffer it's in isn't long enough.

But why is this the case? Why can't you just edit the strings in-place? English is a very predictable language, and with a custom alphabet you could encode it very efficiently.

Take for instance the string: "ウィキペディアへようこそウィキペディアへようこそウィキペディアは誰でも編集できるフリー百科事典です". This is 49 characters long. The official English translation, "Welcome to Wikipedia, the free encyclopedia that anyone can edit.", is 65.

But the Japanese glyphs are physically bigger. What prevents you from just taking two characters in one glyph? The sprites are physically editable, and with such alterations it becomes trivial. Take "Welcome to Wikipedia, the free encyclopedia that anyone can edit." and say you gave the top 30 digrams (from this page) a glyph each.

Then this example string would collapse to "Welcome _ Wikip_ia, _e f_e _cyclop_ia __ _y_e c_ __." - 52 characters. If you have several thousand slots, then you could probably use the top 100 bigrams, and you get "W___ _ Wiki__a, _e f_e _cyc___a __ _y_e _n __." - 46 chars. You could get this down even further if you had a better bigram list that considered spaces, trigrams, etc, or even an editor which dynamically generates one based on the game script and tells you what compresses well and what poorly.
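A minimal sketch of the substitution described above, assuming a greedy left-to-right scan and a hypothetical pair table that contains only the digrams needed for the first example (the real top-30 list is not reproduced here). The underscore stands in for one custom glyph.

```python
def pack_digrams(text, digrams):
    """Greedy left-to-right scan: each listed letter pair becomes one slot."""
    table = set(digrams)
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in table:
            out.append("_")  # stands in for a single two-letter glyph
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

# 65 characters, counting the final period.
SENTENCE = "Welcome to Wikipedia, the free encyclopedia that anyone can edit."
# Hypothetical table: just enough pairs to reproduce the first example.
PAIRS = ["to", "ed", "th", "re", "en", "at", "an", "on", "it"]

packed = pack_digrams(SENTENCE, PAIRS)
print(packed, len(packed))
```

A greedy scan is not guaranteed to give the shortest packing for every table, since taking one pair can block a better overlapping one.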

And even without all this, you could just abbreviate your writing: "Welcome to Wikipedia, a free lexicon you can edit" (49 chars, same as Japanese). Or take a page from the journalists' playbook: "Welcome –free editable lexicon WP" (33 chars).

So why is translation for some platforms considered impossible? PSX is the big one: you hear a lot about games being untranslatable because the tools aren't there to find all the pointers and move them.

I mean, someone ought to have had this idea before and scrapped it for some reason, but why? Тоо fеw glурhs? Or do es the ker nin g beco rn e unbea rab ly u gly?

There was a certain photograph about which you had a hallucination. You believed that you had actually held it in your hands. It was a photograph something like this.

Kawaoneechan

Posted on 19-06-19, 07:09

Did you really just imply that using some weird digraph encoding with frequency analysis and custom tools and all that is easier than shifting some pointers around? Cos the difficulty of pointer shifting varies per system -- GBA is absolutely lovely for example.

And what about player input like hero names?

Duck Penis

Posted on 19-06-19, 14:56

Posted by Kawa
Did you really just imply that using some weird digraph encoding with frequency analysis and custom tools and all that is easier than shifting some pointers around? Cos the difficulty of pointer shifting varies per system -- GBA is absolutely lovely for example.

And what about player input like hero names?

On some systems it ought to be easier, since finding the pointers is supposedly nigh-impossible.

The tooling wouldn't need to be custom per-game, since the basic operations are the same: gin up the character sprites which most efficiently encode the game script (knapsack problem), edit the game script to use them (knapsack problem again), then patch in the edited game script.
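As a rough illustration of the first step, here is a sketch that picks glyph candidates by raw frequency. This is a plain counting heuristic, not a knapsack solver, and the tiny script and the slot count are made up for the example.

```python
from collections import Counter

def top_digrams(script_lines, slots):
    """Count adjacent letter pairs across the script; return the most common."""
    counts = Counter()
    for line in script_lines:
        for a, b in zip(line, line[1:]):
            if a.isalpha() and b.isalpha():
                counts[a + b] += 1
    return [pair for pair, _ in counts.most_common(slots)]

# Hypothetical stand-in for a dumped game script.
script = ["the cat", "the hat", "that then", "at bat"]
best = top_digrams(script, 2)
print(best)
```

Overlapping pairs are counted independently here ("that" counts th, ha, and at, though only two of them can be used at once), so the real savings are lower than the counts suggest. That overlap is part of what makes the exact version a harder optimisation problem.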

Player names would probably not get any digraph encodings and just have to deal with having the same max length in alphabetic letters and syllabics. An ambitious translator could put in code to try to merge letters as they are written, but it would be a lot of work.

Kawaoneechan

Posted on 19-06-19, 16:24

As an example of how simple pointer magic can be: the GBA has 32-bit pointers and anything in ROM starts with 0x08000000, because that's how the memory is mapped. So if you know the text is at location 0xBADF00 in the file, you know to look for the pointer 0x08BADF00, or the byte sequence 00 DF BA 08. This is almost universal for that system. You can now freely expand the ROM if needed, even going beyond 0xFFFFFF bytes, with the rare Bank Nine pointer.
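The lookup described above can be sketched like this. The function names are made up for illustration, and the sketch assumes the pointer is stored as a plain little-endian 32-bit value, which is the common case the post describes.

```python
import struct

GBA_ROM_BASE = 0x08000000  # where cartridge ROM is mapped, per the post above

def pointer_bytes(file_offset):
    """Little-endian 32-bit pointer for an offset into the ROM file."""
    return struct.pack("<I", GBA_ROM_BASE + file_offset)

def find_pointers(rom, file_offset):
    """Every position in rom where a pointer to file_offset is stored."""
    needle = pointer_bytes(file_offset)
    hits = []
    pos = rom.find(needle)
    while pos != -1:
        hits.append(pos)
        pos = rom.find(needle, pos + 1)
    return hits

def repoint(rom, old_offset, new_offset):
    """Overwrite each pointer to old_offset in a bytearray; return the count."""
    hits = find_pointers(rom, old_offset)
    for pos in hits:
        rom[pos:pos + 4] = pointer_bytes(new_offset)
    return len(hits)

print(pointer_bytes(0xBADF00).hex(" "))  # 00 df ba 08
```

A raw byte search can also match data that merely looks like a pointer, so hits are worth checking by hand before patching. Moving text past the first 16 MiB yields a pointer whose top byte is 09 rather than 08.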

I should know, I literally just checked.

Duck Penis

Posted on 19-06-19, 19:44

Well, yeah, then it's easy. But for some platforms, it can supposedly be much harder.

Seems like someone else had a similar idea:

I was aware they made a custom font which had only specific combinations of 2 letters and wondered how many different sentences would even be writable with that limited set, but that just allows for a 2:1 ratio.

For older games, it would also make the need for proportional fonts less pressing because you could vary the character density as long as you still had room to spare.

CaptainJistuce

Posted on 19-06-19, 21:10

Congratulations. You just invented dual-tile encoding.

--- In UTF-16, where available. ---

Kawaoneechan

Posted on 19-06-19, 21:21

Pokémon TCG on the GBC.

Note how the player's name is wide.
Note how the opponent's name is not.

creaothceann

Posted on 19-06-19, 21:24

Posted by sureanem
But the Japanese glyphs are physically bigger. What prevents you from just taking two characters in one glyph?

That's called VWF (variable-width (proportional) font). Multiple BG or sprite tiles are treated as a canvas onto which the characters are drawn at runtime.

My current setup: Super Famicom ("2/1/3" SNS-CPU-1CHIP-02) → SCART → OSSC → StarTech USB3HDCAP → AmaRecTV 3.10

Kawaoneechan

Posted on 19-06-19, 21:34

Posted by creaothceann
That's called VWF (variable-width (proportional) font). Multiple BG or sprite tiles are treated as a canvas onto which the characters are drawn at runtime.

Importantly, the glyphs aren't tile-aligned! Unlike the text in those PokéCard shots, every character is exactly as wide as it needs to be.

Unfortunately, this does not seem to be what sureanem meant. After all, the first paragraph is all about pointers, not screen real estate. So let's back that truck up a little, eh?

Duck Penis

Posted on 19-06-19, 21:40

Posted by CaptainJistuce
Congratulations. You just invented dual-tile encoding.

Well, I reckon it'd be a common technique, but I couldn't find anything.
So why isn't it used more? This site suggests it's used in RPGs, but why never in fan translations?

Posted by Kawa
Pokémon TCG on the GBC.

Note how the player's name is wide.
Note how the opponent's name is not.

How's that work? Do they store player names as wchar_t's but game text as chars?

Posted by creaothceann

Posted by sureanem
But the Japanese glyphs are physically bigger. What prevents you from just taking two characters in one glyph?

That's called VWF (variable-width (proportional) font). Multiple BG or sprite tiles are treated as a canvas onto which the characters are drawn at runtime.

No, at compile time. You have all the 26 letters of the alphabet, then fill up the remaining slots with di/trigraphs. The drawing routines see opaque monospace characters, but some of these "characters" contain multiple letters.
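To make the idea concrete, here is a small sketch of such a table, assuming a hypothetical layout where slots 0 to 25 hold a-z and the extra slots are chosen by the translator. The game's drawing routine would only ever see the slot numbers.

```python
import string

def build_table(extra_glyphs):
    """Slots 0-25 hold a-z; the remaining slots hold multi-letter glyphs."""
    return list(string.ascii_lowercase) + list(extra_glyphs)

def render(codes, table):
    """What the player reads: one fixed-width tile per code, whatever it depicts."""
    return "".join(table[c] for c in codes)

# Hypothetical extra glyphs; a real table would come from the script.
table = build_table(["th", "ed", "it", "the"])
codes = [26, 0, 19]  # "th" + "a" + "t"
print(render(codes, table), len(codes))
```

Three stored codes produce four visible letters, and no text routine in the game has to change, only the font graphics and the script bytes.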

Kawaoneechan

Posted on 19-06-19, 22:14

Posted by sureanem
How's that work? Do they store player names as wchar_t's but game text as chars?

What we see here is a complete misunderstanding of character data types. But this does bring us closer to the original topic of string length differences in translation requiring pointer adjustments!

A char is always one byte in size, so a single char can only hold one character of a single-byte character set. A wchar_t is literally a wide char, but that's not a visual thing! Wide characters are 16-bit, if not 32. That's the difference. That's why UTF-16 needs wchar or some other 16-bit type, while UTF-8, whose code units are byte-sized, can be stored in regular chars, using several of them per character where needed.

Japanese games from the '90s that have kanji tend towards Shift-JIS or EUC if they don't use their own custom encoding.

Your original statement that "ウィキペディアへようこそウィキペディアへようこそウィキペディアは誰でも編集できるフリー百科事典です" is 49 characters long is misleading. How many bytes does it take to store that sentence? In UTF-8 the answer is 147 bytes. In Shift-JIS and EUC it's 98 bytes each, two per character. No encoding that is strictly one byte to one character can possibly store this entire sentence in 49 bytes. Many SNES RPGs, just to pigeonhole, may have a custom encoding that fits the alphabet, numbers, and some symbols together with both sets of kana in just under 256 characters, then have special control characters to insert kanji. That kind of encoding might allow your sentence to pack down to 42 bytes for the kana, plus... let's say three bytes per kanji (kanji control char plus 16-bit char #) for the seven kanji, which equals 63 bytes. And a 64th for the string terminator.
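These byte counts are easy to check. The sketch below uses Python's standard codecs and counts the 49 characters alone, with no line ending, terminator, or byte-order mark. The `custom_size` function is a hypothetical model of the SNES-style scheme described above (one byte per kana, three per kanji), not any particular game's format.

```python
SENTENCE = (
    "ウィキペディアへようこそウィキペディアへようこそ"
    "ウィキペディアは誰でも編集できるフリー百科事典です"
)

def encoded_sizes(text):
    """Byte length of text in three common encodings."""
    names = ("utf-8", "shift_jis", "euc_jp")
    return {name: len(text.encode(name)) for name in names}

def custom_size(text, kanji_cost=3):
    """Hypothetical scheme: 1 byte per kana, kanji_cost bytes per kanji."""
    kanji = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    return (len(text) - kanji) + kanji * kanji_cost

sizes = encoded_sizes(SENTENCE)
print(len(SENTENCE), sizes, custom_size(SENTENCE))
```

The kanji test here only covers the main CJK Unified Ideographs block, which is enough for this sentence.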

What a coincidence, that's almost as long as the translation! Even though one of them has punctuation and the other does not.

Ironically, that leaves only screen real estate -- their visual lengths -- as an issue.

I just looked into PokéCards bee-tee-dubbs, and even though characters are either 4x8 or 8x8, it does use a proper VWF engine. As in, there is a dedicated canvas area in the tileset. But it also remembers which characters/pairs were already drawn, so "KAWA is crazy about Pokémon and Pokémon card collecting!" appears in the tileset as "KAW is crazy about Pokémon and card collecting!" Also to my surprise, the game uses near-ASCII! That line is stored as "\x09 is crazy about Pok`mon.and Pok`mon card collecting!". It's that 0x09 that makes it write the player's name and that specifically draws in 8x8 characters. Why? I don't know.
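A control code like that 0x09 can be modelled with a very small expander. This is only a sketch: the backtick-to-é mapping is an assumption inferred from "Pok`mon" in the stored line, and the game's real text routine is not shown in this thread.

```python
PLAYER_NAME = 0x09  # control byte seen in the stored line quoted above
GLYPH_OVERRIDES = {"`": "\u00e9"}  # assumption: "`" is drawn as e-acute

def expand(raw, player_name):
    """Turn stored near-ASCII bytes into the text the player sees."""
    out = []
    for byte in raw:
        if byte == PLAYER_NAME:
            out.append(player_name)  # substituted at display time
        else:
            ch = chr(byte)
            out.append(GLYPH_OVERRIDES.get(ch, ch))
    return "".join(out)

line = expand(b"\x09 is crazy about Pok`mon", "KAWA")
print(line)
```

Because the name is substituted at display time, the stored string length says nothing about how wide the line ends up on screen.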

CaptainJistuce

Posted on 19-06-20, 01:13

" So why isn't it used more? This site suggests it's used in RPGs, but why never in fan translations?"
Because it is a fairly limited technique, and if you're going to be making new code anyways, either code a better compression system* or relocate the text to somewhere you can actually fit what you need.

It does get used when a game already implements it... sometimes. I think I've seen fan translations that removed the dual-tile encoding to add in something better.
It turns out that people capable of doing fan translations tend to also be fairly well-versed in what you can feasibly do with a system while it is running the game they're working on.
Yes, there are exceptions where the script is stored in plain ASCII or the text to be translated is all present as uncompressed images, but... a lot of the time it requires a good deal of assembly and platform knowledge to even extract the script in the first place, and a good deal more to get it back into the game later. Sometimes you even have to rework chunks of the game engine (I recall one title where the text was printed in vertical text boxes on the right side of the screen, and the hackers had to completely replace the text drawing code to get something that was useful for English).

Your fundamental assumption that rawmhaxxors don't know what they're doing and are leaving even the most basic optimizations on the table is... questionable.

*Exception for when RAM limitations or available processing power prevent you from using something better.

Kakashi

Posted on 19-06-20, 11:56

I honestly think someone on this board has constantly got people searching for shitty diapers when it's quite obvious that they've just farted.

Duck Penis

Posted on 19-06-20, 12:17

Posted by CaptainJistuce
" So why isn't it used more? This site suggests it's used in RPGs, but why never in fan translations?"
Because it is a fairly limited technique, and if you're going to be making new code anyways, either code a better compression system* or relocate the text to somewhere you can actually fit what you need.

It does get used when a game already implements it... sometimes. I think I've seen fan translations that removed the dual-tile encoding to add in something better.
It turns out that people capable of doing fan translations tend to also be fairly well-versed in what you can feasibly do with a system while it is running the game they're working on.
Yes, there are exceptions where the script is stored in plain ASCII or the text to be translated is all present as uncompressed images, but... a lot of the time it requires a good deal of assembly and platform knowledge to even extract the script in the first place, and a good deal more to get it back into the game later. Sometimes you even have to rework chunks of the game engine (I recall one title where the text was printed in vertical text boxes on the right side of the screen, and the hackers had to completely replace the text drawing code to get something that was useful for English).

Your fundamental assumption that rawmhaxxors don't know what they're doing and are leaving even the most basic optimizations on the table is... questionable.

*Exception for when RAM limitations or available processing power prevent you from using something better.

Well, all that makes sense. But how come it never gets used in games where you can't move the pointers around? I get that there are better compression methods, but those aren't used either. For Nintendo 64, the strings usually have printf-esque format specifiers and are stored uncompressed, so I'd imagine the problems you outline wouldn't apply to N64, PSX, and later.

Not saying they're leaving optimizations on the table, of course they're not. There must be a good reason why it's not a good way to go at things.

Kawaoneechan

Posted on 19-06-20, 12:52

Why would you not be able to move the pointers around? Some systems may have bankswitching limitations but that hardly means you can't do it.

Also please address the part where your Japanese example sentence takes more bytes of storage than the English equivalent.

Duck Penis

Posted on 19-06-20, 19:25

Posted by Kawa
Why would you not be able to move the pointers around? Some systems may have bankswitching limitations but that hardly means you can't do it.

Also please address the part where your Japanese example sentence takes more bytes of storage than the English equivalent.

You might have the pointers indirectly computed through some hell of lookup tables and offsets, so while technically possible it'd be a lot of work.

I reckoned the developers would use something like old-style UTF-16 where everything is two bytes for simplicity. If they use a var-width character encoding, then getting the translations to fit without relocation shouldn't really be a big problem, right?

KingMike

Posted on 19-06-20, 19:35

Also, putting two characters on one kanji is a brilliant idea, you say?
You know that games with lots of kanji (or more than a handful) usually use two bytes per kanji? So how is using two bytes to represent a two-character tile saving space compared to one byte for one character?

Also, I thought about using two characters on a tile early in my "career" but I stopped when I realized it looks like shit.
(other people say they're okay with limiting it to pairs like il and ll, but I think it still looks ugly when all other characters are evenly spaced)
Of course using VWF would fix the looking like shit part, but then you're back to the first problem.
I've only done that with the RPG Maker games where I'm FORCED to fit text within the available space (due to the specific nature that the games are to be editable in-game and thus compatible with the in-game engine. Also because the menu text is embedded within some custom programming language. And I sure don't want to spend more time fully reverse-engineering that language so I can write a re-encoder. See, that's another thing that comes up in SNES games; I heard the Romancing SaGa games did that.)

Kawaoneechan

Posted on 19-06-20, 19:45 (revision 1)

Trick question: repointing is a lot of work regardless, even if it's as simple a format as the GBA. Maybe you have a banked system and each bank has its own list of pointers to its constituent strings. And then in one particular bank there's a list of pointers where each individual bank's list starts, and each of those pointers, unlike the GBA's 32, is only 16 bits wide.
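A sketch of that two-level lookup, under an entirely hypothetical layout: 16 KiB banks, a master table of 16-bit little-endian entries giving where each bank's string list starts, and per-bank lists of 16-bit offsets relative to the start of the bank. Real games differ in all of these details.

```python
import struct

BANK_SIZE = 0x4000  # hypothetical bank size

def read_u16(rom, pos):
    """Read one little-endian 16-bit value."""
    return struct.unpack_from("<H", rom, pos)[0]

def string_offset(rom, master_table, bank, index):
    """Resolve string `index` of `bank` to a file offset via two lookups."""
    bank_base = bank * BANK_SIZE
    list_start = bank_base + read_u16(rom, master_table + 2 * bank)
    return bank_base + read_u16(rom, list_start + 2 * index)

# Tiny made-up ROM: master table at 0x0100, bank 1's list at bank offset 0x10.
rom = bytearray(2 * BANK_SIZE)
struct.pack_into("<H", rom, 0x0102, 0x0010)
struct.pack_into("<H", rom, 0x4010, 0x0020)
struct.pack_into("<H", rom, 0x4012, 0x0030)
print(hex(string_offset(rom, 0x0100, 1, 1)))
```

With 16-bit entries a string can only move within its own bank. Moving it to another bank means changing which list it belongs to, so every later index in both lists shifts as well.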

As for variable-width character encoding, what do you mean exactly? I feel like I have to ask just so we're all clear, considering earlier revelations.

Edit: hah, didn't see KingMike's post. Again, it's not about screen real estate, but about storage space. The former wouldn't need pointer rework but UI element resizing and moving at worst. On a typical tilemap with 8x8 pixel tiles, kanji are almost always four tiles (2x2) in size simply because they're too intricate compared to kana and romaji.

Duck Penis

Posted on 19-06-20, 19:53

Posted by KingMike
Also, putting two characters on one kanji is a brilliant idea, you say?
You know that games with lots of kanji (or more than a handful) usually use two bytes per kanji? So how is using two bytes to represent a two-character tile saving space compared to one byte for one character?

You can put more than two characters in a kanji. Anyhow, if you're using a variable-width encoding there's not much of a point, that's true.

Also, I thought about using two characters on a tile early in my "career" but I stopped when I realized it looks like shit.
(other people say they're okay with limiting it to pairs like il and ll, but I think it still looks ugly when all other characters are evenly spaced)

Even if it's consistently kerned? It should at least be readable, and better than gargantuan inter-letter spacing.

Posted by Kawa
Trick question: repointing is a lot of work regardless, even if it's as simple a format as the GBA. Maybe you have a banked system and each bank has its own list of pointers to its constituent strings. And then in one particular bank there's a list of pointers where each individual bank's list starts, and each of those pointers, unlike the GBA's 32, is only 16 bits wide.

As for variable-width character encoding, what do you mean exactly? I feel like I have to ask just so we're all clear, considering earlier revelations.

Edit: hah, didn't see KingMike's post. Again, it's not about screen real estate, but about storage space. The former wouldn't need pointer rework but UI element resizing and moving at worst. On a typical tilemap with 8x8 pixel tiles, kanji are almost always four tiles (2x2) in size simply because they're too intricate compared to kana and romaji.

Yeah, it's about storage space all right, I just used the word "physically" poorly.

By variable-width encoding, I mean an encoding like UTF-8 or modern UTF-16 where a character may be one or several bytes long, unlike an encoding such as ASCII, ISO 8859-1 (aka Latin-1), or classic UTF-16 (aka "Unicode") where one character is always the same number of bytes (e.g. one char/wchar_t).
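The distinction can be seen by encoding a few sample characters and collecting the byte lengths that come out. This is only an illustration using Python's standard codecs. The sample characters are arbitrary picks: an ASCII letter, a Latin-1 letter, a kana, and one character outside the Basic Multilingual Plane.

```python
def widths(encoding, samples):
    """Set of per-character byte lengths an encoding produces for samples."""
    return {len(ch.encode(encoding)) for ch in samples}

# "A", e-acute, katakana U, and U+20BB7 (outside the BMP).
SAMPLES = ["A", "\u00e9", "\u30a6", "\U00020bb7"]

utf8 = widths("utf-8", SAMPLES)
utf16 = widths("utf-16-le", SAMPLES)
latin1 = widths("latin-1", SAMPLES[:2])  # Latin-1 cannot encode the other two
print(utf8, utf16, latin1)
```

A set with more than one element means the encoding is variable-width for these samples. UTF-16 only shows it once a character needs a surrogate pair.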

Kawaoneechan

Posted on 19-06-20, 22:11

Unfortunately, the number of '90s/'00s games on consoles and handhelds that use Unicode in any form since Unicode's inception (1991) can be counted on one hand, and UCS-2 (as 'classic UTF-16 aka "Unicode"' is properly called) is considered wasteful.

Fun fact about dedicating tile space to digraphs: at least one Final Fantasy fan translation that I've seen did this, with about ten character values at most being digraphs like 'll', 'il', or 'th'. I'll bet biscuits to an asskicking that this was done primarily for the visual aspect, and that the actual text was still repointed to fit.
