bboard

Main » Hacking » Why is ROM translation so (technologically) hard? » New reply

Thread review
Kawaoneechan	I'd damn well hope so considering the amount of research time I put in that post.
neologix	Can confirm firsthand what Kawa wrote about Chrono Trigger text stored in a tailored dictionary. One of the things I wrote in my nascent (and recently untouched) CT ROM parser is the ability to read the stored text data. I read about the compression/decompression method early on in Geiger's notes, and it checked out.
Kawaoneechan	Yup, this is a byuuboard alright.
wareya	formally numberless conceptual-morphemic orthographical unit of the middle empire
Kakashi	Yeah, why can't we just say singular adopted logographic Chinese character?
wareya	love 2 unironically write the string of text "discrete kanji logogram"
Kawaoneechan	What's worse is, with all this talk of specific encoding schemes... repointing to make room for trivially-encoded strings is the easiest way, even on systems with "nasty" pointers. Seriously though, there's basically only one reason for Chrono Trigger to have dictionary lookup bytes and that's "ROM is expensive".
Kakashi	[quote=sureanem]I'm not talking about any radicals or compounds, man. That sounds like something you'd want to advertise your sports drink has got lots of (or none at all, I wouldn't know - free radicals cause cancer, right?). You can fit more than two (2) alphabetic characters in the area of the sprite that would ordinarily be used to render one (1) discrete kanji logogram. Provided a kanji logogram is encoded with two bytes and an alphabetic character ordinarily would be encoded with one, this reduces the amount of space needed to encode a given sequence of alphabetic characters.[/quote]
So you basically don't know what you're talking about. Gotcha.
Kawaoneechan	Chrono Trigger's text encoding included a large swathe of dictionary lookup bytes, mapping one byte value to two or more characters, along with the general "insert name here" bytes. This would let entire parts of words like "pedia" be saved as one byte in the original text string, but decode into the full version for display in the dialogue box. The dictionary is not based on the top 30 of a given language, but tailored to the needs of the game. Unfortunately, I don't know what the Japanese version's text encoding is like, only that the names and dictionary lookups are there too, so I don't know how it handles kanji. Does CT have kanji?
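A minimal sketch of the dictionary-lookup idea described above, in Python. The byte values, the fragment table, and the name slot are all made up for illustration; this is not Chrono Trigger's actual table or Geiger's documented format, just the general shape of "one byte expands to several characters".

```python
# Hypothetical dictionary: each byte value expands to a multi-character fragment.
# These values are invented for illustration, not taken from any real game.
DICTIONARY = {
    0xA0: "pedia",
    0xA1: "the ",
    0xA2: "ing",
}

# Hypothetical "insert name here" byte, standing in for a player-chosen name.
NAME_SLOTS = {0x10: "Crono"}


def decode_text(data: bytes) -> str:
    """Expand dictionary and name bytes; treat everything else as ASCII."""
    out = []
    for b in data:
        if b in DICTIONARY:
            out.append(DICTIONARY[b])
        elif b in NAME_SLOTS:
            out.append(NAME_SLOTS[b])
        else:
            out.append(chr(b))  # assumption: plain bytes map straight to ASCII
    return "".join(out)


encoded = bytes([0x10]) + b" read " + bytes([0xA1]) + b"encyclo" + bytes([0xA0])
decoded = decode_text(encoded)
print(len(encoded), len(decoded), decoded)
```

Under these made-up tables, the stored string is shorter than what ends up in the dialogue box, which is the whole point when ROM is expensive.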
CaptainJistuce	Because translation isn't an exact science. Some statements will be shorter, others longer. Some will have to be reworked to fit the game's output, which can change the length. Some games just have insane lovecraftian nightmares where one would expect the text engine code to be. I don't know what specific games you're thinking of, so I can only speak in vague generalities.
‮strfry("emanresu")	[quote=Kawa]Unfortunately, the amount of 90s/00s games on consoles and handhelds that use Unicode in any form since Unicode's inception (1991) can be counted on one hand, and UCS-2 (as "classic UTF-16 aka "Unicode"" is properly called) is considered wasteful. Fun fact about dedicating tile space to digraphs: at least one Final Fantasy fan translation that I've seen did this, with about ten at most character values being digraphs like 'll', 'il', or 'th'. I'll bet biscuits to an asskicking that this was done primarily for the visual aspect, and that the actual text was still repointed to fit.[/quote]
Well, yeah, then, uh, case closed. Personally, I think UCS-2 should be called either Unicode or wchar_t, but "classic UTF-16" seems like a reasonable compromise to minimize confusion. So why is repointing so important, then? It's a slight improvement, but you sure could do without it if the ratios are as you say. How come it can preclude games from getting translated?
‮strfry("emanresu")	I'm not talking about any radicals or compounds, man. That sounds like something you'd want to advertise your sports drink has got lots of (or none at all, I wouldn't know - free radicals cause cancer, right?). You can fit more than two (2) alphabetic characters in the area of the sprite that would ordinarily be used to render one (1) discrete kanji logogram. Provided a kanji logogram is encoded with two bytes and an alphabetic character ordinarily would be encoded with one, this reduces the amount of space needed to encode a given sequence of alphabetic characters.
Kakashi	[quote=sureanem]You can put more than two characters in a kanji.[/quote]
You're clearly trying to compare radicals to kanji compounds, which is just hilarious scrambling considering how radicals work. This isn't Hangul, buddy. $gf tells me you have too much time on your hands.
Kawaoneechan	Unfortunately, the amount of 90s/00s games on consoles and handhelds that use Unicode in any form since Unicode's inception (1991) can be counted on one hand, and UCS-2 (as "classic UTF-16 aka "Unicode"" is properly called) is considered wasteful. Fun fact about dedicating tile space to digraphs: at least one Final Fantasy fan translation that I've seen did this, with about ten at most character values being digraphs like 'll', 'il', or 'th'. I'll bet biscuits to an asskicking that this was done primarily for the visual aspect, and that the actual text was still repointed to fit.
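For what the digraph trick looks like on the storage side, here is a rough Python sketch. The three digraphs are the ones named in the post; the byte values assigned to them and the greedy left-to-right matching are assumptions for illustration, not how any particular fan translation did it.

```python
# Hypothetical byte values for digraph tiles; only the digraphs themselves
# ('ll', 'il', 'th') come from the post above.
DIGRAPHS = {"ll": 0xF0, "il": 0xF1, "th": 0xF2}


def encode_digraphs(text: str) -> bytes:
    """Greedy left-to-right encoder: use a digraph byte when a pair matches."""
    out = bytearray()
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in DIGRAPHS:
            out.append(DIGRAPHS[pair])
            i += 2
        else:
            out.append(ord(text[i]))  # assumption: remaining text is ASCII
            i += 1
    return bytes(out)


sample = "still there"
packed = encode_digraphs(sample)
print(len(sample), len(packed))
```

With only about ten digraph values the savings stay small, which fits the guess that the trick was mostly about how the text looks and that the script still had to be repointed.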
‮strfry("emanresu")	[quote=KingMike]Also, putting two characters on one kanji is a brilliant idea, you say? You know that games with lots of kanji (or more than a handful) usually use two bytes per kanji? So how is using two bytes to represent a two-character title saving space compared to one byte for one character?[/quote]
You can put more than two characters in a kanji. Anyhow, if you're using a variable-width encoding there's not much of a point, that's true.
[quote=KingMike]Also, I thought about using two characters on tile early in my "career" but I stopped when I realized it looks like shit. (other people say they're okay with limiting it to pairs like il and ll, but I think it still looks ugly when all other characters are evenly spaced)[/quote]
Even if it's consistently kerned? It should at least be readable, and better than gargantuan inter-letter spacing.
[quote=Kawa]Trick question: repointing is a lot of work regardless, even if it's as simple a format as the GBA. Maybe you have a banked system and each bank has its own list of pointers to its constituent strings. And then in one particular bank there's a list of pointers where each individual bank's list starts, and each of those pointers, unlike the GBA's 32, is only 16 bits wide. As for variable-width character encoding, what do you mean exactly? I feel like I have to ask just so we're all clear, considering earlier revelations. Edit: hah, didn't see KingMike's post. Again, it's not about screen real estate, but about storage space. The former wouldn't need pointer rework but UI element resizing and moving at worst. On a typical tilemap with 8x8 pixel tiles, kanji are almost always four tiles (2x2) in size simply because they're too intricate compared to kana and romaji.[/quote]
Yeah, it's about storage space all right, I just used the word "physically" poorly.
By variable-width encoding, I mean an encoding like UTF-8 or modern UTF-16 where a character may be one or several bytes long, unlike an encoding such as ASCII, ISO 8859-1 (aka Latin-1), or classic UTF-16 (aka "Unicode") where one character is always the same amount of bytes (e.g. one char/wchar_t)
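To make the fixed-width versus variable-width distinction concrete, here is a small Python check using the standard codecs. The sample strings are arbitrary picks for illustration, and none of this says anything about what a given game actually uses (most use custom tables, as noted elsewhere in the thread).

```python
def byte_lengths(text: str) -> dict:
    """Return how many bytes `text` takes in a few standard encodings."""
    return {
        "utf-8": len(text.encode("utf-8")),          # variable width: 1 to 4 bytes
        "utf-16-le": len(text.encode("utf-16-le")),  # 2 bytes, or 4 with surrogates
    }


latin = byte_lengths("Chrono")      # six ASCII letters
kanji = byte_lengths("時")           # one kanji inside the Basic Multilingual Plane
astral = byte_lengths("𠮷")          # one character outside the BMP

print(latin, kanji, astral)

# Latin-1 is fixed at one byte per character, but cannot represent kanji at all.
print(len("Chrono".encode("latin-1")))
```

The ASCII text doubles in size under two-bytes-per-character UTF-16, while the BMP kanji is smaller in UTF-16 than in UTF-8. The last sample shows where "classic" two-byte UTF-16 and modern UTF-16 part ways: it needs a surrogate pair.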
Kawaoneechan	Trick question: repointing is a lot of work regardless, even if it's as simple a format as the GBA. Maybe you have a banked system and each bank has its own list of pointers to its constituent strings. And then in one particular bank there's a list of pointers where each individual bank's list starts, and each of those pointers, unlike the GBA's 32, is only 16 bits wide. As for variable-width character encoding, what do you mean exactly? I feel like I have to ask just so we're all clear, considering earlier revelations. Edit: hah, didn't see KingMike's post. Again, it's not about screen real estate, but about storage space. The former wouldn't need pointer rework but UI element resizing and moving at worst. On a typical tilemap with 8x8 pixel tiles, kanji are almost always four tiles (2x2) in size simply because they're too intricate compared to kana and romaji.
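A rough sketch of what repointing one bank involves, assuming a layout where the bank starts with a table of 16-bit little-endian offsets followed by zero-terminated strings. The bank size, the terminator, and the layout are all assumptions for illustration; real games differ, and the second-level table pointing at each bank's list is left out.

```python
import struct

BANK_SIZE = 0x4000  # assumed bank size; actual hardware and mappers vary


def build_bank(strings: list[bytes], terminator: bytes = b"\x00") -> bytes:
    """Lay out a pointer table plus strings, recomputing every 16-bit offset."""
    table_size = 2 * len(strings)
    offsets = []
    body = bytearray()
    for s in strings:
        offsets.append(table_size + len(body))  # offset relative to bank start
        body += s + terminator
    if table_size + len(body) > BANK_SIZE:
        raise ValueError("translated strings no longer fit in this bank")
    table = struct.pack(f"<{len(offsets)}H", *offsets)
    return table + bytes(body)


def read_string(bank: bytes, index: int, terminator: bytes = b"\x00") -> bytes:
    """Follow the pointer for `index` and read up to the terminator."""
    (offset,) = struct.unpack_from("<H", bank, 2 * index)
    end = bank.index(terminator, offset)
    return bank[offset:end]


bank = build_bank([b"Hi", b"Hello there"])
print(read_string(bank, 1))
```

Changing the length of any string shifts every offset after it, which is why even this simple layout means regenerating the whole table. The `ValueError` branch is the case where text has to move to another bank.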
KingMike	Also, putting two characters on one kanji is a brilliant idea, you say? You know that games with lots of kanji (or more than a handful) usually use two bytes per kanji? So how is using two bytes to represent a two-character title saving space compared to one byte for one character? Also, I thought about using two characters on tile early in my "career" but I stopped when I realized it looks like shit. (other people say they're okay with limiting it to pairs like il and ll, but I think it still looks ugly when all other characters are evenly spaced) Of course using VWF would fix the looking like shit part, but then you're back to the first problem. I've only done that with the RPG Maker games where I'm FORCED to fit text within the available space (due to the specific nature that the games are to be editable in-game and thus compatible with the in-game engine. Also because the menu text is embedded within some custom programming language. And I sure don't want to spend more time fully reverse-engineering that language so I can write a re-encoder. See, that's another thing that comes up in SNES games, I heard the Romancing SaGa games did that.)
‮strfry("emanresu")	[quote=Kawa]Why would you not be able to move the pointers around? Some systems may have bankswitching limitations but that hardly means you can't do it. Also please address the part where your Japanese example sentence takes more bytes of storage than the English equivalent.[/quote]
You might have the pointers indirectly computed through some hell of lookup tables and offsets, so while technically possible it'd be a lot of work. I reckoned the developers would use something like old-style UTF-16 where everything is two bytes for simplicity. If they use a var-width character encoding, then getting the translations to fit without relocation shouldn't really be a big problem, right?
Kawaoneechan	Why would you not be able to move the pointers around? Some systems may have bankswitching limitations but that hardly means you can't do it. Also please address the part where your Japanese example sentence takes more bytes of storage than the English equivalent.
‮strfry("emanresu")	[quote=CaptainJistuce]" So why isn't it used more? This site suggests it's used in RPGs, but why never in fan translations?" Because it is a fairly limited technique, and if you're going to be making new code anyways, either code a better compression system* or relocate the text to somewhere you can actually fit what you need. It does get used when a game already implements it... sometimes. I think I've seen fan translations that removed the dual-tile encoding to add in something better. It turns out that people capable of doing fan translations tend to also be fairly well-versed in what you can feasibly do with a system while it is running the game they're working on. Yes, there are exceptions where the script is stored in plain ASCII or the text to be translated is all present as uncompressed images, but... a lot of the time it requires a good deal of assembly and platform knowledge to even extract the script in the first place, and a good deal more to get it back into the game later. Sometimes you even have to rework chunks of the game engine (I recall one title where the text was printed in vertical text boxes on the right side of the screen, and the hackers had to completely replace the text drawing code to get something that was useful for English). Your fundamental assumption that rawmhaxxors don't know what they're doing and are leaving even the most basic optimizations on the table is... questionable. *Exception for when RAM limitations or available processing power prevent you from using something better.[/quote]
Well, all that makes sense. But how come it never gets used in games where you can't move the pointers around? I get that there are better compression methods, but those aren't used either. For Nintendo 64, the strings usually have printf-esque format specifiers and are stored uncompressed, so I'd imagine those issues you outline wouldn't be an issue for N64, PSX, and later.
Not saying they're leaving optimizations on the table, of course they're not. There must be a good reason why it's not a good way to go at things.
