Announcing the bsnes history kit

Near

Posted on 19-03-17, 01:47
Hi.

Burned-out Genius Developer
Post: #18 of 51
Since: 10-30-18

Last post: 1211 days
Last view: 1134 days

For whatever it's worth, I've no qualms if you wanna just extract the RARs, and recompress them with XZ or whatever. Worst case, throw the RAR inside of the new XZ archive that ends up on Git if people really desperately want a bit-for-bit identical historical copy.

Screwtape

Posted on 19-03-18, 09:11

Full mod

Post: #164 of 443
Since: 10-30-18

Last post: 892 days
Last view: 89 days

When I started this, I was quite happy with the unofficial git repo, I just wanted to try and extend it backward without too much manual labour. However, I've come to appreciate the usefulness of original, unmodified artefacts, even when (especially when) a human-edited summary is available. For example, the datestamps in the unofficial git repo are about right, but usually there's a few hours between byuu posting a WIP and me waking up and committing it, and sometimes if I was busy doing something else I have a stack of WIPs built up and commit them all, one after another on the same day. Having the original WIP tarballs available means I can find out the *actual* time they were made, instead of hand-waving.
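A minimal sketch of reading dates out of a tarball in Python. It assumes each member's recorded mtime survived in the archive, and that the newest one is a fair proxy for when the WIP was packaged. The helper names and the demo archive are made up for illustration; this is not the history kit's actual code.

```python
import io
import tarfile
from datetime import datetime, timezone


def newest_member_time(tar: tarfile.TarFile) -> datetime:
    """Return the latest modification time recorded for any member."""
    latest = max(member.mtime for member in tar.getmembers())
    return datetime.fromtimestamp(latest, tz=timezone.utc)


def build_demo_tarball() -> bytes:
    """Build a tiny in-memory tarball with two made-up files and mtimes."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, mtime in [("src/main.cpp", 1_000_000_000),
                            ("src/cpu.cpp", 1_000_003_600)]:
            data = b"// placeholder\n"
            info = tarfile.TarInfo(name)
            info.size = len(data)
            info.mtime = mtime
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


# Read the demo archive back and ask for its newest timestamp.
with tarfile.open(fileobj=io.BytesIO(build_demo_tarball())) as demo:
    stamp = newest_member_time(demo)
print(stamp.isoformat())
```

For a real WIP you would open the tarball by path with `tarfile.open(path)` instead of building one in memory.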

Normally when I add a WIP to the unofficial repo, I download it, commit the changes, then move the tarball to the trash. However, I don't empty the trash very often, so I looked in there and discovered I had copies of most WIPs and releases dating back to the first version of higan (v091), so I decided to dump those artefacts into the mix as well. If you go look at the bsnes-history repo now, you'll see there's tags from v005 up to v107 in the same repo. No changelogs for most of them, of course, but better than nothing.

I say "most WIPs and releases"; comparing the list of tarballs to the commits in the unofficial higan repo, I see I'm missing tarballs for v100r12-r16, v101, and v101r01-r12. No idea what happened there, but if somebody happens to have them lying around (that was 2016-07 to 2016-08, if it helps) I'd love to get a copy.

The ending of the words is ALMSIVI.

Screwtape

Posted on 19-03-19, 10:13

Full mod

Post: #167 of 443
Since: 10-30-18

Last post: 892 days
Last view: 89 days

Having incorporated just about all the historical archives I can find, I've turned my attention to changelogs.

I discovered the tukuyomi collection had a changelog for v045r09 as a separate HTML file for some reason, beside the rest of the changelogs. I wrote a hacky HTML-to-text converter to convert it to the standard changelog format and taught my pile of scripts how to automatically do the conversion and sort it into the right place.

Next, I started poking around at the "bsnes Dev Talk" forum on the old ZBoard. It turns out the forum was locked, so everything's pretty much just as it was a decade ago. I used wget to grab the HTML for every thread in the forum, wrote a script to scrape the forum HTML and spit out each post into a separate HTML file, neatly labelled with its date and author, and used my HTML-to-text script to convert those HTML files into something a little more human-readable/searchable.

From ~240 thread pages, I got ~5000 individual posts. The tukuyomi collection already covers release changelogs, so I'm mostly interested in WIPs. From the 5000 posts, I grabbed ones by byuu that mentioned "wip" (case-insensitive), which brought the number down to ~140. Of those 140 posts:

- A few were public WIPs, so they mentioned an actual specific version number, like "v034 wip05"
- Most just said "New WIP", so I just blindly guessed that the first such post in a thread was r01, the second was r02, etc.
- Sometimes, I'm pretty confident my numbering is correct. For example, there's a post in the v039 thread where the changelog mentions "changes since wip 19", so I guessed that post described wip 20... and sure enough, there were 19 "new WIP" posts in the thread up to that point.
- Sometimes, I'm pretty confident I got it wrong. For example, there's two "new WIP" posts in the v037 thread, then a post that mentions changes that might have landed in wip 4 or 5, so three WIPs are missing but I have no idea which ones.
- A lot of these could probably be sorted out if I had the archives corresponding to each release. For example, if a changelog says "added multitap support", then whichever archive adds a file like "multitap.cpp" is probably the one.
- A bunch of posts were just talking about WIPs, rather than announcing them.
- A few posts just mentioned "wiping".
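The filtering and blind numbering described above could be sketched roughly like this in Python. The `Post` record, the sample posts, and the use of the thread name as a version prefix are all made up for illustration. Note that this sketch uses a slightly stricter pattern than a plain case-insensitive search for "wip", so that words like "wiping" are skipped.

```python
import re
from dataclasses import dataclass

# Match "wip" or "wips" as a word, or followed by digits ("wip05"), in any
# case, but not inside longer words such as "wiping" or "swipe".
WIP_PATTERN = re.compile(r"\bwips?(?![a-z])", re.IGNORECASE)


@dataclass
class Post:
    thread: str
    author: str
    body: str


def wip_posts(posts, author="byuu"):
    """Keep posts by `author` whose body mentions a WIP."""
    return [p for p in posts if p.author == author and WIP_PATTERN.search(p.body)]


def guess_revisions(posts):
    """Blindly number each thread's WIP posts r01, r02, ... in posting order."""
    counters = {}
    labelled = []
    for post in posts:
        counters[post.thread] = counters.get(post.thread, 0) + 1
        labelled.append((f"{post.thread}r{counters[post.thread]:02d}", post))
    return labelled


# Made-up sample posts, in posting order.
sample = [
    Post("v039", "byuu", "New WIP up."),
    Post("v039", "other_user", "Thanks for the wip!"),
    Post("v039", "byuu", "I ended up wiping my build directory."),
    Post("v039", "byuu", "New WIP. Fixed the bug reported above."),
]
labels = [label for label, _ in guess_revisions(wip_posts(sample))]
print(labels)
```

The numbering is only as good as the assumption that every WIP was announced exactly once in the thread, which is the weak point noted in the list above.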

Overall, I found 97 new changelogs from v030r03 to v042r05(?) inclusive. My next step is to figure out how to automate converting these files into the changelog format the rest of the scripts expect.

Duck Penis

Posted on 19-03-19, 15:30

Stirrer of Shit
Post: #98 of 717
Since: 01-26-19

Last post: 1555 days
Last view: 1553 days

Posted by Screwtape
> - Sometimes, I'm pretty confident I got it wrong. For example, there's two "new WIP" posts in the v037 thread, then a post that mentions changes that might have landed in wip 4 or 5, so three WIPs are missing but I have no idea which ones.

Four by my count:

Posted by https://board.zsnes.com/phpBB3/viewtopic.php?p=182489#p182489


~\bsnes_v037_wip06>upx -9 bs1.exe
                       Ultimate Packer for eXecutables
                          Copyright (C) 1996 - 2008
UPX 3.03w       Markus Oberhumer, Laszlo Molnar & John Reiser   Apr 27th 2008

        File size         Ratio      Format      Name
   --------------------   ------   -----------   -----------
   1063936 ->    348160   32.72%    win32/pe     bs1.exe

Packed 1 file.

~\bsnes_v037_wip06>upx --brute bs2.exe
                       Ultimate Packer for eXecutables
                          Copyright (C) 1996 - 2008
UPX 3.03w       Markus Oberhumer, Laszlo Molnar & John Reiser   Apr 27th 2008

        File size         Ratio      Format      Name
   --------------------   ------   -----------   -----------
   1063936 ->    306688   28.83%    win32/pe     bs2.exe

Packed 1 file.

Also, I don't think all of the WIPs have neat, sequential numbers like that. Take "bsnes_v039_wip20090307.exe", for instance.

It's a damn shame everything was on byuu.org and that wasn't archived. Google code doesn't have changelogs, only binaries.

Maybe it would be of interest to try and match the dates of news published on https://web.archive.org/web/20071002155056/http://byuu.cinnamonpirate.com/?page=bsnes_news to releases?

There was a certain photograph about which you had a hallucination. You believed that you had actually held it in your hands. It was a photograph something like this.

Screwtape

Posted on 19-03-21, 14:35

Full mod

Post: #172 of 443
Since: 10-30-18

Last post: 892 days
Last view: 89 days

Posted by sureanem
> Also, I don't think all of the WIPs have neat, sequential numbers like that. Take "bsnes_v039_wip20090307.exe", for instance.

And of course none of the WIP announcement messages were posted on 2009-03-07. The changelog I've labelled "v039r22" was posted on 2009-03-05, and v040 was posted on 2009-03-09, so who knows what source code that binary corresponds to.
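That kind of lookup amounts to finding which announcements fall either side of a date. A small Python sketch, using only the two dates mentioned above; the function name and table layout are hypothetical, and a real table would hold every known changelog sorted by date.

```python
from bisect import bisect_right
from datetime import date

# Announcement dates taken from the paragraph above, sorted by date.
announcements = [
    (date(2009, 3, 5), "v039r22"),
    (date(2009, 3, 9), "v040"),
]


def bracket(announcements, when):
    """Return the (previous, next) labels around `when`; None where absent."""
    dates = [d for d, _ in announcements]
    index = bisect_right(dates, when)
    previous = announcements[index - 1][1] if index > 0 else None
    following = announcements[index][1] if index < len(announcements) else None
    return previous, following


# The binary's filename suggests 2009-03-07.
result = bracket(announcements, date(2009, 3, 7))
print(result)
```

This only narrows down the window; it cannot say which source tree the binary was actually built from.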

Still, I'm pleased that I could figure that out with a few seconds' searching, instead of just sighing helplessly and wondering. "It's better than nothing" is the preservationist's eternal refrain; it would have been grand if I (or somebody else) had thought to collect all these things at the time, and not as much has survived as I would have hoped, but more has survived than I feared, and focussing on the positive is how we all make it to the end of the day without becoming gibbering wrecks.

> It's a damn shame everything was on byuu.org and that wasn't archived. Google code doesn't have changelogs, only binaries.

Google Code? Oh wow, there's a bunch of downloadables there, but it doesn't go back very far and doesn't have many WIPs; I'm guessing the tukuyomi collection already has all that stuff, but I guess I need to double-check just to be sure. Thanks for the suggestion.

> Maybe it would be of interest to try and match the dates of news published on https://web.archive.org/web/20071002155056/http://byuu.cinnamonpirate.com/?page=bsnes_news to releases?

That would be pretty cool - again, it's not guaranteed that every news post is a WIP, but some of them mention a version number in the text, and some have screenshots that mention a version number in the title-bar, which is pretty great. I would love to download those pages and add them to the History Kit to be automatically extracted and formatted into git commit messages, but annoyingly those scrapes were done by Alexa, and the Internet Archive page for that scrape says "this data is currently not publicly accessible". Even if it were, I suspect the crawl data is divided up by time, not by site, so it'd be one HTML file from cinnamonpirate.com and a million pages from random other sites.

Definitely something I'd like to do, but would probably need to be done manually, so I'll put that in the "later" basket.

Following up from my previous post, I've pushed a new version of the bsnes-history repo that includes all the changelogs I found on the ZBoard. It's not super-clean, since these changelogs weren't designed to be self-contained, plain-text commit messages - embedded images look weird, and some commit messages have some changelog content along with a bunch of byuu's replies to other people. Again, manual cleanup could make things a bunch nicer, but I'm going to put my effort into more easily automated things for a little while longer.

In particular, the next item on the list is the bsnes thread - before bsnes had a subforum on the ZBoard, it had a single thread that grew so long it made the whole site flaky and unreliable, so it was archived for posterity and deleted from the database. The archive has condensed the three years of posts into a single 13MB HTML file, but it seems to have preserved enough of the phpBB markup that it shouldn't be too hard to extract post-dates and authors and find all the byuu posts mentioning "WIP" again.

Just looking at the thread myself, the earliest version mentioned is v011, which is about the oldest I've seen outside of the tukuyomi collection; that Wayback Machine crawl's first news-post is v014. Again, though, I expect it to be quite difficult to come up with version numbers to assign to any of the posts.

Duck Penis

Posted on 19-03-21, 15:24

Stirrer of Shit
Post: #106 of 717
Since: 01-26-19

Last post: 1555 days
Last view: 1553 days

Amazing work!

Posted by Screwtape
> Google Code? Oh wow, there's a bunch of downloadables there, but it doesn't go back very far and doesn't have many WIPs; I'm guessing the tukuyomi collection already has all that stuff, but I guess I need to double-check just to be sure. Thanks for the suggestion.

That has been archived to GitHub too. I did some quick checking and it looked to have everything, but might be best to go over it properly just to check.

Here's a neat bash oneliner I feel the compulsion to share:

cat <(sort file1.txt|uniq) <(sort file2.txt|uniq) <(sort file2.txt|uniq)|sort|uniq -c|sort -g|grep '^[^0-9]*2'

Replace the 2 with 1, 2 or 3 depending on which operation you want (only in file1, only in file2, or in both, respectively).
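For comparison, the same three-way split can be sketched in Python with sets. The function name and the version lists standing in for file1.txt and file2.txt are made up.

```python
def compare_lines(lines_a, lines_b):
    """Split two collections of lines into only-A, only-B, and common lists."""
    a, b = set(lines_a), set(lines_b)
    return sorted(a - b), sorted(b - a), sorted(a & b)


# Made-up version lists standing in for file1.txt and file2.txt.
tarballs = ["v100r11", "v100r17", "v102"]
commits = ["v100r11", "v100r12", "v100r17", "v102"]

only_tarballs, only_commits, both = compare_lines(tarballs, commits)
print(only_commits)
```

Like the one-liner, this ignores duplicate lines within a file and does not preserve the original order.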

> I would love to download those pages and add them to the History Kit to be automatically extracted and formatted into git commit messages, but annoyingly those scrapes were done by Alexa, and the Internet Archive page for that scrape says "this data is currently not publicly accessible". Even if it were, I suspect the crawl data is divided up by time, not by site, so it'd be one HTML file from cinnamonpirate.com and a million pages from random other sites.

> Definitely something I'd like to do, but would probably need to be done manually, so I'll put that in the "later" basket.

Try scraping directly from IA instead. You can get the raw HTML (as the crawler saw it) by appending id_ to the date, as in https://web.archive.org/web/20071007054817id_/http://byuu.cinnamonpirate.com/?page=bsnes_news. Then use the file list from e.g. https://web.archive.org/web/*/byuu.cinnamonpirate.com/*. No need to follow the links, so that makes things easier. You'd need to set a cutoff date also, since they changed it to a parking page and some of that got archived.
It should be quick work with a bash oneliner.
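A rough Python sketch of the URL construction and the cutoff filter, with no network access. The first capture is the one linked above; the second capture, the cutoff value, and the function names are made up for illustration.

```python
def raw_capture_url(timestamp: str, original_url: str) -> str:
    """Build a Wayback Machine URL that serves the capture as crawled."""
    return f"https://web.archive.org/web/{timestamp}id_/{original_url}"


def before_cutoff(captures, cutoff: str):
    """Keep (timestamp, url) pairs captured strictly before `cutoff`.

    Timestamps are 14-digit YYYYMMDDhhmmss strings, so plain string
    comparison orders them chronologically.
    """
    return [(ts, url) for ts, url in captures if ts < cutoff]


# First entry is the capture linked above; the second entry and the cutoff
# are invented to show a later parking-page capture being dropped.
captures = [
    ("20071007054817", "http://byuu.cinnamonpirate.com/?page=bsnes_news"),
    ("20100101000000", "http://byuu.cinnamonpirate.com/"),
]
urls = [raw_capture_url(ts, url)
        for ts, url in before_cutoff(captures, "20090101000000")]
print(urls)
```

The list of captures would still have to come from the Wayback Machine's file listing; that step is left out here.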

> Following up from my previous post, I've pushed a new version of the bsnes-history repo that includes all the changelogs I found on the ZBoard. It's not super-clean, since these changelogs weren't designed to be self-contained, plain-text commit messages - embedded images look weird, and some commit messages have some changelog content along with a bunch of byuu's replies to other people. Again, manual cleanup could make things a bunch nicer, but I'm going to put my effort into more easily automated things for a little while longer.

> In particular, the next item on the list is the bsnes thread - before bsnes had a subforum on the ZBoard, it had a single thread that grew so long it made the whole site flaky and unreliable, so it was archived for posterity and deleted from the database. The archive has condensed the three years of posts into a single 13MB HTML file, but it seems to have preserved enough of the phpBB markup that it shouldn't be too hard to extract post-dates and authors and find all the byuu posts mentioning "WIP" again.

> Just looking at the thread myself, the earliest version mentioned is v011, which is about the oldest I've seen outside of the tukuyomi collection; that Wayback Machine crawl's first news-post is v014. Again, though, I expect it to be quite difficult to come up with version numbers to assign to any of the posts.

Have you compared your results to tukuyomi's changelogs? They seem to mention some WIP releases.
http://black-ship.net/~tukuyomi/snesemu/misc/bsnes_changelog_outdated.html

Screwtape

Posted on 19-03-23, 07:14 (revision 1)

Full mod

Post: #177 of 443
Since: 10-30-18

Last post: 892 days
Last view: 89 days

Posted by sureanem
> Here's a neat bash oneliner I feel the compulsion to share:

You may be interested in the comm command ;)

> Try scraping directly from IA instead. You can get the raw HTML (as the crawler saw it) by appending id_ to the date, as in https://web.archive.org/web/20071007054817id_/http://byuu.cinnamonpirate.com/?page=bsnes_news. Then use the file list from e.g. https://web.archive.org/web/*/byuu.cinnamonpirate.com/*. No need to follow the links, so that makes things easier. You'd need to set a cutoff date also, since they changed it to a parking page and some of that got archived.

That's a neat trick, I'll have to write that down. Thanks!

> Have you compared your results to tukuyomi's changelogs? They seem to mention some WIP releases.
> http://black-ship.net/~tukuyomi/snesemu/misc/bsnes_changelog_outdated.html

My results are built from tukuyomi's changelogs. Not that *specific* file, since it's labelled "outdated", but bsnes_changelog.txt, higan_changelog.txt, and the extra v045r09 changelog are all there.

I've started digging into the archived bsnes thread, following the same process I used on the ZBoard, but it's more difficult than I expected. phpBB markup has never been particularly semantic or machine-readable, but the ZBoard famously uses the Hermes theme and hasn't been updated in ages, so I assumed that the code I wrote for the current board would work just as easily on the archive.

Turns out, the bsnes_thread.zip archive seems to have been made with the old subsilver theme, and uses much stupider markup. For example, where Hermes wraps bold text with , which is easy enough to detect and handle, subsilver seems to wrap bold text with , so apparently now I need some level of CSS parser to properly understand the content.

Hermes wraps each post in <div class="postbody">, and fills it with text, other block elements (like tables and code-blocks) and <br>s to simulate paragraphs, so I had to teach Python how CSS block layout works to get things to look right. However, subsilver wraps each post in <span class="postbody"> and still uses <br>s. While a <br> ends a *layout* block, it is not technically a block element, and so it can go inside inline elements like <span>. If a <span> were cut short by the end of a block element, like </div> or , the span would be over and we could move on, but when it's cut short by a <br> the span needs to continue on afterward.

Luckily, I only need to parse a few pages, rather than all the Internet, so I don't need the full HTML5 parsing algorithm... but it's still more complex than I'd hoped.

EDIT: And there's no way to distinguish user signatures from the post body! GAR!

Duck Penis

Posted on 19-03-23, 13:11

Stirrer of Shit
Post: #119 of 717
Since: 01-26-19

Last post: 1555 days
Last view: 1553 days

Why not just go all the way and pull in html5lib or whatever the other parser was?

As for the signatures, it was done in one go, so everyone will have the same signature for all their posts. So for all posts containing ^_________________$, find the longest common suffix for each user, and if it includes the final set of underscores, save that as signature.
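A minimal Python sketch of that longest-common-suffix idea, working line by line. It assumes posts are available as (author, text) pairs and that any line made solely of underscores is the separator; both the data layout and the function names are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: a signature separator is a line made only of underscores.
SEPARATOR = re.compile(r"_{5,}")


def common_suffix(line_lists):
    """Return the longest run of trailing lines shared by every list."""
    suffix = []
    for lines in zip(*(reversed(lines) for lines in line_lists)):
        if len(set(lines)) != 1:
            break
        suffix.append(lines[0])
    return suffix[::-1]


def find_signatures(posts):
    """Map each author to their guessed signature lines (separator included)."""
    by_author = defaultdict(list)
    for author, body in posts:
        lines = body.splitlines()
        if any(SEPARATOR.fullmatch(line) for line in lines):
            by_author[author].append(lines)
    signatures = {}
    for author, line_lists in by_author.items():
        suffix = common_suffix(line_lists)
        separators = [i for i, line in enumerate(suffix)
                      if SEPARATOR.fullmatch(line)]
        if separators:
            # Drop anything before the last separator: that is post text
            # which merely happens to repeat, not part of the signature.
            signatures[author] = suffix[separators[-1]:]
    return signatures


# Made-up posts for illustration.
rule = "_" * 17
posts = [
    ("alice", f"First post\n{rule}\nalice's sig"),
    ("alice", f"Second post\nmore text\n{rule}\nalice's sig"),
    ("bob", "No signature here"),
]
found = find_signatures(posts)
print(found)
```

Once a signature is known, trimming it from a post is a matter of checking whether the post's last lines match it.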

Also, there are some issues with character encoding. Not sure if they come into play for the changelogs, but might be worth keeping in mind. Try doing a CTRL-F for "ƒ," for instance.

Near

Posted on 19-03-23, 14:27

Burned-out Genius Developer
Post: #19 of 51
Since: 10-30-18

Last post: 1211 days
Last view: 1134 days

As a preservationist, I never would have imagined people spending so much time trying to preserve my own stuff. You live long enough, and you get to experience everything from both sides, I guess.

> it was archived for posterity and deleted from the database

Oooooh, is that what happened to it? That would be super fun to read through again.

> Just looking at the thread myself, the earliest version mentioned is v011

Odd, I'm pretty sure I had been posting most revisions publicly somewhere. Especially since v005 onward was archived by tukuyomi.

It's a shame Deathlike had to go and delete all my posts from the Snes9X phpBB2 forum. There was a lot of interesting technical content that probably would have helped any future SNES emudevs fix common problems with new emulators. Those kinds of notes are always the first thing I go to when I implement new systems myself.

> but the ZBoard famously uses the Hermes theme

I guess whoever did the upgrade really liked the theme on my board.

JieFK

Posted on 19-03-24, 15:14

Post: #4 of 10
Since: 10-29-18

Last post: 1807 days
Last view: 1567 days

(Me, once known as tukuyomi)

Screwtape, you really did a great job with your bsnes history kit. I'm really happy that my archives helped you with this project :)
I do not know how much I can help you, but I'll try to follow this thread more carefully now.
The only information I can provide now is that I stopped archiving higan stuff a few weeks (maybe months) after you started your unofficial higan repository, so there's probably nothing lost in between ;)
I decided to stop archiving byuu's work -and snesemu too- because your repo is much cleaner than my pile of files randomly saved in some folder.

Byuu, I don't want to sound like a visionary, but I knew that time would come someday: the archivist's work archived, that is.
I was there the very first day of bsnes. And, even if I lacked (and still lack) the technical background, I followed the bsnes development with enthusiastic interest.
Alas, the thing I mostly regret now is that when I started archiving bsnes, it was already too late; some files were already fourofoured.
On the other hand, the thing I'm proud of is the bsnes changelog, which for the most part is quotes from your posts on the ZSNES board and your own boards
(and I'm ashamed to tell you now that every time I was adding stuff to it, I was a bit afraid that one day you would scold me and tell me to stop transcribing everything you said into a txt file :p )
Even if a few things are missing, I think it's a good piece of SNES emulation scene history.

Anyway, I'm really happy that my contribution now serves as part of this great project, bsnes history.

Screwtape

Posted on 19-03-25, 07:04

Full mod

Post: #178 of 443
Since: 10-30-18

Last post: 892 days
Last view: 89 days

Posted by sureanem
> Why not just go all the way and pull in html5lib or whatever the other parser was?

While I understand that probably nobody else will ever run these scripts, I want other people to be able to easily change the editorial decisions I've made in bsnes-history-kit and make their own bsnes-history repo. As part of that, I've tried to reduce dependencies to a minimum... or at least, to only the most useful ones.

html5lib would do very nicely here, but Python dependency management is (to put it nicely) a ballache. If there were an HTML5-compatible parser in the Python standard library, I'd absolutely use it, but I've got an 80% solution in 20% of the code, so I'm happy with that.

> As for the signatures, it was done in one go, so everyone will have the same signature for all their posts. So for all posts containing ^_________________$, find the longest common suffix for each user, and if it includes the final set of underscores, save that as signature.

Wait, are you saying that every signature starts with a fixed number of underscores? That'd make it easy to trim them off... but I guess I'm only really interested in byuu's posts, and he didn't use a signature, so it's not really a problem. Thanks for the info, though.

> Also, there are some issues with character encoding. Not sure if they come into play for the changelogs, but might be worth keeping in mind. Try doing a CTRL-F for "ƒ," for instance.

Is that when you open bsnes_thread.html in a browser, or are you looking at the bsnes-history repo somewhere? I can't see any "ƒ" anywhere.

Posted by byuu
> > it was archived for posterity and deleted from the database

> Oooooh, is that what happened to it? That would be super fun to read through again.

It super-is! You can get the zip file from gitlab, or I believe tukuyomi's collection is back up again, so you can get it from there as well.

My nick on the ZBoard was Thristian ("Screwtape" was already taken), and apparently my first post was encouraging you to make your game database identify games by MD5 instead of CRC32. ;)

Also, until now I didn't appreciate how much effort FitzRoy put into testing things in the early days of bsnes.

> > Just looking at the thread myself, the earliest version mentioned is v011

> Odd, I'm pretty sure I had been posting most revisions publicly somewhere. Especially since v005 onward was archived by tukuyomi.

It looks like the thread started off as other people talking about this new development in SNES emulation, and then you showed up to join in the conversation.

If I had to guess, I suspect the oldest surviving changelogs are the byuu.cinnamonpirate.com archives on the Wayback Machine. Luckily, I don't think *those* are going away any time soon, so I don't feel rushed to collect them.

Posted by JieFK
> Screwtape, you really did a great job with your bsnes history kit. I'm really happy that my archives helped you with this project :)

Thank you so much for actually collecting all that stuff to make it possible!

> I decided to stop archiving byuu's work -and snesemu too- because your repo is much cleaner than my pile of files randomly saved in some folder.

I mentioned it above, but this project has really made me appreciate the importance of primary sources and artefacts. If all the source material was preserved, anybody could come along later and reformat and edit it together, like I've done with bsnes-history-kit. With a pre-editorialised archive like the unofficial higan repo, you can never necessarily recreate the original artefacts.

There's been a bunch of things I could have archived over the years and decided not to (things like bass and beat and treble and other assorted tools) because I didn't want to commit to maintaining a full repo like I do for higan. In retrospect, I really wish I'd just saved them all and stuck them in a folder, because even if I didn't get around to making a repo maybe somebody else could have.

-----

Anyway, I have now added changelogs from the bsnes thread. As with the changelogs from the bsnes Dev Talk forum, there's a few WIPs with actual hard version numbers, but most of them are wild conjecture. Also, from context it seems like most actual announcements were being made somewhere else, and (especially early on) there's only occasional references like "thanks for that report, it's fixed in the latest WIP", so some of these "changelogs" might not even correspond to actual releases at all. I guess I'm kind of pinning my hopes on the Wayback Machine providing dating material to clear things up.

CaptainJistuce

Posted on 19-03-25, 07:54

Custom title here

Post: #357 of 1151
Since: 10-30-18

Last post: 22 days
Last view: 7 hours

Posted by Screwtape
> Posted by byuu
> > > it was archived for posterity and deleted from the database

> > Oooooh, is that what happened to it? That would be super fun to read through again.

> It super-is! You can get the zip file from gitlab, or I believe tukuyomi's collection is back up again, so you can get it from there as well.

> My nick on the ZBoard was Thristian ("Screwtape" was already taken), and apparently my first post was encouraging you to make your game database identify games by MD5 instead of CRC32. ;)

I was (and still am) Gil_Hamilton. My first post in that thread was ... laughing at the ATX1 power supply spec.

--- In UTF-16, where available. ---

Duck Penis

Posted on 19-03-25, 12:55

Stirrer of Shit
Post: #123 of 717
Since: 01-26-19

Last post: 1555 days
Last view: 1553 days

Posted by Screwtape
> html5lib would do very nicely here, but Python dependency management is (to put it nicely) a ballache. If there were an HTML5-compatible parser in the Python standard library, I'd absolutely use it, but I've got an 80% solution in 20% of the code, so I'm happy with that.

Why do you need an HTML5-compatible parser to parse HTML4?

> Wait, are you saying that every signature starts with a fixed number of underscores?

Yes.

> Is that when you open bsnes_thread.html in a browser, or are you looking at the bsnes-history repo somewhere? I can't see any "ƒ" anywhere.

In the browser. When I copy paste the raw entities into the file and paste them into the first result for "HTML entity decoder," it's normal. So probably just a misdeclared encoding somewhere. Might be good to fix though so the errors don't propagate up the chain.

> There's been a bunch of things I could have archived over the years and decided not to (things like bass and beat and treble and other assorted tools) because I didn't want to commit to maintaining a full repo like I do for higan. In retrospect, I really wish I'd just saved them all and stuck them in a folder, because even if I didn't get around to making a repo maybe somebody else could have.

What about http://black-ship.net/~tukuyomi/snesemu/tools/ ?

> I guess I'm kind of pinning my hopes on the Wayback Machine providing dating material to clear things up.

> If I had to guess, I suspect the oldest surviving changelogs are the byuu.cinnamonpirate.com archives on the Wayback Machine. Luckily, I don't think *those* are going away any time soon, so I don't feel rushed to collect them.

Ask byuu to ask them to make their scrapes of byuu.org public or to release them to him privately. Unless they deleted everything after he requested they stop scraping, but I don't think it works that way.
The scrapes of cinnamonpirate are kind of spotty, and I think everything that wayback has downloaded or seen from there is also in tukuyomi's collection.

Screwtape

Posted on 19-03-25, 14:14

Full mod

Post: #179 of 443
Since: 10-30-18

Last post: 892 days
Last view: 89 days

Posted by sureanem
> Why do you need an HTML5-compatible parser to parse HTML4?

Because this content was designed to appear in a browser, so if I want to understand it I need to parse it the same way a browser does, and browsers use the HTML5 parsing algorithm.

Also, I'm not sure I've ever heard of anybody writing an HTML4 parser. Lots of HTML5 (uh, "HTML Living Standard") parsers, because HTML5 specifies how to make sense of real-world web-pages, but previous HTML specifications weren't that relevant to the real world.

> In the browser. When I copy paste the raw entities into the file and paste them into the first result for "HTML entity decoder," it's normal. So probably just a misdeclared encoding somewhere. Might be good to fix though so the errors don't propagate up the chain.

OK, *where* in the browser, so I can see how it appears for me, and check how my pipeline has handled it. Also, which browser are you using?

> What about http://black-ship.net/~tukuyomi/snesemu/tools/ ?

A lot of good stuff, but maybe if I'd saved everything I'd find something that was missing from this collection, or we'd have confirmation that that's all there was.

Duck Penis

Posted on 19-03-25, 15:47

Stirrer of Shit
Post: #125 of 717
Since: 01-26-19

Last post: 1555 days
Last view: 1553 days

Posted by Screwtape
> Because this content was designed to appear in a browser, so if I want to understand it I need to parse it the same way a browser does, and browsers use the HTML5 parsing algorithm.

> Also, I'm not sure I've ever heard of anybody writing an HTML4 parser. Lots of HTML5 (uh, "HTML Living Standard") parsers, because HTML5 specifies how to make sense of real-world web-pages, but previous HTML specifications weren't that relevant to the real world.

The archive was made from a HTML4 page. Won't Python's built-in parser be good enough for any page that validates as HTML?

> OK, *where* in the browser, so I can see how it appears for me, and check how my pipeline has handled it. Also, which browser are you using?

Line 81645 of bsnes_thread.html. Firefox 60.6.0esr (64-bit) from Debian stable repos (60.6.0esr-1~deb9u1).


          <tr>
            <td colspan="2"><span class="postbody">２コントローラのコネクタには、<br>
            対応パチンココントローラ以外、接続しないで下さい。<br>
            <br>
            <br>
            Hmmm...My translation is like this:<br>
            <br>
            " Do not connect anything else other than the supported Pachinko controller into the second controller port."<br>
            (more accurate)<br>
            <br>
            When it comes to Kanji...my Japanese is never rusty.<br>
            And,in the future,could you please use Unicode? <img src="images/smiles/icon_biggrin.gif" alt="Very Happy" border="0"><br>
            <br>
            <br>
            BTW,the two profile system of 0.17 wip7 is a great improvement.Much easier to set up and more intuitive.</span><span class="gensmall"><br>
            <br>
            Last edited by kick on Tue Sep 12, 2006 9:29 pm; edited 1 time in total</span></td>
          </tr>

(Note, the forum doesn't escape the entities like it should, but the file does NOT have Japanese characters when you open it up in less or similar)

> A lot of good stuff, but maybe if I'd saved everything I'd find something that was missing from this collection, or we'd have confirmation that that's all there was.

Yeah, fair point. One is none...

There was a certain photograph about which you had a hallucination. You believed that you had actually held it in your hands. It was a photograph something like this.

Screwtape

Posted on 19-03-26, 05:50

Full mod

Post: #181 of 443
Since: 10-30-18

Last post: 892 days
Last view: 89 days

Posted by sureanem
The archive was made from a HTML4 page. Won't Python's built-in parser be good enough for any page that validates as HTML?

Does it validate as HTML4? I haven't checked myself, but I'd be very surprised; a vanishingly small number of web-pages validate cleanly.

Python's "parser" isn't really a parser so much as a lexer: given mis-nested tags like:

<div>Some <b>bold</div> <div>text</b> here</div>

Python's parser will just report:

div start tag
b start tag
div end tag
div start tag
b end tag
div end tag

...and it's up to the user to figure out what that means, if anything. My first attempt failed hilariously because certain HTML tags like <br> don't need to be closed, so I never got an "end tag" event, and wound up with the entire document coiled inside the first line-break.
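
For anyone who wants to reproduce that, here is a minimal sketch using the standard library's html.parser module. The TagLogger class and tag_events helper are names made up for illustration, not anything from my actual extraction scripts.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record start/end tag events exactly as html.parser reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(f"{tag} start tag")

    def handle_endtag(self, tag):
        self.events.append(f"{tag} end tag")


def tag_events(markup):
    """Feed markup through the lexer and return the list of tag events."""
    parser = TagLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


# Mis-nested tags come back as a flat event stream with no tree structure.
for event in tag_events("<div>Some <b>bold</div> <div>text</b> here</div>"):
    print(event)

# A bare <br> produces a start event and never a matching end event.
print(tag_events("line one<br>line two"))
```

Anything that builds a tree out of these events has to decide for itself which elements are void and how to recover from mis-nesting, which is the part the HTML5 parsing algorithm spells out.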

Line 81645 of bsnes_thread.html. Firefox 60.6.0esr (64-bit) from Debian stable repos (60.6.0esr-1~deb9u1).

Ah, I see it. For me:

- The original bsnes_thread.html encodes the text as regular, pure-ASCII HTML entities, which is good because the page header declares itself to use the "us-ascii" charset (that is, windows-1252)
- The text renders in proper Japanese for me in Firefox Nightly 68a1 on Debian Testing
- My extract-posts script decodes the entities and writes out a body.xml file in proper UTF-8
- None of the affected posts seem to be WIP announcements, so they're not normally converted to text, but if I do that manually the text still shows up in proper Japanese
- Interestingly, that post is a reply to byuu who seems to have pasted Shift-JIS-encoded text into his post (search for "What did I tell you guys about Japanese text")
- byuu's post *does* show up with a bunch of ƒs in it, but all of those are correctly encoded as pure-ASCII HTML entities in the source-code, too
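
The two behaviours in that list can be sketched in a few lines of Python. This is only an illustration under assumptions: the sample entity string is made up rather than copied from bsnes_thread.html, and decode_entities and misread_shift_jis are hypothetical helper names.

```python
import html

# Numeric character references are pure ASCII, so they survive any
# ASCII-compatible charset declaration. Illustrative sample only.
ENTITY_TEXT = "&#65298;&#12467;"


def decode_entities(text):
    """Turn HTML character references into real Unicode characters."""
    return html.unescape(text)


def misread_shift_jis(text):
    """Simulate Shift-JIS bytes being interpreted as windows-1252."""
    raw = text.encode("shift_jis")
    # Python's cp1252 codec leaves a few bytes undefined, hence errors="replace".
    return raw.decode("cp1252", errors="replace")


decoded = decode_entities(ENTITY_TEXT)
print(ENTITY_TEXT.isascii())           # the source markup itself is ASCII
print(decoded.encode("utf-8"))         # bytes as they would be written out as UTF-8
print(ascii(misread_shift_jis("コ")))  # katakana lead byte 0x83 turns into U+0192
```

Katakana in Shift-JIS all start with the byte 0x83, which windows-1252 maps to ƒ, so text pasted that way picks up one ƒ per katakana character.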

It looks like this particular issue is not causing problems, but thanks for noticing it and bringing it to my attention.

-----

In preservation news, I've updated the copy of the tukuyomi collection on archive.org. Previously I uploaded it as a tarball for maximum compression, but it turns out that IA just offered it up as a downloadable blob. That is definitely Better Than Nothing™, but awkward for other people to refer to.

Today I tried uploading the contents as individual files, but it turns out some of the archives in the collection are corrupt and so IA won't let me upload them.

As a compromise, I made a .zip file of all the contents and uploaded that. It means IA won't give you a nice gallery of images, etc. but at least you can browse the archive's files and download them individually.

The ending of the words is ALMSIVI.

Duck Penis

Posted on 19-03-26, 22:50

Stirrer of Shit
Post: #135 of 717
Since: 01-26-19

Last post: 1555 days
Last view: 1553 days

Posted by Screwtape

In preservation news, I've updated the copy of the tukuyomi collection on archive.org. Previously I uploaded it as a tarball for maximum compression, but it turns out that IA just offered it up as a downloadable blob. That is definitely Better Than Nothing™, but awkward for other people to refer to.

Today I tried uploading the contents as individual files, but it turns out some of the archives in the collection are corrupt and so IA won't let me upload them.

As a compromise, I made a .zip file of all the contents and uploaded that. It means IA won't give you a nice gallery of images, etc. but at least you can browse the archive's files and download them individually.

Bit of a hack, but try uploading via torrent, I think it'd be less sensitive that way.
If you don't want to expose your IP, make a webseeded torrent that points to black-ship.net.

There was a certain photograph about which you had a hallucination. You believed that you had actually held it in your hands. It was a photograph something like this.

JieFK

Posted on 19-03-27, 14:00 (revision 1)

Post: #5 of 10
Since: 10-29-18

Last post: 1807 days
Last view: 1567 days

Today I tried uploading the contents as individual files, but it turns out some of the archives in the collection are corrupt and so IA won't let me upload them.

Wow... Can you tell me which files are corrupt? Can you provide MD5/SHA* checksums as well?

FTR, this has happened before, can't remember when, nor the version (the only thing I remember for sure is that it was bsnes<???>r05.rar) though.
One file was corrupt. After much hesitation and multiple checks that I did not own a sane copy of said file, I resolved to delete it.
But then after that, I batch-opened every archive to make sure everything else was safe.
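
A batch check like that can be roughly scripted. The sketch below is an assumption about how one might do it with Python's standard library, not the procedure actually used; archive_problem and scan are made-up names, and it only covers zip and tar archives because the standard library cannot open RAR files.

```python
import tarfile
import zipfile
from pathlib import Path


def archive_problem(path):
    """Return None if the archive reads cleanly, else a short description."""
    path = Path(path)
    try:
        if zipfile.is_zipfile(path):
            with zipfile.ZipFile(path) as zf:
                bad = zf.testzip()  # name of the first member failing its CRC
                return f"bad member: {bad}" if bad else None
        if tarfile.is_tarfile(path):
            with tarfile.open(path) as tf:
                for member in tf:
                    if member.isfile():
                        # Read every byte so decompression errors surface.
                        with tf.extractfile(member) as fh:
                            while fh.read(1 << 20):
                                pass
            return None
        return "not a zip or tar archive"
    except Exception as exc:  # corrupt archives fail in many different ways
        return f"{type(exc).__name__}: {exc}"


def scan(directory):
    """Map each file name in `directory` to its problem (None means OK)."""
    files = sorted(p for p in Path(directory).iterdir() if p.is_file())
    return {p.name: archive_problem(p) for p in files}
```

Calling scan on a directory of downloads gives a dictionary whose non-None values are the files worth a second look.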

Screwtape

Posted on 19-03-27, 14:16

Full mod

Post: #184 of 443
Since: 10-30-18

Last post: 892 days
Last view: 89 days

According to my notes, bsnes_v055.zip and bsnes_v073.tar.bz2 are corrupt. However, bsnes_v055.zip is the pre-compiled Windows binaries, and the source in bsnes_v055.tar.gz is fine. bsnes v073 is legit missing, but I don't mind *too* much because that's recorded in the unofficial higan repo.

The file that actually caused me grief was ssnes.zip, which (now that I look more closely) I *think* is a redundant copy of the files in the "ssnes" directory beside it? Maybe it can just be removed.

> After much hesitation and multiple checks that I did not own a sane copy of said file, I resolved to delete it.

I figure I might as well keep even broken archives. Usually it's still possible to extract *some* files from before the corruption point, and that might be useful to somebody the same way that even broken clay tablets and torn scrolls are valuable to archaeologists.
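
As a rough illustration of that kind of salvage, here is a sketch for the zip case using Python's zipfile module. salvage_zip is a hypothetical helper, and it assumes the central directory at the end of the file is still readable; if that part is damaged, this approach will not get anywhere.

```python
import zipfile
import zlib


def salvage_zip(path_or_file):
    """Return ({name: bytes} for readable members, [names that failed])."""
    recovered, failed = {}, []
    with zipfile.ZipFile(path_or_file) as zf:
        for info in zf.infolist():
            try:
                # read() decompresses the member and verifies its CRC-32.
                recovered[info.filename] = zf.read(info)
            except (zipfile.BadZipFile, zlib.error, EOFError, OSError):
                failed.append(info.filename)
    return recovered, failed
```

Members that fail are only listed, not deleted, so the broken archive itself stays around for anyone with better recovery tools.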

The ending of the words is ALMSIVI.

JieFK

Posted on 19-03-27, 15:09 (revision 1)

Post: #6 of 10
Since: 10-29-18

Last post: 1807 days
Last view: 1567 days

Posted by Screwtape
According to my notes, bsnes_v055.zip and bsnes_v073.tar.bz2 are corrupt.

I just tested these two files, I was able to unzip/untar them fine. You can grab a copy from here:
> https://nikel.me/~jiefk/snesemu/emus/bsnes/

Also, here is a checksum of the files:
> https://nikel.me/~jiefk/snesemu/emus/bsnes/SHA256.txt
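
For comparing a local copy against a checksum listing, a minimal sketch follows. It assumes the listing uses the common sha256sum layout of one "<hex digest>  <filename>" pair per line, which has not been confirmed for the SHA256.txt above; sha256_of and verify are illustrative names.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(listing_text, directory):
    """Check assumed `<hex digest>  <filename>` lines against local files."""
    results = {}
    for line in listing_text.splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.strip().lstrip("*")  # sha256sum marks binary mode with '*'
        target = Path(directory) / name
        results[name] = target.is_file() and sha256_of(target) == expected.lower()
    return results
```

A False entry means the file is either missing or differs from the listed digest, so it is worth checking which before concluding a copy is corrupt.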
