Screwtape |
Posted on 18-11-22, 12:27
|
Full mod
Post: #43 of 443 Since: 10-30-18 Last post: 1101 days Last view: 172 days |
In university I learned about shift/reduce parsers with lex and yacc (well, flex and bison), and it made parsing seem like crazy wizardry where you cobbled together a thing and beat on it until it worked, and never touched it again for fear of breaking it.

A few years ago, I learned about PEG/Packrat parsers, which were much, much nicer to deal with - it's just regular functions calling regular functions, so you can write and understand them like any other program. Well, any other program that does a ton of indirection and recursion. And it's not too difficult to accidentally make your grammar infinitely recursive and get a stack overflow.

More recently I learned about Earley parsers, which aren't the most efficient or the most straightforward, but (I'm told) are the most robust. Simple grammars run in linear time, more complex grammars run much slower, but at least they run instead of crashing or getting stuck in an infinite loop. I've even heard that they handle ambiguous grammars, and just give you *all* the possible parse trees... but I might have misunderstood.

Unfortunately, about the only person on the Internet who's written anything about Earley parsers is a guy who wrote a Perl implementation called Marpa, and almost every discussion I've found on the topic winds up at "Marpa is cool and you should use it" rather than "here's how to write an Earley parser". Luckily, I found one actual tutorial, and I've been slowly working my way through it.

I've been writing software for a couple of decades now; I've done a lot of stuff, and I'm confident about being able to tackle anything I come across. But I've mostly been doing the same kinds of things all that time, and eventually I forget what it's like to try something weird and alien. 
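That "regular functions calling regular functions" shape is easy to see in a toy sketch (hypothetical code, not from any real implementation):

```python
# A toy PEG-style parser: every grammar rule is an ordinary Python function
# that takes the remaining input and returns (value, rest) on success, or
# None on failure.
def parse_digit(s):
    if s and s[0].isdigit():
        return int(s[0]), s[1:]
    return None

def parse_sum(s):
    # sum <- digit ('+' sum)?  -- the recursion is on the RIGHT, so each
    # recursive call sees a strictly shorter input and terminates.
    parsed = parse_digit(s)
    if parsed is None:
        return None
    value, rest = parsed
    if rest.startswith("+"):
        tail = parse_sum(rest[1:])
        if tail is not None:
            tail_value, rest = tail
            value += tail_value
    return value, rest
```

Flip the rule around to `sum <- sum '+' digit` and the very first call recurses on the same input forever - that's the accidental infinite recursion and stack overflow mentioned above.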
I got most of the way through the first page of the tutorial easily enough, and then I got to a part where the tutorial said "...and obviously our program produces output X" while my program produced output Y, and I had no idea what had gone wrong. Eventually I figured it out and made my way onto the second page of the tutorial, which introduces a new wrinkle to the problem. This time the tutorial said "...our program produces buggy output X", and instead, my program straight up crashes.

I have heard it said that the most important skill when learning to program is the ability to imagine what the computer is doing, step by step. I'm sure the second most important skill is the patience to work through ridiculously unhelpful errors.

The ending of the words is ALMSIVI. |
Screwtape |
Posted on 18-11-26, 10:25
|
Full mod
Post: #44 of 443 Since: 10-30-18 Last post: 1101 days Last view: 172 days |
I am still plugging away at this. I have gone through the "basic recogniser", "dealing with empty rules" and "dealing with right-recursive rules" sections of the tutorial, and I believe my code all works as it should and gives me linear-time parsing, etc. etc.

The first section, the "basic recogniser", went pretty smoothly. The second section, about empty rules, showed up some problems I hadn't noticed in the first section, but after fixing them and carefully considering the tutorial text, I figured it out, and was happy.

The third section, about right-recursive rules, was a bit hairy. This time, the tutorial started off basically saying "this is *kind of* how you do this, but it makes life complex and I'm not sure it's a good idea", which was worrying. It described the problem and kind of sketched a solution, which helped a bit but wasn't the guided tour I was hoping for. The tutorial also linked to somebody else's comment on the tutorial, which basically said "don't get confused by the original academic paper, do it eagerly instead of lazily" - which didn't help at all, since I wanted to be *un-confused*, not to study a confusing academic paper and then study a pithy decryption key. So I tossed all that aside and tried to figure something out from first principles, and after banging my head against the wall for a few hours I finally got something that seems to work.

Next up is the fourth section, about turning this cool recogniser into an actual parser. I had a brief read of it this evening, and it doesn't seem like there's anything tricky in it at all. That's good, since it means I don't have to puzzle out subtleties. It's also bad, since it means a bunch of brute-force programming. I'm not even sure what data structures I'll need; the tutorial has a whole bunch of stuff about how to select a parse-tree to return if the grammar is ambiguous... but I didn't pick a parsing algorithm that handles ambiguous grammars just to not return a parse forest at the end. 
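For anyone following along, the "basic recogniser" stage boils down to the classic predict/scan/complete loop over a chart. A minimal sketch (a hypothetical toy, not my actual code - and note it deliberately ignores the empty-rule wrinkle that the second section exists to fix):

```python
# A bare-bones Earley recogniser over a toy grammar.  An item is
# (rule_index, dot_position, start_offset); chart[k] holds the items that
# are live after k tokens have been read.  NOTE: this naive version
# mishandles empty rules -- exactly the problem the tutorial's second
# section addresses.
GRAMMAR = [
    ("S", ("S", "+", "n")),   # S -> S + n   (left recursion is fine here)
    ("S", ("n",)),            # S -> n
]
NONTERMINALS = {head for head, _ in GRAMMAR}

def recognise(tokens, start="S"):
    chart = [set() for _ in range(len(tokens) + 1)]
    for i, (head, _) in enumerate(GRAMMAR):
        if head == start:
            chart[0].add((i, 0, 0))
    for k in range(len(tokens) + 1):
        work = list(chart[k])
        while work:
            rule, dot, origin = work.pop()
            head, body = GRAMMAR[rule]
            if dot < len(body):
                sym = body[dot]
                if sym in NONTERMINALS:
                    # PREDICT: start every rule for the expected nonterminal
                    for i, (h, _) in enumerate(GRAMMAR):
                        if h == sym and (i, 0, k) not in chart[k]:
                            chart[k].add((i, 0, k))
                            work.append((i, 0, k))
                elif k < len(tokens) and tokens[k] == sym:
                    # SCAN: the dot moves over a matching terminal
                    chart[k + 1].add((rule, dot + 1, origin))
            else:
                # COMPLETE: advance every older item waiting on this head
                for r2, d2, o2 in list(chart[origin]):
                    h2, b2 = GRAMMAR[r2]
                    if d2 < len(b2) and b2[d2] == head:
                        new = (r2, d2 + 1, o2)
                        if new not in chart[k]:
                            chart[k].add(new)
                            work.append(new)
    return any(d == len(GRAMMAR[r][1]) and o == 0 and GRAMMAR[r][0] == start
               for r, d, o in chart[len(tokens)])
```

`recognise(["n", "+", "n"])` happily walks the left-recursive `S -> S + n` rule, which is the trick recursive descent can't pull off.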
I guess we'll see how it goes. I wonder if I'll have to write a literate implementation guide for Earley parsers like I did for suffix arrays. The ending of the words is ALMSIVI. |
Kawaoneechan |
Posted on 18-11-26, 10:31
|
Now with less self-mockery
Post: #89 of 599 Since: 10-29-18 Last post: 195 days Last view: 23 min. |
I liked your suffix arrays guide so I'm kinda looking forward to this. |
Screwtape |
Posted on 18-11-27, 11:07
|
Full mod
Post: #45 of 443 Since: 10-30-18 Last post: 1101 days Last view: 172 days |
I sat down to work on making a parse-tree today, and because I want to work with ambiguous grammars, my first test-case was the good ol' "dangling else" situation:
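In sketch form (an illustrative stand-in, not the exact input):

```python
# The classic dangling-else shape: one token sequence, two legal trees.
source = "if a then if b then x else y"

inner = "if a then (if b then x else y)"   # else binds to the NEAREST if
outer = "if a then (if b then x) else y"   # else binds to the OUTER if

def tokens(s):
    # Strip the disambiguating parentheses and split into tokens.
    return s.replace("(", " ").replace(")", " ").split()

# Both bracketings flatten back to the same token sequence -- which is
# exactly what makes the grammar ambiguous.
assert tokens(inner) == tokens(outer) == tokens(source)
```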
Does the "else" belong to the first "if" or the second? Properly, it's ambiguous; according to my code it's "Invalid input: got 'else', expected EOF". I pretty quickly figured out why that was happening, but when I fixed it, it broke my "empty rule" handling. I spent a good few hours adding print-statements to my code to figure out what it was doing, trying to look up other Earley parser guides and tutorials to see if they explained things any better, and there might even have been some solitaire games in there somewhere to clear my mind. Eventually I decided that I must have misunderstood a crucial chunk of the tutorial - it explained a crucial idea briefly, in English, and it didn't occur to me there might have been interpretations other than mine. I tried a different approach and then all my test cases passed at the same time. Huzzah! Surely that must be the very last bug in my code, and I'll spend *tomorrow* figuring out parsing. The ending of the words is ALMSIVI. |
Broseph |
Posted on 18-11-27, 14:39
|
Post: #18 of 166 Since: 10-29-18 Last post: 1561 days Last view: 1237 days |
Posted by Screwtape

Yeah, I dislike that kind of ambiguity. Pretty sure GML would treat the 'else' as belonging to the second 'if', but whenever I check code written by others, the first thing I do is remove that ambiguous stuff.

Hm... can't get spaces to work when using [code] [/code], it seems. The board just removes any space before a character on a new line, and anything that has more than one space. |
Kawaoneechan |
Posted on 18-11-27, 15:21 (revision 1)
|
:3
Post: #95 of 599 Since: 10-29-18 Last post: 195 days Last view: 23 min. |
Posted by Broseph

I'll look into it.

Edit: adding a pre in there seems to help? But I'd prefer using the source tag instead.

reg_t kPrintDebug(EngineState *s, int argc, reg_t *argv) {
    const Common::String debugString = s->_segMan->getString(argv[0]);
    debugC(kDebugLevelGame, "%s", format(debugString, argc - 1, argv + 1).c_str());
    return s->r_acc;
} |
Broseph |
Posted on 18-11-28, 10:29 (revision 1)
|
Post: #20 of 166 Since: 10-29-18 Last post: 1561 days Last view: 1237 days |
Works fine now. Thank you.
Or, to save space if there's only one line of code with the second if/else:
|
Sintendo |
Posted on 18-11-28, 20:33
|
Post: #4 of 17 Since: 10-29-18 Last post: 1802 days Last view: 1374 days |
Posted by Screwtape

You could say the tutorial was... ambiguous. |
Screwtape |
Posted on 18-11-29, 11:55
|
Full mod
Post: #48 of 443 Since: 10-30-18 Last post: 1101 days Last view: 172 days |
> You could say the tutorial was... ambiguous.

Ayyyyyup.

Long story short, I now have working code that builds a parse tree. Just as I expected, I spent a *long* time scraping away at this, with no particular insights or sudden moments of clarity. It took me a couple of attempts to figure out what the resulting data structures should look like, and then it took even longer to figure out how to fill them in.

The nature of an Earley parser is that there's a bunch of phases the parser goes through for each character, before and after it updates the "current offset". Things are usually filed under the offset at which they completed, but if you do a thing *before* the current offset is updated, you might need to file it as though it happened *after* the offset was updated. And does a thing complete on the offset of its last character, or the offset *after* the last character? Even if it completes on the offset of the last character, maybe you want to *file* it under the following offset to make range arithmetic easier...

Basically, it was a rats' nest of off-by-one errors that I had to trace through and understand. I couldn't just hammer at it until it worked; I had to make sure there weren't equal-and-opposite errors cancelling each other out. But it's done now: I got it working, and I begin to have faith that maybe I can understand this contraption after all.

Now that the tests pass, the next step is to refactor and simplify. I already found a collection field that always had exactly one item in it, so I cleaned it away. There's another compound data-type I'm using in a bunch of different places that should really be its own data-type, so I can hang some methods off it instead of doing the same verbose tickling in a bunch of places. I went through and marked as "private" all the things I could, so the auto-generated documentation is very clean and focussed, which I like. 
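To make the filing convention concrete (a sketch with made-up names, not my actual data structures):

```python
# chart[k] holds items that are live after k tokens have been consumed, so
# a rule whose last token is tokens[j] gets FILED under the following
# offset, j + 1.  Spans then come out half-open, which keeps the range
# arithmetic honest.
tokens = ["a", "b", "c"]
chart = [[] for _ in range(len(tokens) + 1)]   # offsets 0 .. len(tokens)

# A completed item here is just (rule_name, start_offset).
chart[2].append(("AB", 0))    # matched tokens[0:2]: ends ON token 1,
                              # but is filed under the offset AFTER it
chart[3].append(("ABC", 0))   # matched tokens[0:3]

rule, start = chart[2][0]
assert tokens[start:2] == ["a", "b"]   # half-open spans: no off-by-one
```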
So, yeah, next up: cleaning and tidying and polishing, before I start building the next crazy new thing on top. I'm looking forward to it, really, since it won't require deep thinking: I can just make a change, re-run the tests, and if they fail, undo and try something else. The ending of the words is ALMSIVI. |
Screwtape |
Posted on 18-12-07, 11:55
|
Full mod
Post: #53 of 443 Since: 10-30-18 Last post: 1101 days Last view: 172 days |
It turns out I do not have working code that builds a parse tree. Or rather, it *generally* builds a parse tree, but there's a specific case where things break. In the tutorial I linked, the page "Optimizing Right Recursion" describes an algorithm whose logic is approximately "keeping track of all the details is expensive, so we'll just remember the big picture". Turns out, that's fine if you just need to recognise whether an input matches a grammar, but if you want to reconstruct the parse tree (as I do) then you legitimately *do* need all those details. Apparently this right-recursion optimisation is an important part of the Earley/Marpa algorithm's big-O performance guarantees, so I don't want to toss it. On the other hand, I can't really say I support grammars with right-recursion if they just mysteriously fail. I think maybe it's time to email the guy who wrote the tutorial and ask for further illumination. The ending of the words is ALMSIVI. |
wareya |
Posted on 18-12-07, 16:30
|
Post: #16 of 100 Since: 10-30-18 Last post: 1781 days Last view: 1346 days |
How does a parser have issues with RIGHT recursion? Isn't that, like, the opposite of the norm? |
Screwtape |
Posted on 18-12-07, 23:52
|
Full mod
Post: #54 of 443 Since: 10-30-18 Last post: 1101 days Last view: 172 days |
Recursive descent (PEG, combinator) parsers have problems with left-recursion because they're eager: they immediately try the first option and backtrack if it doesn't work out. However, Earley parsers are lazy: they keep matching incoming tokens against grammar rules, ruling out possibilities as they go, and don't actually need to do much of anything until they find a complete rule.

As a result, when a rule does complete, it can trigger an avalanche of completions. This right-recursion optimisation works by skipping over all the intermediate stages of the avalanche and going right to the top... but those intermediate stages still need to be part of the parse tree, so I'm a bit stuck.

The ending of the words is ALMSIVI. |
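The eagerness is easy to demonstrate with a toy sketch (hypothetical code):

```python
# A directly left-recursive rule makes an eager recursive-descent parser
# call itself on the SAME input before consuming anything, so the stack
# overflows before a single token is read.
def parse_expr(s):
    # expr <- expr '+' num / num  -- the first alternative recurses at once
    left = parse_expr(s)
    if left is not None:
        value, rest = left
        if rest[:1] == "+" and rest[1:2].isdigit():
            return value + int(rest[1]), rest[2:]
    if s[:1].isdigit():
        return int(s[0]), s[1:]
    return None

try:
    parse_expr("1+2")
except RecursionError:
    print("left recursion blew the stack before reading a single token")
```

An Earley parser shrugs this off because it never commits to an alternative up front; the trade-off is that right recursion produces the completion avalanche described above.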
Screwtape |
Posted on 18-12-10, 10:18
|
Full mod
Post: #57 of 443 Since: 10-30-18 Last post: 1101 days Last view: 172 days |
I mentioned Earley parsing on Lobsters the other day, and somebody replied with a link to a parsing tutorial of their own. It's less wordy than the first one I tried, but it comes with automatically-tested code in Python, which I'm much more familiar with; and best of all, it actually acknowledges that the problem I'm facing exists and provides a worked solution! Unfortunately, things are never simple:

- The Earley algorithm keeps track of things by their end-offset, so when the parse is done, it's natural to start at the end and work your way back to the beginning. However, both tutorials I've seen devote a lot of effort to re-indexing all the parse data by its start offset, because it's "less confusing". I'm not confused by the Earley model, but I *am* confused when I try to convert "just do X, Y and Z" instructions to my inverted frame of reference.
- Pulling a parse tree out of the Earley working data involves finding and stitching together the few relevant fragments. The tutorials I've seen run through the input and then, as a second phase, wade through the results for the interesting bits; but it seems to me you can recognise the interesting bits as and when they occur, so why not build the parse-tree as you go? Unfortunately, this makes it very hard to map concepts from the tutorials onto my code-base.

After staring at this new tutorial for a while and digging around, I eventually came up with a simple solution that fixed my broken test-case. I wasn't sure it was completely correct or preserved the linear-time property, but "seems to be working" always feels a lot better than "definitely broken".

The new tutorial also came with some extra examples and test-cases, so I added those to my test suite... and they broke. In fact, if I comment out the whole "optimization" entirely (which should be safe, since optimizations shouldn't affect correctness), one of my original test-cases is still broken. 
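The re-indexing step the tutorials perform amounts to something like this (a sketch with made-up item shapes):

```python
from collections import defaultdict

# The Earley chart naturally files items by END offset: chart[end] holds
# (rule, start_offset) pairs.  Tree-building wants to read left to right,
# so the tutorials flip the index to start offsets first.
chart = [
    [],                       # end = 0
    [("A -> a", 0)],          # end = 1: an A spanning tokens[0:1]
    [("S -> A b", 0)],        # end = 2: an S spanning tokens[0:2]
]

by_start = defaultdict(list)
for end, items in enumerate(chart):
    for rule, start in items:
        by_start[start].append((rule, end))

assert by_start[0] == [("A -> a", 1), ("S -> A b", 2)]
```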
It's 9PM and I'm too tired to think any more tonight, but at least I'll have actual reproducible problems to dig into tomorrow. The ending of the words is ALMSIVI. |
wareya |
Posted on 18-12-10, 20:10 (revision 1)
|
Post: #17 of 100 Since: 10-30-18 Last post: 1781 days Last view: 1346 days |
> Pulling a parse tree out of the Earley working data involves finding and stitching together the few relevant fragments. The tutorials I've seen seem to run through the input and then as a second phase wade through the results for the interesting bits, but it seems to me you can recognise the interesting bits as and when they occur, so why not build the parse-tree as you go? Unfortunately, this makes it very hard to map concepts from the tutorials onto my code-base.

I don't actually know anything about the kind of parser you're trying to make, but this sounds just like lexer-parser and parser-compiler separation to me. It probably simplifies things a lot, structurally, to have the different jobs be completely separated, even if it looks like you can do them at the same time. |
Screwtape |
Posted on 18-12-11, 06:29
|
Full mod
Post: #58 of 443 Since: 10-30-18 Last post: 1101 days Last view: 172 days |
I hadn't thought of it that way, but after looking into it more closely, I believe you're right.

The core idea of yacc/bison parsing is shift vs. reduce, and the core idea of PEG parsing is recursive descent; but the core of Earley parsing is a soup of parse-rules in progress, whose convection is fuelled by the arrival of new tokens. One of the most basic things about rule-soup is that the rules are unordered, and although the algorithm will advance all the relevant rules eventually, there's no guarantee that they'll show up "in document order", as the HTML specifications say.

I still think it's a good idea to collect completed rules into a separate data structure as they're found, instead of rummaging in the soup for them afterward. HOWEVER, I had assumed that when a rule was completed, all its symbols would already be completed, and that is not true. All its symbols will be completed before we receive the next token, but they might not be completed yet. Now that I think back, I've been bitten by this assumption a couple of times already, but I just assumed it was an implementation flaw rather than a design flaw.

So I need a new approach: collect completed rules as they occur into a separate data structure, but don't try to impose any structure on them. When parsing is complete, we can assume all the completed rules have arrived, and we can build our parse tree in one go. And *if* we're going to stick completed rules into a separate data structure, we might as well index them by starting offset rather than ending offset...

And thus I wind up at the same approach everybody told me to use, except that now I understand why it's important. Now to try and implement it, and see what my original problem looks like by the time I work my way back to it... The ending of the words is ALMSIVI. |
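In sketch form, the plan looks something like this (made-up item shapes, not my actual code):

```python
from collections import defaultdict

# During the parse, completed rules are dumped into a flat list in whatever
# order the soup produces them -- note the parent S is recorded BEFORE its
# child B, which is exactly the ordering assumption that kept biting.
completed = [
    ("A", 0, 1),   # (head, start_offset, end_offset)
    ("S", 0, 2),   # parent completes first...
    ("B", 1, 2),   # ...its child only shows up afterwards
]

# Once parsing is finished, every completion is guaranteed to be present,
# so index by start offset and assemble the tree in a single pass.
by_start = defaultdict(list)
for head, start, end in completed:
    by_start[start].append((head, end))

# The children of S (spanning [0, 2)) are the completed items tiling that
# span -- findable now, even though B arrived after S did.
children = [(h, s, e) for h, s, e in completed if (h, s, e) != ("S", 0, 2)]
assert [h for h, _, _ in children] == ["A", "B"]
```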