Published Jan 11, 2007
In my most recent post, I talked about the disadvantages of LALR parser generation as compared to the more general LR method. Although I now believe even more fervently that LR parser generation is the clear winner over LALR, I need to retract part of what I said, and expound on what really works. Indeed, I have since then transcended a technical travail, thanks in part to some most helpful comments from readers.
First, let us informally go over a bit of history. The earliest high-level languages were a real challenge to create, partly because the designers did not at that time have adequate language parsing theory at their disposal. This gave rise to a flurry of research, and in 1965, Donald Knuth published the seminal LR paper (Knuth, 1965). This provided a foundation upon which to base a general understanding of parsing, but the method for generating LR parsers was extremely computationally intensive, both in terms of space and time. (This method can be found in the dragon book.) Eventually, researchers began to publish derivative methods that were more practical. Of particular note are the SLR (DeRemer, 1971) and LALR (LaLonde, 1971) methods. The original yacc LALR parser generator came into use in about 1973. In ten years, we went from having little understanding of parsing theory to having powerful tools. Perhaps this is why later work apparently went unnoticed. In particular, Pager (1977) seems to have been practically forgotten.
LALR was devised as a compromise method to get around the extreme cost of the standard LR method. However, Pager’s method delivers full LR power, without any downside. Pager’s method merges states during construction wherever possible, and the generated parser is approximately the same size as that generated by LALR. Computationally, the algorithm is on par with LALR. Finally, the algorithm is no more challenging to implement than LALR [1].
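To make the merging idea a bit more concrete, here is a rough sketch of my own (not Pager's exact bookkeeping, and the data layout is invented for illustration) of what the merge step during state construction might look like. `compatible` stands in for a compatibility test such as Pager's weak compatibility; several states may legitimately share a core if they failed that test when first compared.

```python
# Sketch of the merge step in a Pager-style LR(1) construction.
# A state is represented as [core, lookahead_sets], where `core` is a
# tuple of LR(0) items and `lookahead_sets` holds one set per core item.

def add_state(states, core, lookaheads, compatible):
    """Fold a newly computed state into `states`, merging if possible.

    Returns the (possibly pre-existing) state and a flag saying whether
    anything changed; a change means successors must be revisited so
    the new lookaheads propagate.
    """
    for st_core, st_las in states:
        if st_core == core and compatible(st_las, lookaheads):
            changed = False
            for old, new in zip(st_las, lookaheads):
                if not new <= old:
                    old |= new      # merge lookaheads in place
                    changed = True
            return (st_core, st_las), changed
    # No compatible state with this core: add a genuinely new state.
    new_state = (core, [set(s) for s in lookaheads])
    states.append(list(new_state))
    return new_state, True
```

With a compatibility test that always succeeds, this degenerates to LALR-style merging by core; with Pager's test, incompatible states stay split and full LR power is preserved.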
Now, back to my travails. I have been developing a parser generator as supporting software for Lyken, which is a language I am currently developing. Parser generation is a means to an end for the purposes of Lyken, so I was hoping to spend as little time as possible on this part of the project. In fact, I started out using lemon, which is an excellent LALR parser generator. However, as the Lyken grammar neared its final form, I found that the convolutions necessary to wedge it into a form lemon could handle had left me with an unmaintainable mess. Even small modifications often required far-reaching changes, which was going to force me to do substantial work to normalize the AST (abstract syntax tree) in order to make the guts of the compiler immune to superfluous parser changes [2].
Eventually, I decided to go back and look more carefully at what GLR parsing could do to alleviate the problems I was having. There is an excellent paper by Scott McPeak that describes in detail how to implement a practical GLR parser (McPeak 2002), and I was sufficiently convinced of GLR’s ability to help me clean up the Lyken parser mess that I took a step back and implemented a parser generator from scratch [3]. The parser generator ended up taking about 200 hours to implement, not because it requires lots of coding, but because I had to learn a lot along the way.
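McPeak's paper builds a full GLR parser around a graph-structured stack; as a much smaller illustration of the core idea, namely pursuing every conflicting parse action with its own stack rather than failing, here is a toy driver of my own (not McPeak's algorithm), with hand-built LR(0)-style tables for the ambiguous grammar S -> S S | 'a'. State 3 carries the shift/reduce conflict.

```python
# Each ACTIONS[state][token] entry is a *list* of actions, so conflicts
# are simply multiple entries; a deterministic LR driver would reject
# such a table, but this driver forks one stack per action.
ACTIONS = {
    0: {"a": [("shift", 2)]},
    1: {"a": [("shift", 2)], "$": [("accept",)]},
    2: {"a": [("reduce", "S", 1)], "$": [("reduce", "S", 1)]},
    3: {"a": [("shift", 2), ("reduce", "S", 2)], "$": [("reduce", "S", 2)]},
}
GOTO = {0: {"S": 1}, 1: {"S": 3}, 3: {"S": 3}}

def glr_accepts(tokens):
    stacks = {(0,)}                      # a set of LR state stacks
    for tok in list(tokens) + ["$"]:
        # 1. Reduce closure: keep applying reductions valid under `tok`.
        work, closed = list(stacks), set(stacks)
        while work:
            st = work.pop()
            for act in ACTIONS.get(st[-1], {}).get(tok, []):
                if act[0] == "reduce":
                    _, lhs, n = act
                    base = st[:-n]
                    new = base + (GOTO[base[-1]][lhs],)
                    if new not in closed:
                        closed.add(new)
                        work.append(new)
        if tok == "$":
            return any(("accept",) in ACTIONS.get(st[-1], {}).get("$", [])
                       for st in closed)
        # 2. Shift `tok` on every stack that allows it; the rest die off.
        stacks = {st + (act[1],)
                  for st in closed
                  for act in ACTIONS.get(st[-1], {}).get(tok, [])
                  if act[0] == "shift"}
        if not stacks:
            return False
    return False
```

A real GLR implementation merges the forked stacks into a graph-structured stack to stay efficient (and builds parse trees along the way); this set-of-stacks version only decides acceptance, but it shows why a conflicted table stops being fatal.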
I ran into a myriad of stumbling blocks; the most serious are listed below:
Overall, the time requirements for this project were about what I estimated beforehand, but what actually took up that time surprised me in the extreme.
[1] In the interest of full disclosure, I mention here that Pager presents two methods for determining state compatibility during state merging. “Weak compatibility” is in practice perfectly adequate, and simple to implement, though there are some edge cases, rarely encountered in practice, in which a limited number of states are duplicated. “Strong compatibility” guarantees that the state machine is minimal, though it requires somewhat more sophisticated programming, and appears to be included in the paper mainly for completeness, rather than due to practical need.
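For the curious, the weak-compatibility condition can be sketched roughly as follows (my own paraphrase of the usual statement of the test; each state is given as one lookahead set per core item, and the item order is assumed to match between the two states' shared core):

```python
# Sketch of Pager's weak-compatibility test between two LR(1) states
# that share the same core.

def weakly_compatible(state1, state2):
    """Merging is safe unless some pair of items could conflict only
    because of lookaheads contributed by the *other* state."""
    n = len(state1)
    for i in range(n):
        for j in range(i + 1, n):
            # A new conflict can only arise if a lookahead of one item
            # in one state meets a lookahead of the other item in the
            # other state...
            cross = (state1[i] & state2[j]) or (state2[i] & state1[j])
            if not cross:
                continue
            # ...and neither state already pairs those lookaheads on
            # its own (in which case merging adds no new conflict).
            if (state1[i] & state1[j]) or (state2[i] & state2[j]):
                continue
            return False
    return True
```

For example, states with lookaheads ({a}, {b}) and ({b}, {a}) fail the test (merging would pair a with b where neither state did), while identical states trivially pass.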
[2] Also of note, Lyken requires two tokens of lookahead in one case, but this was not a critical parser issue, since I was able to resolve the extra token of lookahead within the scanner.
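The Lyken case itself isn't spelled out here, so as a purely hypothetical illustration of the trick, a scanner-side pass can rewrite a token based on the token after it, leaving the parser with an ordinary one-token-lookahead grammar. The token kinds below are invented for the example:

```python
# Hypothetical scanner post-pass: fold two-token lookahead into the
# token stream by retagging an ID that is immediately followed by '('.
# The parser then distinguishes CALL_ID from ID with normal LR(1)
# lookahead, instead of needing LR(2).

def disambiguate(tokens):
    """Yield (kind, text) pairs, rewriting ID before LPAREN to CALL_ID."""
    toks = list(tokens)
    for i, (kind, text) in enumerate(toks):
        nxt = toks[i + 1][0] if i + 1 < len(toks) else None
        if kind == "ID" and nxt == "LPAREN":
            yield ("CALL_ID", text)
        else:
            yield (kind, text)
```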
[3] Elkhound appears to be an excellent parser generator, but I am bootstrapping Lyken with Python. I had already gone through enough trouble gluing Lyken’s C-based scanner/parser and the Python portions of the compiler that I felt it worthwhile to move to writing the bootstrap compiler purely in Python if I was going to be rewriting the parser anyway. Unfortunately, there were no parser generation tools for Python that met my needs, which is why I wrote one from scratch.
DeRemer, F.L. (1971) Simple LR(k) grammars. Comm. ACM 14, 453-460.
Knuth, D.E. (1965) On the translation of languages from left to right. Information and Control 8, 607-639.
LaLonde, W.R. (1971) An efficient LALR parser generator. Tech. Rep. CSRG-2, University of Toronto.
McPeak, S. (2002) Elkhound: A fast, practical GLR parser generator. Technical report, University of California, Berkeley.
Pager, D. (1977) A practical general method for constructing LR(k) parsers. Acta Informatica 7, 249-268.
Thanks for your post. It helped me learn about the other algorithms out there. Also, I was wondering: how can I access the tech report by LaLonde? I tried the CSRG listing of tech reports, but it is not mentioned there. Can you help me regarding this?
At November 14, 2008 7:14 AM, Jason said…
srishtee, I don’t have a copy of the LaLonde technical report in my archives, so it looks like I was never able to find it either.
This is now a pretty dated post, but for the record, I would like to mention that I have a copy. :)