Deprogramming the LALR(1) Bias
Look around for LR(1) parser generators, and you will primarily find LALR(1) parser generators. There seems to be an unspoken assumption that LALR(1) is somehow better than LR(1), but look at the following pertinent facts:
- In terms of grammars that can be handled by various parser generation techniques, SLR(1) ⊂ LALR(1) ⊂ LR(1).
- SLR(1) and LALR(1) tables are always the same size, but LR(1) tables are potentially larger.
So then, why is there a bias toward LALR(1)? I suspect that it has to do with the well known and widely cited dragon book, which treats LALR(1) as the culmination of the parser generation algorithms it presents. There is one little detail that is not mentioned at all though: it is possible to compress LR(1) tables to be the same as LALR(1) tables, with the exception of avoiding table compression that would introduce LALR(1)'s reduce/reduce conflicts. Granted, the book is over 20 years old now, but I would not be surprised if the new edition preserves this omission.
I recently finished implementing a parser generator (as the basis for a GLR parser, as described by the Elkhound technical report). I was initially unable to wrap my head around the "efficient" method that the dragon book provides for LALR(1) parser generation, so I took the incremental approach of implementing SLR(1), converting that to LR(1), then finally converting that to LALR(1) -- the "easy, inefficient" method according to the dragon book. Well, when I got to the point of compressing LR(1) tables to LALR(1) tables, I questioned the necessity of compressing states when reduce/reduce conflicts would result. As near as I could tell, there is no fundamental requirement to do so, which means that compressed LR(1) tables that avoid such conflicts should be the gold standard, rather than LALR(1).
This seemed too obvious to have been overlooked by the compiler community, so I looked high and low for prior art. The best I found was a comp.compilers post by Chris F. Clark on 1 July 2003, which says, in part:
That brings us back to "full" or canonical LR(k), where k is >= 1.There is likely some discussion of this optimization somewhere in the primary literature, but several hours of searching failed to turn it up. The closest I found is Sönke Kannapinn's PhD thesis (in German, but the abstract is also in English), which comes to the same conclusion via a very different route.
LR(k) parsing preserves the needed left context to disambiguate
certain situations where LALR(k) finds the rules to conflict. The
canonical way of doing this is to not merge states with different
lookaheads in the first place. That causes an explosion in the
table size as many states are kept distinct simply because they
have different lookaheads, when in reality the lookahead for those
states will never be consulted. A more modern method for solving
the problem involves splitting the states only when a conflict is
detected. Then, if the grammar is LALR, no splitting will occur
and one has the same size machine as the LR(0), but as conflicts
arise the require left context to resolve, the tables slowly grow.
In summary, if you need to write a parser generator, make it LR(1) and use careful table compression, rather than using the less general LALR(1) approach.

12 Comments:
Long ago (1973) there was a paper by David Pager (U. of Hawaii) on the "lane tracing" algorithm to construct LR(k) parsers.
ABSTRACT. The paper presents, as far as the author is aware, the first practical general method for con-
structing LR(k) parsers. It has been used, without computational difficulty, to produce LR(1), LR(2) and
LR(3) parsers for grammars of the size of ALGOL.
Search "david pager lane tracing" on Google and you might ba able to get a copy.
Excellent, thank you for the reference, georges-m-r.louis. I finally broke down and re-joined the ACM this morning in order to get access to various papers on this subject. Another apparently pertinent paper is Efficient Full LR(1) Parser Generation, by David Spector.
I've been fighting LR(1) parser generation performance problems for a couple of weeks now, so these papers are of keen interest at the moment.
The Menhir parser generator for O'Caml is an LR(1) parser generator based on Pager's algorithm. The reference given in Menhir's manual is:
David Pager. A practical general method for constructing LR(k) parsers. Acta Informatica, 7:249–268, 1977.
Although O'Caml comes with a LALR parser generator (ocamlyacc) Menhir is quickly replacing it as the preferred tool of many developers.
Another way to look at it is as follows: How many reduce-reduce conflicts have you encountered which was due to the fact that you used a tool which only accepted LALR grammars instead of full LR?
On one level I totally agree with you, parser generators should accept LR grammars. But on the other hand I have personally never been bitten by the LALR restriction and so I really couldn't care less. I prefer to fight the battles that actually make a difference. This one does not.
Hi, I once tried this approach. I built an LR(1) parser generator. When I applied it to a substantial grammar (ADA I think) I found it to be painfully slow compared to its LALR(1) counterpart. All the extra states resulted in lots of churning away. Snappy LALR(1) is much better when doing grammar development.
Note: I haven't tried the optimization mentioned.
josef, I agree that this isn't a battle worth bleeding over, but it's surprising to me that there's a battle to be fought at all. =) I was surprised to discover that LALR is actually a pessimization as compared to LR. As for encountering LALR-induced reduce-reduce conflicts, I encountered them repeatedly while using the lemon LALR parser generator to develop the grammar for Lyken (the language project which is motivating my parser generation efforts). The critical issue with these conflicts was that I didn't have sufficient understanding of the parser generation algorithms at the time to be able to fully understand why something that looked completely legitimate to me was causing conflicts.
adrian, I had the same performance experience with my parser generator when I applied it to the Lyken grammar. Parser generation would have taken days, maybe even weeks, to complete. Fortunately, the 1977 paper by David Pager that chris k mentioned describes an algorithm that completely solves the algorithmic complexity issue. I was able to read that paper and retrofit its algorithms into my parser generator in one (admittedly very long) day. The biggest challenge was that the paper uses very different terminology from what I've encountered elsewhere. For the Lyken grammar, the Pager LR(1) algorithm currently takes approximately 4X as long as the SLR(1) algorithm does. This is for 105 tokens, 107 non-terminals, 300 productions, and 544 resulting states, which is quite large enough to demonstrate that there are no significant algorithmic scaling problems.
So do you think you have come up with a full solution for LR(1) parser generator using Pager's algorithm? Something like yacc or bison that can join the open source community? I'm really interested to know. I have not found another one yet.
I know that around 1981 there was a LR parser generator implementing Pager's algorithm in fortran. Then the Menhir parser generator for O'Caml as mentioned by Chris. I'm looking around to see if a implementation in C/C++ exists.
tom, yes, I am confident that my parser generator implementation is in fact LR(1), though its API is not the traditional YACC/bison one. chris k's comment mentions Menhir, which is also apparently LR(1), using the same basic approach.
I plan to release my Python-based parser generator as open source; the only reason for the delay is that I wanted to shake out the bugs. It has been pretty solid for a while now, so now I just need to spend a bit of time publishing it.
In your post: "For the Lyken grammar, the Pager LR(1) algorithm currently takes approximately 4X as long as the SLR(1) algorithm does. This is for 105 tokens, 107 non-terminals, 300 productions, and 544 resulting states, which is quite large enough to demonstrate that there are no significant algorithmic scaling problems." So approximately what's the time/memory cost for this?
Also for the original Knuth LR(1) implementation in Python, did you have a time/memory cost estimation?
Hope to see your publication soon.
t, I don't know the algorithmic complexity of Pager's algorithm off the top of my head, but I can give you some rough idea of performance for the Lyken grammar.
As an aside, the Lyken grammar has since grown to 104 tokens, 117 non-terminals, and 525 productions. This is because I found GLR parsing to have some unacceptable exponential blowup possibilities (due to ambiguity between lvals and expressions) that essentially required the removal of all ambiguity. As a result, the Lyken grammar is now completely unambiguous, at the expense of a large number of extra productions.
On an Opteron 2.2GHz system, it takes 5:17 and approximately 220MB RAM to generate the parser pickle. It is worth noting that the number of productions is not a strong predictor for parser generation time -- I have two sets of 32 interrelated productions (for procedure declarations and calls) that are to blame for perhaps 3/4 of the total runtime.
Experiments with a C-based Python module that used the Knuth LR(1) algorithm showed Python to be about 100X slower than C. Extrapolating from those results, I would expect a C-based Pager LR(1) parser generator to be able to process the Lyken grammar in 5-10 seconds at the most. This is on par with the runtimes for generating a Lyken parser with lemon, which is an LALR(1) parser generator.
As for the original Knuth LR(1) implementation, it would have exhausted memory long before finishing (it would have required at least hundreds of gigabytes), and it would have taken at least months to complete. I let it run for nearly three days, and it was far from finishing, but using nearly 16GB of memory.
I would be happy to compare my LR(1) parser generator construction experiences with other people. My product is called: LRGen 7.0 and it is generating parsers in about the same time as my other LALR parser generator.
For the DB2 grammar, one of the largest in my possession, it takes about 1.5 seconds to generate the parser tables. Most of the time is spent in compressing the tables.
I compiled it with the Microsoft Visual Studio 2005 C/C++ compiler.
I used the Pager concept of merging the states during construction and it was rather difficult, I think, but it runs very fast. I used a few other tricks also.
Note the DB2 grammar produces over 3,200 LALR type states (LR(0) construction states).
Paul B Mann
paulbmann.com
Post a Comment
<< Home