Compilerware
Open Source Fast LALR(k) Parser Generator
 DOWNLOADS 
AMDG
INTRODUCTION Downloads Feedback From Users Release Information PRODUCTS LRSTAR DFASTAR DFASTAR/c THEORY Definition Of Terms Theoretical Information COMPANY About Us Contact and Support  
Donate SourceForge Page

Theoretical Information

NFA, DFA, PDA

NFA

NFA means Nondeterministic Finite Automata.  An NFA may have more than one choice of action (which state to go to next) for an input symbol.  If it makes the wrong choice and finds out after more symbols are read, then it has to backtrack and try another path.  Consequently, NFA's may be slow and are undesirable.

DFA

DFASTAR means Deterministic Finite Automata.  A DFA has only one choice of action (which state to go to next) for each input symbol.  Consequently, it never has to backtrack and try another path.  DFA's are desirable, because they are faster and simpler.  DFA processors do not use a stack or make reductions.  Therefore, they cannot be used to parse computer languages.  They are used for lexical analysis only.

PDA

PDA means Push Down Automata.  A PDA is a DFA with a stack.  A PDA processor has a stack which it uses like a memory.  PDA's can be used for parsing computer languages.  PDA's can be used for lexical analysis also, however, the speed of DFA's are 5 times faster than PDA's, when used for lexical analysis (based on our research).

Common Error

Some people talk about LALR DFA's.  This is wrong terminology.  An LALR parser uses a stack and operates as a PDA processor, not a DFA.



LALR

LALR is a computer science acronym for LOOK AHEAD LEFT RIGHT.  It is a method of parsing computer languages.  Parsing is recognizing patterns in input statements that match rules in the grammar which was used to create the parser.  When parsing a computer language, an LALR parser:

  • Uses a LOOK AHEAD symbol to aid the recognition process,
  • Reads input statements from LEFT to right,
  • Makes reductions on the RIGHT first, working backward toward the left side of the parse stack. This is also called bottom-up parsing.



Myths

LALR cannot handle context-sensitive languages like C++.

This is not true.  LRSTAR has a semantic grammar symbol that solves this problem (e.g. {typedef}).  At parsing time, when an incoming <identifier> could be ambiguous, a symbol-table lookup will change the <identifier> to a {typedef} which is not ambiguous.  Thus, LRSTAR parsers can handle difficult computer languages like C++.

LALR does not work when you need to have actions embedded in the rules of the grammar.

This is not true.  Because of the integrated symbol-table manager and automatic AST constructor, you rarely need to have actions embedded in the rules.  You can wait until the rule is recognized and then execute your actions.  Actually, for actions that produce output, it's better to build the AST while parsing and then after parsing, execute the actions during a traversal of the AST. 

LALR, bottom-up parsing, is too difficult and complicated.  LL or top-down parsing is easier.

This is a confusing topic.  LALR grammars are actually more natural to write than LL grammars when you don't try to embed actions in the rules.  An LL parsing algorithm is actually more complicated because it always requires more than one lookahead symbol for real-world languages.  LALR usually does NOT need more than one lookahead symbol to make parsing decisions, so the LALR parsing algorithm is simpler.

What makes LALR seem difficult is the reporting of conflicts in your grammar.  Most conflicts are shift-reduce conflicts and are NOT a problem.  Those are just like warning messages.  You can turn off those messages with LRSTAR, by using the 'rr' option.  The 'rr' option only shows reduce-reduce conflicts, which are a serious problem.  The reduce-reduce conflicts can be resolved by an LALR(k) or an LR(k) parser generator, if the grammar is not truly ambiguous. 



LALR vs LL

LALR(1) is a larger language class than LL(1).  Therefore, LALR(1) is a more powerful notation than LL(1) for defining a lexer or parser, where (1) indicates the number of lookahead symbols used by the parser to make parsing decisions.  LL(k) is often used to overcome the deficiency of LL(1), however, the resultant LL(k) parsers are slower and more complicated.  A rough estimate NOT based on testing results would suggest that LALR parsers are at least 2 times the speed of LL parsers and possibly 10 times the speed of LL parsers.  LALR has its complexities as well, but that is a huge topic for another web page.  Quite often, personal preference is the deciding factor.  Many people prefer to write LALR grammars.



SLR, LALR, LR(1)

LR(0) State Machine

Both SLR and LALR parser generators construct an LR(0) state machine and have the same states.  However, they differ in the computation of the lookahead set for each state.  The SLR method may report erroneous conflicts, which LALR never does.  So LALR is the preferred method.

SLR(1) Lookahead Sets

SLR is a term that means Simple LR.  An SLR parser generator uses a simple method to compute the set of lookahead symbols for each state.  It uses the BNF grammar to compute the lookahead sets.  It looks at every occurrence of a nonterminal symbol in the grammar to see what terminal symbols may follow it.  This is called the Follow Set.  The context in the grammar determines the lookahead sets.  This is actually too simple and consequently an SLR parser generator may report conflicts that really do not exist in the LR(0) state machine. 

LALR(1) Lookahead Sets

LALR is a more powerful concept than SLR.  An LALR parser generator uses a more complicated method to compute the set of lookahead symbols for each state.  It looks at the LR(0) state machine rather than the grammar to compute the lookahead sets.  LALR gives 100% accurate lookahead sets for the LR(0) state machine.  If any conflicts are reported by LALR, they are true conflicts.  If conflicts are reported, then more than 1 symbol of lookahead is required to resolve the conflicts or perhaps the grammar is truly ambiguous for any number of lookahead symbols.

LALR(k) Parsing

Quite often you can rewrite your grammar to reduce the number of conflicts.  However, most of the conflicts are usually shift-reduce conflicts and the default action chosen by the parser generator is the shift action, which is what you want 99% of the time.  So the only conflicts you have to worry about are the reduce-reduce conflicts. LRSTAR parsers can usually resolve the reduce-reduce conflicts at parsing time, by doing LALR(k) nondeterministic parsing (see option 'nd').  Usually only a few states have reduce-reduce conflicts and require more than one symbol of lookahead, so the impact on parsing speed is minimal.

Canonical LR(1) Construction

LR usually means Canonical LR(1) and it's a more powerful concept than LALR.  A Canonical LR(1) parser generator computes the LR(1) state machine, instead of LR(0).  This is fine for small toy grammars, but for real-world computer languages, way too many states are created.  For the COBOL-85 language definition, over 2,000,000 states were created by a canonical LR(1) parser generator, whereas only 1,720 states were created by an LALR parser generator.  Canonical LR offers only one advantage over LALR.  The error handling part of the LR parser algorithm is simpler, because no useless reductions will be made when an error is encountered.  Thus, error-recovery algorithms are simpler.

Minimal LR(1) Construction

Minimal LR is another type of parser generation algorithm which is the best of all.  Minimal LR gives the recognition power of LR(1) and the small table size of LALR.  There are two different ways to do the Minimal LR parser-table construction.  One way is to do build the LR(0) state machine first and then do state splitting (duplicating states) to resolve conflicts.  The other way is to attempt to build the LR(1) state machine but merge states with "compatible" lookahead sets while building the LR(1) state machine.  This will keep the number of states from "exploding".

LR(1) vs. LALR(1)

In the real world, I have not yet seen a computer language that is LR(1) and not LALR(1).  In my experience, computer languages that are not LALR(1) are not LR(1) either.  They are LR(2) or LR(3).  These can be handled by an LALR(1) parser with the addition of hand-written code to look ahead and make a parsing decision.  Or an easier way is to use an LALR(k) parser generator which can automatically resolve the conflicts at parsing time. 

© Copyright Compilerware 2012.  All rights reserved.