Introduction

Recursive-descent parsers have always interested me, and in the past year and a half I wrote a few articles on the topic. Here they are in chronological order:
  • Recursive descent, LL and predictive parsers
  • Some problems of recursive descent parsers
  • A recursive descent parser with an infix expression evaluator

The third article describes a method that combines RD parsing with a different algorithm for parsing expressions to achieve better results. This method is actually used in the real-world, for example in GCC and Parrot (source).

An alternative parsing algorithm was discovered by Vaughan Pratt in 1973. Called Top Down Operator Precedence, it shares some features with the modified RD parser, but promises to simplify the code, as well as provide better performance. Recently it was popularized again by Douglas Crockford in his article, and employed by him in JSLint to parse Javascript.

I encountered Crockfords article in the Beautiful Code book, but found it hard to understand. I could follow the code, but had a hard time grasping why the thing works. Recently I became interested in the topic again, tried to read the article once more, and again was stumped. Finally, by reading Pratts original paper and Frederick Lundhs excellent Python-based piece [1], I understood the algorithm.So this article is my usual attempt to explain the topic to myself, making sure that when I forget how it works in a couple of months, I will have a simple way of remembering.

The fundamentals

Top down operator precedence parsing (TDOP from now on) is based on a few fundamental principles:

  • A "binding power" mechanism to handle precedence levels
  • A means of implementing different functionality of tokens depending on their position relative to their neighbors - prefix or infix.
  • As opposed to classic RD, where semantic actions are associated with grammar rules (BNF), TDOP associates them with tokens.

Binding power

Operator precedence and associativity is a fundamental topic to be handled by parsing techniques. TDOP handles this issue by assigning a "binding power" to each token it parses.Consider a substring AEB where A takes a right argument, B a left, and E is an expression. Does E associate with A or with B? We define a numeric binding power for each operator. The operator with the higher binding power "wins" - gets E associated with it. Lets examine the expression:1 + 2 * 4

Here it is once again with A, E, B identified:

1 + 2 * 4 ^ ^ ^ A E BIf we want to express the convention of multiplication having a higher precedence than addition, lets define the binding power (bp) of * to be 20 and that of + to be 10 (the numbers are arbitrary, whats important is that bp(*) > bp(+)). Thus, by the definition weve made above, the 2 will be associated with *, since its binding power is higher than that of +.Prefix and infix operatorsTo parse the traditional infix-notation expression languages [2], we have to differentiate between the prefix form and infix form of tokens. The best example is the minus operator (-). In its infix form it is subtraction:

a = b - c # a is b minus c

In its prefix form, it is negation:a = -b # b has as magnitude but an opposite signTo accommodate this difference, TDOP allows for different treatment of tokens in prefix and infix contexts. In TDOP terminology the handler of a token as prefix is called nud (for "null denotation") and the handler of a token as infix is called led (for "left denotation").

The TDOP algorithm

Heres a basic TDOP parser:

def expression(rbp=0):
global token
t = token
token = next()
left = t.nud()
while rbp < token.lbp:
t = token
token = next()
left = t.led(left)

return left

class literal_token(object):
def __init__(self, value):
self.value = int(value)
def nud(self):
return self.value

class operator_add_token(object):
lbp = 10
def led(self, left):
right = expression(10)
return left + right

class operator_mul_token(object):
lbp = 20
def led(self, left):
return left * expression(20)

class end_token(object):
lbp = 0

We only have to augment it with some support code consisting of a simple tokenizer [3] and the parser driver:

import re
token_pat = re.compile("s*(?:(d+)|(.))")

def tokenize(program):
for number, operator in token_pat.findall(program):
if number:
yield literal_token(number)
elif operator == "+":
yield operator_add_token()
elif operator == "*":
yield operator_mul_token()
else:
raise SyntaxError(unknown operator: %s, operator)
yield end_token()

def parse(program):
global token, next
next = tokenize(program).next
token = next()
return expression()

And we have a complete parser and evaluator for arithmetic expressions involving addition and multiplication.Now lets figure out how it actually works. Note that the token classes have several attributes (not all classes have all kinds of attributes):
  • lbp - the left binding power of the operator. For an infix operator, it tells us how strongly the operator binds to the argument at its left.
  • nud - this is the prefix handler we talked about. In this simple parser it exists only for the literals (the numbers)
  • led - the infix handler.

The key to enlightenment here is to notice how the expression function works, and how the operator handlers call it, passing in a binding power.

When expression is called, it is provided the rbp - right binding power of the operator that called it. It consumes tokens until it meets a token whose left binding power is equal or lower than rbp. Specifically, it means that it collects all tokens that bind together before returning to the operator that called it.Handlers of operators call expression to process their arguments, providing it with their binding power to make sure it gets just the right tokens from the input.Lets see, for example, how this parser handles the expression:

3 + 1 * 2 * 4 + 5

Heres the call trace of the parsers functions when parsing this expression:<<expression with rbp 0>>

<<literal nud = 3>>
<<led of "+">>
<<expression with rbp 10>>
<<literal nud = 1>>
<<led of "*">>
<<expression with rbp 20>>
<<literal nud = 2>>
<<led of "*">>
<<expression with rbp 20>>
<<literal nud = 4>>
<<led of "+">>
<<expression with rbp 10>>
<<literal nud = 5>>
The following diagram shows the calls made to expression on various recursion levels:

The arrows show the tokens on which each execution of expression works, and the number above them is the rbp given to expression for this call.

Apart from the initial call (with rbp=0) which spans the whole input, expression is called after each operator (by its led handler) to collect the right-side argument. As the diagram clearly shows, the binding power mechanism makes sure expression doesnt go "too far" - only as far as the precedence of the invoking operator allows. The best place to see it is the long arrow after the first +, that collects all the multiplication terms (they must be grouped together due to the higher precedence of *) and returns before the second + is encountered (when the precedence ceases being higher than its rbp).Another way to look at it: at any point in the execution of the parser, theres an instance of expression for each precedence level that is active at that moment. This instance awaits the results of the higher-precedence instance and keeps going, until it has to stop itself and return its result to its caller.If you understand this example, you understand TDOP parsing. All the rest is really just more of the same.
推薦閱讀:
相关文章