For a long time, I have been puzzled by the connection between term rewriting and parsing. My intuition is that it should be possible to view parsing as the process of rewriting an input string to its syntax tree. Why would one wish to adopt this viewpoint? Well, since many computer scientists view parsing as a solved problem, there has not been much research activity in parsing in the past two decades. There has certainly been quite a bit, don’t get me wrong, including a number of exciting developments (I am not an expert, but I hear about these things: GLR parsers, scannerless parsers, parsing with derivatives, etc.). In contrast, term rewriting is a more general setting, with lots of interesting provably unsolvable problems. And nothing gets a theoretical computer scientist fired up like solving the unsolvable. Term-rewriting researchers solve 8 unsolvable problems before breakfast, on a slow day, and the research community, while not huge, is very active and has achieved a number of impressive results (as I have commented on this blog before). So I have always wondered whether, if we could succeed in applying the power of modern term-rewriting theory to parsing, we might end up with more usable, flexible, and powerful tools, instead of the unpleasant and brittle ones we limp along with today. I say these harsh things about mainstream parser generators because, in a sense, they have no interface: debugging a grammar written for an LR(1) parser generator, say, requires intimate understanding of the LR(1) parsing algorithm. At least, this is true for tools like Bison or ocamlyacc.
The similarity between parsing and term rewriting (or string rewriting) is so obvious that one would think the connection has been fully explored, particularly since both fields are very mature. Perhaps it has, but statements like the following, from the 1993 monograph “String-Rewriting Systems” by Ronald Book and Friedrich Otto, make me think my ignorance on this point might not be due just to not having read the right literature yet:
> One might think that results about string-rewriting systems and techniques developed for studying string-rewriting systems could be of use in studying context-free grammars and languages. However, there are few examples of this. Indeed, it would be of interest to determine precisely how the string-rewriting theory can be applied to formal language theory in general (including L-systems).
The obvious basic starting idea on this is just to interpret productions of a context-free grammar (CFG) as string-rewriting rules, by reversing the arrows. So take a sample (ambiguous) grammar like this one:
- expr -> NUM
- expr -> expr + expr
Reversing the arrows gives a string-rewriting system:
- NUM -> expr
- expr + expr -> expr
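To make this concrete, here is a minimal Python sketch (all names are my own, and I use a naive leftmost strategy) that treats the reversed productions as string-rewriting rules over a list of tokens:

```python
# Reversed grammar productions as string-rewriting rules (a sketch; names
# and representation are my own, not from any standard library).
RULES = [
    (["NUM"], ["expr"]),                # NUM -> expr
    (["expr", "+", "expr"], ["expr"]),  # expr + expr -> expr
]

def rewrite_once(tokens):
    """At the leftmost matching position, apply the first rule that fits."""
    for i in range(len(tokens)):
        for lhs, rhs in RULES:
            if tokens[i:i + len(lhs)] == lhs:
                return tokens[:i] + rhs + tokens[i + len(lhs):]
    return None  # no rule applies: we are at a normal form

def normalize(tokens):
    """Rewrite until no rule applies (terminates for this system)."""
    while (nxt := rewrite_once(tokens)) is not None:
        tokens = nxt
    return tokens

print(normalize(["NUM", "+", "NUM"]))  # ['expr']
```

On the input NUM + NUM, the normal form is the single token expr, i.e. the string reduces to the start symbol, the analogue of a successful parse.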
If we want to rewrite strings to their syntax trees, then we will want to have term-rewriting rules, rather than string rewriting rules. My understanding is that a standard encoding of string rewriting as term rewriting is to view every letter as a unary function symbol. So we would have
- NUM(r) -> expr(r)
- expr(+(expr(r))) -> expr(r)
Here, the variable r would match the rest of the string (encoded as a term). These rules don’t yet build syntax trees. It seems a little nicer to me, for that purpose, to use a different encoding of string rewriting as term rewriting, where letters are constants (0-ary function symbols), and there is a right-associative binary operator “.” for connecting letters into strings. By the way, on either encoding, we can use a constant symbol “$” for the end of the string:
- NUM . r -> expr . r
- expr . + . expr . r -> expr . r
Now to build syntax trees, we just need to make non-terminals unary function symbols, whose argument is the parse tree. We can assign each rule a name, and use that to label the nodes of the syntax tree associated with an application of the rule (here, I will assume that NUM has no associated data, although in reality we might use a scannerless approach where even terminal symbols are unary, holding the sequence of characters matched from the string). I am calling the first rule Num and the second Plus.
- NUM . r -> expr(Num) . r
- expr(t1) . + . expr(t2) . r -> expr(Plus(t1,t2)) . r
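Here is a hedged Python sketch of these tree-building rules (representation and names are my own): tokens sit in a list standing for the “.”-chain, expr(t) is modeled as the pair ("expr", t), and rules are applied leftmost-first until no rule matches:

```python
# Tree-building rewriting, sketched in Python. expr(t) is the pair
# ("expr", t); the string is a token list (so "." is implicit).

def is_expr(x):
    return isinstance(x, tuple) and x[0] == "expr"

def step(ts):
    """Apply one rule at the leftmost matching position, or return None."""
    for i in range(len(ts)):
        # NUM . r -> expr(Num) . r
        if ts[i] == "NUM":
            return ts[:i] + [("expr", "Num")] + ts[i + 1:]
        # expr(t1) . + . expr(t2) . r -> expr(Plus(t1,t2)) . r
        if (i + 2 < len(ts) and is_expr(ts[i]) and ts[i + 1] == "+"
                and is_expr(ts[i + 2])):
            t1, t2 = ts[i][1], ts[i + 2][1]
            return ts[:i] + [("expr", ("Plus", t1, t2))] + ts[i + 3:]
    return None

def parse(ts):
    """Rewrite to normal form."""
    while (nxt := step(ts)) is not None:
        ts = nxt
    return ts

print(parse(["NUM", "+", "NUM", "$"]))
# [('expr', ('Plus', 'Num', 'Num')), '$']
```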
Now we start to see a benefit of this approach: the resulting term rewriting system can be checked for confluence, and we will discover the critical pair expr(Plus(Plus(t1,t2),t3)) . r and expr(Plus(t1,Plus(t2,t3))) . r. The user could simply add a rewrite rule to indicate which associativity is preferred. We could employ termination checkers and (local) confluence checkers as needed, to ensure the term rewriting system was guaranteed to yield unique parse trees for every string. Strings not in the language of the original grammar would have parse trees that are malformed, in the sense that they still contain the “.” operator: they are partly parse trees, partly unparsed strings. That might conceivably make error reporting easier.
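The critical pair can be observed concretely by varying the rewriting strategy. In this self-contained Python sketch (representation and names are mine), normalizing NUM + NUM + NUM with the leftmost Plus-redex gives the left-associated tree, while the rightmost redex gives the right-associated one:

```python
# Two normal forms for one string: the concrete face of the critical pair.
# expr(t) is the pair ("expr", t); the string is a token list.

def is_expr(x):
    return isinstance(x, tuple) and x[0] == "expr"

def plus_redexes(ts):
    """Positions where expr(t1) . + . expr(t2) can be contracted."""
    return [i for i in range(len(ts) - 2)
            if is_expr(ts[i]) and ts[i + 1] == "+" and is_expr(ts[i + 2])]

def contract(ts, i):
    t1, t2 = ts[i][1], ts[i + 2][1]
    return ts[:i] + [("expr", ("Plus", t1, t2))] + ts[i + 3:]

def parse(ts, pick):
    """Normalize, choosing Plus redexes with `pick` (e.g. min or max).
    Assumes the input reduces to a single expr term."""
    ts = [("expr", "Num") if t == "NUM" else t for t in ts]  # Num rule first
    while (rs := plus_redexes(ts)):
        ts = contract(ts, pick(rs))
    return ts[0][1]

s = ["NUM", "+", "NUM", "+", "NUM"]
print(parse(s, min))  # ('Plus', ('Plus', 'Num', 'Num'), 'Num')
print(parse(s, max))  # ('Plus', 'Num', ('Plus', 'Num', 'Num'))
```

The two distinct results are exactly the two sides of the critical pair above, instantiated with t1 = t2 = t3 = Num.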
But finally, we come to the puzzle I wanted to share with you about all this. Ambiguity of CFGs — the existence of multiple distinct parse trees for a single string — is well-known to be undecidable. The proof in “Introduction to Automata Theory, Languages, and Computation” by Hopcroft and Ullman is by reduction from Post’s Correspondence Problem, and is quite slick. But with the translation I am suggesting, we would end up with terminating term rewriting systems (assuming we do not add any rules as I suggested for associativity above). It would seem that the property analogous to unambiguity would be confluence of the resulting rewrite system (the one that builds parse trees). But confluence of terminating term-rewriting systems is decidable: one just checks for local confluence by testing joinability of critical pairs (like the example I gave above) resulting from non-trivial overlapping of left-hand sides of rules, since by Newman’s Lemma, termination plus local confluence implies confluence. And joinability for a terminating system is decidable. So we seem to have a mismatch: ambiguity of CFGs is undecidable, but confluence of the translations of the CFGs is decidable. What is going on?
[You might want to ponder this for a moment — unless you already see the answer. It took me a day to get this. My answer is below.]
There are several things to say about the relation between ambiguity of CFGs and confluence of their translations to term-rewriting systems. But basically, the fuzzily proposed connection is not precise enough. I conjecture that unambiguity of CFGs corresponds exactly to ground confluence towards a target (the start symbol) for the resulting term-rewriting system. By ground confluence, I mean confluence for variable-free terms only. And by “confluence towards a target”, I mean confluence for terms which rewrite to a certain constant symbol (or here, an application of the start symbol to a parse tree), and only such terms. Ground confluence towards a target is quite a different problem from full confluence, and that is why there is no paradox here. In practice, this seems to mean that using an off-the-shelf confluence checker would be overly conservative: it might label as ambiguous (non-confluent) a term-rewriting system which is actually unambiguous (ground confluent towards the target). Of course, confluence implies ground confluence, so there would be no false reports of unambiguity, only of ambiguity. I am interested in exploring further whether an approach like this — and I have not talked about the issue of rewriting towards a target here, which also must be addressed — could be used to create a more usable parser generator based on term-rewriting.