Askar Safin

Rust, Linux, programming languages, type systems

Let me tell you why you should (nearly) never use PEG (parsing expression grammars). Nearly everything I will say applies to parser combinators (parsec in Haskell, nom in Rust), too.

So, don't use PEG. Use CFGs (context-free grammars) instead. They are more natural. I feel that CFGs more naturally represent how we think. Thus when you have some language in your head and you try to write it down as a grammar, there is more chances that you do this successfully if you try to write CFG grammar as opposed to PEG grammar.

(All this doesn't apply to situations where you are trying to parse some language, whose specification explicitly says that the language is based on PEG formalism as opposed to CFG formalism. One good example is Python. Python spec says Python is PEG, so you should use PEG parser for parsing Python.)

CFG and PEG are incompatible. I mean that same grammar can have different meaning depending on how you interpret it: as a CFG or as a PEG. Thus decision on whether to use CFG or PEG should happen early. I. e. when you have a grammar in you head and you try to write it down, then before you wrote even single letter to a paper (or to a text file), you should decide whether you are going to write CFG or PEG grammar. (Update 5 June 2024: this is exaggeration: of course, you can change grammar formalism later, I just meant that this is not always easy.)

Okay, you may say: “Please, give me proofs that CFG is more natural than PEG”. Okay, here we go: https://blog.reverberate.org/2013/09/ll-and-lr-in-context-why-parsing-tools.html . This article properly summarizes what are the problems with PEG I'm worrying about. Especially this part: “which means that every PEG rule could be hiding a 'conceptual' ambiguity”.

And that problem actually happened to me in practice!

I tried to parse a language described by this grammar:

t0 = t1
t0 = "!!" id "::" t0 "." t0
t1 = t2
t1 = t2 "==>" t1
t2 = t3
t3 = t4
t3 = t4 "::" t0
t3 = "%" id "::" t0 "." t3
t4 = t999
t999 = t1000
t999 = t999 t1000
t1000 = id
t1000 = "(" t0 ")"

(If you are curious, this is grammar for my pure type system checker)

To do so, I wrote a Haskell program using Parsec. I blindly converted grammar above to Haskell code. My code looked like this (you don't need to understand Haskell to see my point):

data Term = Pi String Term Term | ...

t0 = t1 <|> do {
  reservedOp "!!";
  a <- id;
  reservedOp "::";
  b <- t0;
  reservedOp ".";
  c <- t0;
  return (Pi a b c);
}

...

In other words, I simply converted grammar above to code line-by-line. The code worked, all was okay. I wrote many files in my language. But then I accidentally discovered my grammar was ambiguous! Namely, text A :: B ==> C can be parsed two ways: (A :: B) ==> C and A :: (B ==> C). And parsec-based parser somehow chooses one of these parse trees!

So, now I have a lot of files written in this language, but I have no way to ensure they are parsed correctly, i. e. I have no way to ensure my parser chooses parse tree I meant!

Why this happened? This happened, because (as turned out) when I wrote that grammar, I meant it is CFG. Simply because my brain thinks in terms of CFG. But parsec is based on different formalism, which is closer to PEG. Thus CFG grammar was interpreted as something PEG-like, and thus parsec didn't understand what I meant.

And if you agree with me that grammar above in intuitive sense is ambiguous, i. e. if you agree that A :: B ==> C allows two parse trees, then your brain thinks in terms of CFG, too! CFGs are simply more natural.

CFG grammar above is ambiguous, but when interpreted as PEG it is unambiguous, because PEG grammars are always unambiguous. Ambiguity is simply hidden. Josh Haberman writes about this in the article above.

All this gave me psychological trauma (figuratively speaking). I will never use Parsec (and PEG parsers) anymore.

Also, even if you are not convinced, you still should use CFG if you try to parse some language using its specification, and that specification says that the language is CFG. You cannot use CFG grammar as a PEG grammar, these formalisms are incompatible.

Moreover, if some specification simply says “grammar” or “BNF”, then in 99% cases this means it is CFG. (For example, I was unable to find in Rust reference whether presented grammar is CFG or PEG. They simply say “grammar”. But I'm sure they meant CFG, because it is simply a convention to say in specifications “grammar”, when you mean “CFG”.)

Also, CFG grammars themselves are split further into various types: LL grammars, LR grammars, etc. But all these are simply subsets of larger set of CFG grammars. And all they are compatible in the following sense: if some grammar is, say, LL grammar and LR grammar in the same time, then that grammar means the same thing when interpreted as LL and when interpreted as LR. I. e. if you enter that grammar to LL parser and to LR parser, then these two parsers will behave same way. Different parsing libraries, which target different CFG subsets (LL, LR, etc), differ in their algorithm, in scope of supported grammars, but never in semantics of particular grammar (assuming that the grammar itself is supported in both libraries).

So, if you program in Rust, don't use nom, combine and other parser combinator libraries. And don't use PEG parsers. Use CFG-based parsers, such as lalrpop. (Unfortunately, lalrpop is restricted to LR(1) grammars [this is subset of CFG]. And I was unable to find library, which combines lalrpop benefits with support for arbitrary CFGs. It is possible some day I will write such library. If I do, then I will announce it here, so, please, subscribe!)

If you program in Haskell, don't use parsec and its zillions of clones (gigaparsec, attoparsec, megaparsec, trifecta) and most of other combinator parser libraries. Use CFG-based libraries and parser generators, such as Earley and happy. But note that it is possible to write parser combinator library, which is based on unordered choice, i. e. on CFGs. For example, Earley can be considered such a library. Also, these two parser combinator libraries are CFG-based, too: https://hackage.haskell.org/package/base-4.15.0.0/docs/Text-ParserCombinators-ReadP.html#t:ReadS , https://hackage.haskell.org/package/transformers-0.5.5.0/docs/Control-Monad-Trans-Class.html#g:5 . So, parser combinators are not necessary evil.

Unfortunately, Earley crashes on some particular tricky grammar ( https://github.com/ollef/Earley/issues/54 ). Also, there are other things I don't like in Earley and happy. It is possible I will write some better library in the future. If I do, I will announce it here, so, please, subscribe! Also, see my Haskell work below.

That said, CFG is good for languages, which are “structured enough”, “well enough designed”. If you want to parse some ad hoc language, for example, some ad hoc logs, then PEG and parser combinators may be a better solution. CFG is good, when you know the language, which you want to parse, beforehand. Especially if it has specification. CFG parsing is good for “well designed languages”, such as Pascal, Javascript, Rust. It is bad for less structured languages, such as Bash, TeX, C, C++.

I'm biased, because I usually parse languages, which are either designed by me, either are “structured enough” with well written specifications.

Also, if this post gets popular, I want to take chance to advertise my own Haskell parsing libraries. Here we go:

  • Library for checking whether given CFG grammar is ambiguous. This is impossible to do in general, so this library simply tries all combinations up to provided limit. I was unable to find Haskell library for same purpose, i. e. my library is unique. Moreover, I found one such library only across all programming languages! Published on Hackage. https://mail.haskell.org/pipermail/haskell-cafe/2021-May/134004.html
  • Parser combinator library similar to Earley and based on Earley, but which allows to wrap some monad. For example, if you want to mix type checking code with parsing code. My library is based on arrows (because it is impossible to do any other way). Of course, my library is based on unordered choice, i. e. on CFGs. My method seems to be unique. Published on Hackage. https://mail.haskell.org/pipermail/haskell-cafe/2021-July/134205.html
  • Parser and lexer for reversible parsing with guaranteed reversibility. Seems to be totally unique across all programming languages. Of course, CFG-based. Unfortunately, nobody answered to my email, so I don't have motivation to actually publish the library to Hackage. But if you need it, please, write me, and I will publish it. https://mail.haskell.org/pipermail/haskell-cafe/2021-July/134217.html

Here are some articles about parsing and CFGs. Their authors are better writers than me. – Already mentioned https://blog.reverberate.org/2013/09/ll-and-lr-in-context-why-parsing-tools.htmlhttps://tratt.net/laurie/blog/2020/which_parsing_approach.html

Subscribe to this blog using button in the bottom of https://safinaskar.writeas.com/ or at Mastodon ( https://types.pl/@safinaskar ). I plan to write about formal methods, Rust and Linux.

You can reach me via email safinaskar@gmail.com

Discuss at: – https://lobste.rs/s/nybhsl/this_is_why_you_should_never_use_parserhttps://types.pl/@safinaskar/112553410605946965 (Mastodon) – https://www.reddit.com/r/rust/comments/1d77mpa/this_is_why_you_should_never_use_parser/

This post is answer to Jesper Cockx's article “1001 Representations of Syntax with Binding”, so you can read it first (but this is not required).

Note: I will be biased towards practical programs (compilers and interpreters), as opposed to developing metatheory.

Also note that I'm not expert. I have little experience with de Bruijn indices and similar things.

Let me add some methods and comparison criteria to Jesper's article.

First of all, let me introduce one method not mentioned in the Jesper's article: randomly generated globally unique identifiers for all bound variables! I didn't find this method anywhere, so it is possible it has some disadvantages I am not aware of.

I will discuss pros and cons of this method later, but first let me share some criteria not mentioned in the article.

First, ability to perform capture-avoiding substitution for lambda-terms in more or less sane way. (Yes, I'm aware this criteria is not so clearly formulated.) Naive approach (i. e. when you just use strings for variables) obviously fails to confirm to this criteria. On the other hand, de Bruijn indices do meet this criteria. In fact, nearly all existing methods meet this criteria (this is why they are created!). This criteria is so obvious that it is possibly not worth to be mentioned. Yet let me mention it anyway.

Next criteria: any variable is always encoded same. Good example of method, which fails to satisfy this criteria: de Bruijn indices. Let me explain. Consider this term:

λ x. λ y. λ z. (x y)

Let's assume for some reason we want to move subterm x y out of λ z. (x y) term (to get λ x. λ y. (x y)). Then we cannot just move the term as is. Because encoding of x y depends on where the term is located. So we need to change the term's representation. So, de Bruijn indices don't satisfy this criteria.

On the other hand, both naive encoding and unique identifiers do satisfy this criteria.

At this point some de Bruijn indices fan may say: “Wait! You just use de Bruijn indices wrong way! You should not try to change subterms located deeply inside other terms. You should use locally nameless representation (from “1001 Representations of Syntax with Binding”). Then when you want to mess with some deeply located subterm, you should first open your big term using openT from “1001 Representations of Syntax with Binding”, possibly multiple times, and then change that deeply located subterm needed way.

Well, yes, I agree. If we call openT repeatedly, then indeed this de Bruijn indices disadvantage disappears. But this means that every time we want to modify something inside a term, we need to open it!

Let's assume we are writing optimizing compiler. Job of optimizing compiler is to modify some small parts located inside of AST. With de Bruijn indices every time we want to change some small part of AST we will need first to call openT repeatedly, reach needed part, change it, and then closeT many times again. This is very artificial. And very slow. Our compiler will simply be too slow.

So, I still think that “any variable is always encoded same” is important property and that de Bruijn indices violate this property.

The other advantage of “any variable is always encoded same” is simple and natural equality function for subterms. Consider this term:

λ x. λ y. ((λ z. (x y)) (x y))

This term contains subterm x y two times, yet they are encoded differently with de Bruijn indices. And moreover, we simply cannot write function, which will compare them for alpha-equivalence based on terms alone. Because to compare them we must consider not only terms themselves, but also their location! So, with de Bruijn indices we either need some complicated alpha equality function, which deals with locations, either we again need to call openT many times! Let's again consider optimizer compiler use case. In optimizing compiler we often need to compare some subterms (located deeply inside of bigger term) to decide whether we need to apply some optimization. So, merely to determine whether we need an optimization, we need to call openT many times! And if we call openT N times and our big term has size M, then, as well as I understand, we need O(NM) operations for openT alone!

Next criteria: no need to came up with artificial numbering of variables. Consider this Rust source:

fn f(x: i32, y: i32) {
  let z = 0;
  let [t, u] = [0, 0];
  match foo {
    None => {},
    Some(v) => println!(x), // <-- how to encode "x" here?
  }
}

How to encode x in println!(x) with de Bruijn indices? To do this we need to specify how we assign order to all this bindings. Bindings can be introduced by various ways. In this example you can see these ways: – As function argument – As normal let binding – As destructive let binding (let [t, u]) – As binding in match (Some(v) =>)

And we need to came up with methods for encoding all these using de Bruijn indices! So de Bruijn indices fail to satisfy this criteria. But naive approach and unique identifiers do satisfy it!

Next: ability to perform trivial copy of terms without restrictions. Programming languages, such as Haskell and Agda, have a feature: they allow us to copy values very easily. Implicit copying of values is so easy that we did not even think about it. Yet unrestricted copy of terms ruin one of methods, namely unique identifiers. Because if we copy terms with lambdas inside, then our unique identifiers become not so unique anymore. You may say: “So what? What problems this will cause?” Let me describe. Consider this term:

(λ y. y y) (λ x. λ z. x z)

Let's assume we use unique IDs. We will show these unique IDs in square brackets:

(λ y[2]. y[2] y[2]) (λ x[1]. λ z[3]. x[1] z[3])

Now let's do one beta reduction. Let's assume that our beta reduction algorithm is buggy: it blindly performs naive copy of terms. In this case we will get this:

(λ x[1]. λ z[3]. x[1] z[3]) (λ x[1]. λ z[3]. x[1] z[3])

As you can see our unique IDs are not so unique anymore: we have two different x[1] variables. But what is wrong with this? Well, let's continue our beta reductions to understand where the problem lies. So, let's do another beta reduction:

λ z[3]. (λ x[1]. λ z[3]. x[1] z*[3]) z[3]

Now, ask yourself: what means that z*[3]? (I marked it with * to easily refer to it.) It is located inside two lambda expressions, which both bind z[3]. Oops.

So, this means that if we use unique IDs, then we cannot copy our term using trivial copy (which is implicit in Haskell, Agda, etc). Instead we should use “smart” user-defined copy. I think this is one of the reasons why unique IDs are not popular among functional programmers. Yet there is no any problems in Rust, because Rust has support for affine types! Affine types are types, which are not copied implicitly! So, to implement this in Rust, we simply make our term non-Copy type and implement user-defined clone.

Update 16 May 2024. If we use de Bruijn indices, we have some problems with copy, too: if a term is located inside a bigger term, then we cannot simply copy it, we sometimes need to do openT, as discussed above. Yet, if this is a standalone term, then we can copy it without problems, as opposed to unique IDs method.

Next criteria: no need for external method for generating IDs. “Unique IDs” method fails to satisfy this criteria: you need some method for generating IDs. I. e. counter or random number generator. In both cases you cannot generate IDs from pure computation in pure functional language, such as Haskell or Agda. You need to live inside of some kind of random number generator monad. I think this is another reason why functional programmers don't like unique IDs. Yet this is not problem in imperative languages, such as Rust.

Next criteria: basic operations have no more than linear complexity. As I mentioned above, de Bruijn indices fail to satisfy this criteria: if want to modify (or even inspect) subterm, located deeply inside another term, you need to make O(NM) operations.

Now let me show you my comparison table.

“LoNa” – locally nameless de Bruijn indices “Unique” – unique IDs

LoNa Unique
ability to perform capture-avoiding substitution for lambda-terms in more or less sane way Yes Yes
any variable is always encoded same No Yes
no need to came up with artificial numbering of variables No Yes
ability to perform trivial copy of terms without restrictions Yes No
no need for external method for generating IDs Yes No
basic operations have no more than linear complexity No Yes

Okay, what we should choose? Well, as you probably guessed, I strongly believe one should always choose unique IDs, if they are creating practical program (compiler or interpreter, etc). No matter what is your implementation language (Rust, Haskell, Agda, etc), you should always choose unique IDs for practical programs. (Yes, unique IDs probably are not so good if you are developing metatheory.)

Why? Because unique IDs are easy to understand AND simple to implement AND fast. I described above, why I think so.

Unique IDs allow you to easily move subterm inside of larger term without need to re-encode it. Unique IDs allow you to inspect or modify subterm inside of larger term without need to repeatedly call openT (which is both annoying AND very slow, O(NM) slow). And unique IDs are easier to understand for humans than these strange de Bruijn indices.

Yes, in languages such as Haskell and Agda you need to be careful with copies, and you need to be inside of random number generator monad (or counter monad). So what? I think this is minor point. And you don't have these problems with monads and copies, if your implementation language is Rust.

Update 16 May 2024: Jesper's answer: https://agda.club/notice/AhvAH3SYkaU7vboRoO

This is my first post in this blog. Subscribe to this blog using button in the bottom of https://safinaskar.writeas.com/ or at Mastodon ( https://types.pl/@safinaskar ). I plan to write about formal methods, Rust and Linux.

You can reach me via email safinaskar@gmail.com

Discuss at: https://lobste.rs/s/lbzg6m/this_is_why_you_should_never_use_de_bruijn https://types.pl/@safinaskar/112442405322545975 (Mastodon) https://www.reddit.com/r/rust/comments/1cs8iq3/this_is_why_you_should_never_use_de_bruijn/

Enter your email to subscribe to updates.