This is why you should never use parser combinators and PEG
Let me tell you why you should (nearly) never use PEG (parsing expression grammars). Nearly everything I will say applies to parser combinators (parsec in Haskell, nom in Rust), too.
So, don't use PEG. Use CFGs (context-free grammars) instead. They are more natural. I feel that CFGs more naturally represent how we think. Thus when you have some language in your head and you try to write it down as a grammar, there is more chances that you do this successfully if you try to write CFG grammar as opposed to PEG grammar.
(All this doesn't apply to situations where you are trying to parse some language, whose specification explicitly says that the language is based on PEG formalism as opposed to CFG formalism. One good example is Python. Python spec says Python is PEG, so you should use PEG parser for parsing Python.)
CFG and PEG are incompatible. I mean that same grammar can have different meaning depending on how you interpret it: as a CFG or as a PEG. Thus decision on whether to use CFG or PEG should happen early. I. e. when you have a grammar in you head and you try to write it down, then before you wrote even single letter to a paper (or to a text file), you should decide whether you are going to write CFG or PEG grammar. (Update 5 June 2024: this is exaggeration: of course, you can change grammar formalism later, I just meant that this is not always easy.)
Okay, you may say: “Please, give me proofs that CFG is more natural than PEG”. Okay, here we go: https://blog.reverberate.org/2013/09/ll-and-lr-in-context-why-parsing-tools.html . This article properly summarizes what are the problems with PEG I'm worrying about. Especially this part: “which means that every PEG rule could be hiding a 'conceptual' ambiguity”.
And that problem actually happened to me in practice!
I tried to parse a language described by this grammar:
t0 = t1
t0 = "!!" id "::" t0 "." t0
t1 = t2
t1 = t2 "==>" t1
t2 = t3
t3 = t4
t3 = t4 "::" t0
t3 = "%" id "::" t0 "." t3
t4 = t999
t999 = t1000
t999 = t999 t1000
t1000 = id
t1000 = "(" t0 ")"
(If you are curious, this is grammar for my pure type system checker)
To do so, I wrote a Haskell program using Parsec. I blindly converted grammar above to Haskell code. My code looked like this (you don't need to understand Haskell to see my point):
data Term = Pi String Term Term | ...
t0 = t1 <|> do {
reservedOp "!!";
a <- id;
reservedOp "::";
b <- t0;
reservedOp ".";
c <- t0;
return (Pi a b c);
}
...
In other words, I simply converted grammar above to code line-by-line. The code worked, all was okay. I wrote many files in my language. But then I accidentally discovered my grammar was ambiguous! Namely, text A :: B ==> C
can be parsed two ways: (A :: B) ==> C
and A :: (B ==> C)
. And parsec-based parser somehow chooses one of these parse trees!
So, now I have a lot of files written in this language, but I have no way to ensure they are parsed correctly, i. e. I have no way to ensure my parser chooses parse tree I meant!
Why this happened? This happened, because (as turned out) when I wrote that grammar, I meant it is CFG. Simply because my brain thinks in terms of CFG. But parsec is based on different formalism, which is closer to PEG. Thus CFG grammar was interpreted as something PEG-like, and thus parsec didn't understand what I meant.
And if you agree with me that grammar above in intuitive sense is ambiguous, i. e. if you agree that A :: B ==> C
allows two parse trees, then your brain thinks in terms of CFG, too! CFGs are simply more natural.
CFG grammar above is ambiguous, but when interpreted as PEG it is unambiguous, because PEG grammars are always unambiguous. Ambiguity is simply hidden. Josh Haberman writes about this in the article above.
All this gave me psychological trauma (figuratively speaking). I will never use Parsec (and PEG parsers) anymore.
Also, even if you are not convinced, you still should use CFG if you try to parse some language using its specification, and that specification says that the language is CFG. You cannot use CFG grammar as a PEG grammar, these formalisms are incompatible.
Moreover, if some specification simply says “grammar” or “BNF”, then in 99% cases this means it is CFG. (For example, I was unable to find in Rust reference whether presented grammar is CFG or PEG. They simply say “grammar”. But I'm sure they meant CFG, because it is simply a convention to say in specifications “grammar”, when you mean “CFG”.)
Also, CFG grammars themselves are split further into various types: LL grammars, LR grammars, etc. But all these are simply subsets of larger set of CFG grammars. And all they are compatible in the following sense: if some grammar is, say, LL grammar and LR grammar in the same time, then that grammar means the same thing when interpreted as LL and when interpreted as LR. I. e. if you enter that grammar to LL parser and to LR parser, then these two parsers will behave same way. Different parsing libraries, which target different CFG subsets (LL, LR, etc), differ in their algorithm, in scope of supported grammars, but never in semantics of particular grammar (assuming that the grammar itself is supported in both libraries).
So, if you program in Rust, don't use nom
, combine
and other parser combinator libraries. And don't use PEG parsers. Use CFG-based parsers, such as lalrpop
. (Unfortunately, lalrpop
is restricted to LR(1) grammars [this is subset of CFG]. And I was unable to find library, which combines lalrpop
benefits with support for arbitrary CFGs. It is possible some day I will write such library. If I do, then I will announce it here, so, please, subscribe!)
If you program in Haskell, don't use parsec
and its zillions of clones (gigaparsec
, attoparsec
, megaparsec
, trifecta
) and most of other combinator parser libraries. Use CFG-based libraries and parser generators, such as Earley
and happy
. But note that it is possible to write parser combinator library, which is based on unordered choice, i. e. on CFGs. For example, Earley
can be considered such a library. Also, these two parser combinator libraries are CFG-based, too: https://hackage.haskell.org/package/base-4.15.0.0/docs/Text-ParserCombinators-ReadP.html#t:ReadS , https://hackage.haskell.org/package/transformers-0.5.5.0/docs/Control-Monad-Trans-Class.html#g:5 . So, parser combinators are not necessary evil.
Unfortunately, Earley
crashes on some particular tricky grammar ( https://github.com/ollef/Earley/issues/54 ). Also, there are other things I don't like in Earley
and happy
. It is possible I will write some better library in the future. If I do, I will announce it here, so, please, subscribe! Also, see my Haskell work below.
That said, CFG is good for languages, which are “structured enough”, “well enough designed”. If you want to parse some ad hoc language, for example, some ad hoc logs, then PEG and parser combinators may be a better solution. CFG is good, when you know the language, which you want to parse, beforehand. Especially if it has specification. CFG parsing is good for “well designed languages”, such as Pascal, Javascript, Rust. It is bad for less structured languages, such as Bash, TeX, C, C++.
I'm biased, because I usually parse languages, which are either designed by me, either are “structured enough” with well written specifications.
Also, if this post gets popular, I want to take chance to advertise my own Haskell parsing libraries. Here we go:
- Library for checking whether given CFG grammar is ambiguous. This is impossible to do in general, so this library simply tries all combinations up to provided limit. I was unable to find Haskell library for same purpose, i. e. my library is unique. Moreover, I found one such library only across all programming languages! Published on Hackage. https://mail.haskell.org/pipermail/haskell-cafe/2021-May/134004.html
- Parser combinator library similar to
Earley
and based onEarley
, but which allows to wrap some monad. For example, if you want to mix type checking code with parsing code. My library is based on arrows (because it is impossible to do any other way). Of course, my library is based on unordered choice, i. e. on CFGs. My method seems to be unique. Published on Hackage. https://mail.haskell.org/pipermail/haskell-cafe/2021-July/134205.html - Parser and lexer for reversible parsing with guaranteed reversibility. Seems to be totally unique across all programming languages. Of course, CFG-based. Unfortunately, nobody answered to my email, so I don't have motivation to actually publish the library to Hackage. But if you need it, please, write me, and I will publish it. https://mail.haskell.org/pipermail/haskell-cafe/2021-July/134217.html
Here are some articles about parsing and CFGs. Their authors are better writers than me. – Already mentioned https://blog.reverberate.org/2013/09/ll-and-lr-in-context-why-parsing-tools.html – https://tratt.net/laurie/blog/2020/which_parsing_approach.html
Subscribe to this blog using button in the bottom of https://safinaskar.writeas.com/ or at Mastodon ( https://types.pl/@safinaskar ). I plan to write about formal methods, Rust and Linux.
You can reach me via email safinaskar@gmail.com
Discuss at: – https://lobste.rs/s/nybhsl/this_is_why_you_should_never_use_parser – https://types.pl/@safinaskar/112553410605946965 (Mastodon) – https://www.reddit.com/r/rust/comments/1d77mpa/this_is_why_you_should_never_use_parser/