Monadic and Applicative Parsing
It is sometimes the case that we have to operate on data that is not already in the form of a richly typed Haskell value, but is instead stored in a file or transmitted across the network in some serialised format - usually in the form of a sequence of bytes or characters. Although such situations are always regrettable, in this chapter we shall see a flexible technique for making sense out of unstructured data: parser combinators.
Parsing Integers Robustly
To start with, consider turning a String
of digit characters into
the corresponding Integer
. That is, we wish to construct the
following function:
readInteger :: String -> Integer
The function ord :: Char -> Int
from Data.Char
can convert a
character into its corresponding numeric code. Exploiting the fact
that the integers have consecutive codes, we can write a function for
converting a digit character into its corresponding Integer
. Note
that we have to convert the Int
produced by ord
into an Integer
:
import Data.Char (ord)
charInteger :: Char -> Integer
charInteger c = toInteger $ ord c - ord '0'
Exploiting the property that the numeric characters are consecutively
encoded, we can implement readInt
with a straightforward recursive
loop over the characters of the string, from right to left:
readInteger :: String -> Integer
readInteger s = loop 1 $ reverse s
where
loop _ [] = 0
loop w (c : cs) = charInteger c * w + loop (w * 10) cs
Example use:
> readInteger "123"
123
However, see what happens if we pass in invalid input:
λ> readInteger "xyz"
8004
Silently producing garbage on invalid input is usually considered poor
engineering. Instead, our function should return a Maybe
type,
indicating invalid input by returning Nothing
. This can be done by
using isDigit
from Data.Char
to check whether each character is a
proper digit:
readIntegerMaybe :: String -> Maybe Integer
readIntegerMaybe s = loop 1 $ reverse s
where
loop _ [] = Just 0
loop w (c : cs)
| isDigit c = do
x <- loop (w * 10) cs
pure $ charInteger c * w + x
| otherwise =
Nothing
Note how we are using the fact that Maybe
is a monad to avoid
explicitly checking whether the recursive call to loop
fails.
We now obtain the results we would expect:
> readIntegerMaybe "123"
Just 123
> readIntegerMaybe "xyz"
Nothing
Composing Parsers
Now suppose we extend the problem: we must now parse an integer,
followed by a space, followed by another integer. We can of course
write a function that does this from scratch, but it would be better
if we could reuse our function that parses a single integer.
Unfortunately, this is not possible with readIntegerMaybe
, as it
requires that the input string consists solely of digits. We could
split the string by spaces in advance, but this is rather ad-hoc.
Instead, let us construct a function that reads a leading integer
from a string, and then returns a remainder string.
readLeadingInteger :: String -> Maybe (Integer, String)
readLeadingInteger s =
case span isDigit s of
("", _) -> Nothing
(digits, s') -> Just (loop 1 $ reverse digits, s')
where
loop _ [] = 0
loop w (c : cs) =
charInteger c * w + loop (w * 10) cs
The span
function breaks a string into two parts: the prefix of
characters that satisfy the predicate (here isDigit
), and a
remainder string with the prefix removed. If the first part is
empty, we return Nothing
, as an integer requires at least a single
digit. Otherwise we convert the digits into an Integer
and return it
along with the remaining string.
> readLeadingInteger "123"
Just (123,"")
> readLeadingInteger "123 456"
Just (123," 456")
Let us also write two more helper functions: one that reads a single leading space from a string (and returns the remainder), and one that asserts that the string is empty.
readSpace :: String -> Maybe (Char, String)
readSpace (' ' : s) = Just (' ', s)
readSpace _ = Nothing
readEOF :: String -> Maybe ((), String)
readEOF "" = Just ((), "")
readEOF _ = Nothing
Note readLeadingInteger
, readSpace
, and readEOF
all have types
of the same form: String -> Maybe (a, String)
for some a
. This
strongly suggests that there is some kind of commonality that we can
exploit to construct ways of composing them. But first, let us try to
compose them manually, to solve the problem of reading two
space-separated integers:
readTwoIntegers :: String -> Maybe ((Integer, Integer), String)
readTwoIntegers s =
case readLeadingInteger s of
Nothing -> Nothing
Just (x, s') -> case readSpace s' of
Nothing -> Nothing
Just (_, s'') ->
case readLeadingInteger s'' of
Nothing -> Nothing
Just (y, s''') -> pure ((x, y), s''')
> readTwoIntegers "123 456"
Just ((123,456),"")
While it works, it is quite verbose with all that explicit
pattern-matching of Nothing
/Just
. We can exploit the fact that
Maybe
is a monad to write it a bit more concisely:
readTwoIntegers2 :: String -> Maybe ((Integer, Integer), String)
readTwoIntegers2 s = do
(x, s') <- readLeadingInteger s
(_, s'') <- readSpace s'
(y, s''') <- readLeadingInteger s''
Just ((x, y), s''')
However, it is still quite annoying that we manually have to thread
the String
around. It also means we can screw up, and use the wrong
one, since they all have the same type. It would be better if this
kind of book-keeping was done automatically behind the scenes. And
indeed that is possible, by creating a parser monad, whose structure
is essentially the same as the functions above, and which hides away
the book-keeping behind the monadic interface.
A simple parser monad
As stated above, all our parser functions are of the form String -> Maybe (a, String)
. We will make that the definition of our parser
type:
newtype Parser a = Parser {runParser :: String -> Maybe (a, String)}
Now runParser :: Parser a -> String -> Maybe (a, String)
is the
function for running a parser on some input.
This type can be made a monad. Its definition looks like this:
instance Monad Parser where
f >>= g = Parser $ \s ->
case runParser f s of
Nothing -> Nothing
Just (x, s') -> runParser (g x) s'
The idea is the following. We have f :: Parser a
and g :: a -> Parser b
. We can run the parser f
to get either a parse error or a
value x :: a
and a remainder string s'
. We can then pass x
to
g
in order to obtain a Parser b
, which we can then run on s'.
It
is quite similar to a state monad combined the Maybe
monad.
We also need to define Applicative
and Functor
instances. As
always, the only case that is not purely mechanical is pure
, which
is used for a "parser" that consumes no input and always succeeds.
import Control.Monad (ap)
instance Functor Parser where
fmap f x = do
x' <- x
pure $ f x'
instance Applicative Parser where
(<*>) = ap
pure x = Parser $ \s -> Just (x, s)
Now that we have defined the fundamental machinery, we can define some very simple primitive parsers. We start with one that parses a single character that satisfies some predicate:
satisfy :: (Char -> Bool) -> Parser Char
satisfy p = Parser $ \s -> case s of
c : cs ->
if p c
then Just (c, cs)
else Nothing
_ -> Nothing
> runParser (satisfy isDigit) "123"
Just ('1',"23")
And a parser that succeeds only if there is no more input left:
eof :: Parser ()
eof = Parser $ \s ->
case s of
"" -> Just ((), "")
_ -> Nothing
While eof
may seem a bit odd, we often use it as the very last step
of parsing a complete file or data format, as usually we do not want
to allow trailing garbage.
A parser combinator is a function on parsers. As our first parser combinator, we will construct one that accepts a list of parsers, and tries them all in turn. This is useful in the very common case where multiple inputs can be valid.
choice :: [Parser a] -> Parser a
choice [] = Parser $ \_ -> Nothing
choice (p : ps) = Parser $ \s ->
case runParser p s of
Nothing -> runParser (choice ps) s
Just (x, s') -> Just (x, s')
The parsers and combinators above directly manipulate the parser state
and use the Parser
constructor. We say they are primitive parsers.
However, the vast majority of parsers we will write will not be
primitive, but will instead be built in terms of the primitives, using
the monadic interface. For example, we can define a derived parser
that parses an expected string and returns it:
import Control.Monad (void)
chunk :: String -> Parser String
chunk [] = pure ""
chunk (c : cs) = do
void $ satisfy (== c)
void $ chunk cs
pure $ c : cs
> runParser (choice [chunk "foo", chunk "bar"]) "foo"
Just ("foo","")
Using Parser Combinators
Let us try to rewrite our integer parser using the Parser
type.
First, we define a function for parsing a single digit, including
decoding.
parseDigit :: Parser Integer
parseDigit = do
c <- satisfy isDigit
pure $ toInteger $ ord c - ord '0'
Here is how we would use it to write a function that parses two digits:
parseTwoDigits :: Parser (Integer, Integer)
parseTwoDigits = do
x <- parseDigit
y <- parseDigit
pure (x, y)
We also construct a combinator that repeatedly applies a given parser until it fails:
many :: Parser a -> Parser [a]
many p =
choice
[ do
x <- p
xs <- many p
pure $ x : xs,
pure []
]
> runParser (many parseDigit) "123"
Just ([1,2,3],"")
(Ponder what happens if we flipped the elements in the list we pass to
choice
in the definition of many
.)
We have to be careful when using many
: if it is given a parser that
can succeed without consuming any input (such as eof
), it will loop
forever.
Also, many
is not quite what we need, as it will succeed even if the
given parser succeeds zero times. The variant some
requires that the
given parser succeeds at least once:
some :: Parser a -> Parser [a]
some p = do
x <- p
xs <- many p
pure $ x : xs
Now we can write our function for parsing integers:
parseInteger = do
digits <- some parseDigit
pure $ loop 1 $ reverse digits
where
loop _ [] = 0
loop w (d : ds) =
d * w + loop (w * 10) ds
Or more concisely:
parseInteger :: Parser Integer
parseInteger = loop 1 . reverse <$> some parseDigit
where
loop _ [] = 0
loop w (d : ds) =
d * w + loop (w * 10) ds
And finally we can use it to build a parser for two space-separated integers:
parseTwoIntegers :: Parser (Integer, Integer)
parseTwoIntegers = do
x <- parseInteger
_ <- satisfy (== ' ')
y <- parseInteger
pure (x, y)
See how easy it is to understand what input it parses (once you
understand parser combinators, mind you), compared to the original
readTwoIntegers
function, which was littered with book-keeping.
> runParser parseTwoIntegers "123 456"
Just ((123,456),"")
We would likely also want to use the eof
parser to assert that no
garbage follows the second integer.
Tokenisation
Syntaxes for programming languages and data formats are usually structured in the form of rules for how to combine tokens (sometimes also called lexemes), along with rules for how the tokens are formed.
Following the example above, we could consider integers (in the form of a nonzero number of decimal digits) to be a token. In many languages, tokens can be separated by any number of whitespace characters. In the traditional view of parsing, the syntactic analysis is preceded by a lexical analysis that splits the input into tokens, using for example regular expressions. With parser combinators, lexical and syntactic analysis is done simultaneously. Yet to correctly handle the lexical rules of a language, we must exercise significant discipline.
Not all languages have straightforward lexical rules. Some (such as Python or Haskell) use indentation to structure their syntactical rules, others have context-sensitive tokenisation, and others allow extensible or user-defined syntaxes. Parser combinators can cope with all of these, but for simplicity we stick to more well-behaved syntaxes in these notes.
Whitespace
One core concern is how to handle whitespace. The most common convention is that each parser must consume any trailing whitespace, but can assume that there is no leading whitespace. If we systematically follow this rule, then we can ensure that we never fail to handle whitespace. The exception is the top parser, where we cannot assume that the initial input does not have leading whitespace. Still, systematically doing anything without mistakes is difficult for humans. In the following, we define some simple (but rigid) rules that are easy to follow in our code.
First we write a parser that skips any number of whitespace characters.
import Data.Char (isSpace)
space :: Parser ()
space = do
_ <- many $ satisfy isSpace
pure ()
Then we write a parser combinator that takes any parser and consumes subsequent whitespace:
lexeme :: Parser a -> Parser a
lexeme p = do x <- p
space
pure x
Now we institute a rule: whenever we write a parser that operates at the lexical level, it must be of the form
lFoo = lexeme $ ...
and no other parser than those of this form is allowed to use
lexeme
or space
directly. The l
prefix is a mnemonic for
lexical - similarly we will begin prefixing our syntax-level parsers
with p
.
For example, this would now be a parser for decimal integers that consumes trailing whitespace:
lDecimal :: Parser Integer
lDecimal = lexeme $ loop 1 . reverse <$> some parseDigit
where
loop _ [] = 0
loop w (d : ds) =
d * w + loop (w * 10) ds
Note that we will use parseDigit
(which does not handle whitespace)
as a low level building block. We must only use these from within
l
-prefixed functions.
Now we can easily a function that parses any number of whitespace-separated decimal numbers:
pDecimals :: Parser [Integer]
pDecimals = many lDecimal
> runParser pDecimals "123 456 789 "
Just ([123,456,789],"")
However, we will fail to parse a string with leading whitespace:
> runParser pDecimals " 123"
Just ([]," 123")
The solution is to explicitly skip leading whitespace:
> runParser (space >> pDecimals) " 123"
Just ([123],"")
We do this only in the top level parser - usually in the one that is
passed immediately to the function that executed the Parser
monad
itself.
Longest match
When lexing the string "123"
, we see it as a single token 123
rather than three tokens 1
, 2
, 3
? The reason for this is that
most grammars follow the longest match (or maximum munch) rule:
each tokens extend as far as possible. This principle has actually
been baked into the definition of the many
combinator above, as it
tries the recursive case before the terminal case. If we instead
defined many
like this:
many :: Parser a -> Parser [a]
many p =
choice
[ pure [],
do
x <- p
xs <- many p
pure $ x : xs
]
Then we would observe the following behaviour (because lDecimal
uses
some
, which uses many
):
> runParser lDecimal "123"
Just (1,"23")
Thus, simply by defining many
the way we originally did, we obtain
conventional behaviour.
A larger example
Let us study a larger example of parsing, including proper tokenisation, handling of keywords, and transforming a given grammar to make it amenable to parsing.
Consider parsing a language of Boolean expressions, represented by the following Haskell datatype.
data BExp
= Lit Bool
| Var String
| Not BExp
| And BExp BExp
| Or BExp BExp
deriving (Eq, Show)
The concrete syntax will be strings such as "x"
, "not x"
, "x and true"
, "true and y or z
". We write down a grammar in
EBNF:
var ::= ? one or more alphabetic characters ? ;
BExp ::= "true"
| "false"
| "not" BExp
| BExp "and" BExp
| BExp "or" BExp ;
The words enclosed in double quotes are terminals (tokens). The
lowercase var
is also a token, but is defined informally using an
EBNF comment. We adopt the convention that tokens can be separated by
whitespace, and that we follow the longest match rule. This is
strictly speaking an abuse of convention, as the handling of
whitespace ought also be explicit in the grammar, but it is common to
leave it out for simplicity.
Note that the grammar does not exactly match the Haskell abstract
syntax tree (AST) definition - in particular, the "true"
and
"false"
cases are combined into a single Lit
constructor. This is
quite common, and we will see many cases where superfluous details of
the form of the concrete syntax are simplified away in the AST.
Our first attempt at parsing Boolean expressions looks like this:
import Data.Char (isAlpha)
lVar :: Parser String
lVar = lexeme $ some $ satisfy isAlpha
lTrue :: Parser ()
lTrue = lexeme $ void $ chunk "true"
lFalse :: Parser ()
lFalse = lexeme $ void $ chunk "false"
lAnd :: Parser ()
lAnd = lexeme $ void $ chunk "and"
lOr :: Parser ()
lOr = lexeme $ void $ chunk "or"
pBool :: Parser Bool
pBool =
choice
[ do
chunk "true"
pure True,
do
chunk "false"
pure False,
do
chunk "not"
e <- pBExp
pure $ Not e
]
pBExp :: Parser BExp
pBExp =
choice
[ Lit <$> pBool,
Var <$> lVar,
do
x <- pBExp
lAnd
y <- pBExp
pure $ And x y,
do
x <- pBExp
lOr
y <- pBExp
pure $ Or x y
]
Note now the structure of the code fairly closely matches the structure of the grammar. This is is always something we seek to aspire to.
Keywords
Unfortunately, despite looking pretty, it fails to work properly for any but trivial cases:
> runParser pBExp "x"
Just (Var "x","")
> runParser pBExp "true"
Just (Lit True,"")
> runParser pBExp "not x"
Just (Var "not","x")
The third case doesn't work. How can that be? The reason it goes wrong can be seen by trying to parse a variable name:
> runParser lVar "not"
Just ("not","")
Words such as not
, and
, or
, true
, and false
are also valid
variable names according to our parser. While we forgot to state it
explicitly in the grammar, our intention is for these words to be
keywords (or reserved words), which are not valid as variables. So
now we add another side condition to the grammar: a var
must not be
one of not
, and
, or
, true
, and false
. How do we implement
this in the parser? After all, in lVar
we cannot know whether we
will end up reading a keyword until after we are done. We actually
need to add a new primitive operation to our parser: explicit
failure. We do this by implementing the MonadFail
type class, which
requires a single method, fail :: String -> Parser a
:
instance MonadFail Parser where
fail _ = Parser $ \_ -> Nothing
The argument to fail
is for an error message, which is not supported
by our parser definition, so we just throw it away. The result of
fail
is a parser that always fails. We can use this to fix our
definition of lVar
to explicitly not allow keywords:
keywords :: [String]
keywords = ["not", "true", "false", "and", "or"]
lVar :: Parser String
lVar = lexeme $ do
v <- some $ satisfy isAlpha
if v `elem` keywords
then fail "keyword"
else pure v
This shows some of the strength (and danger) of monadic parsing: we can intermingle arbitrary Haskell-level decision making with the purely syntactical analysis. This allows parser combinators to support heinously context-sensitive grammars when necessary, but as mentioned above, we will stick to more well-behaved ones in this course.
Now we get the desired behaviour:
> runParser pBExp "not x"
Just (Not (Var "x"),"")
But another case still behaves oddly:
> runParser pBExp "truexx"
Just (Lit True,"xx")
We don't really want to parse "truexx"
as the Boolean literal true
followed by some garbage - this is in violation of the longest match
rule. We can fix this by requiring that a keyword is not followed by
an alphabetic character. This does require us to add a new primitive
parser to our parsing library (but this is the last one):
notFollowedBy :: Parser a -> Parser ()
notFollowedBy (Parser f) = Parser $ \s ->
case f s of
Nothing -> Just ((), s)
_ -> Nothing
The notFollowedBy
combinator succeeds only if the provided parser
fails (and if so, consumes no input). We can then use this to define a
combinator for parsing keywords:
lKeyword :: String -> Parser ()
lKeyword s = lexeme $ do
void $ chunk s
notFollowedBy $ satisfy isAlpha
Using lKeyword
, there is no need for dedicated functions for parsing
the individual keywords, although you can still use them if you like.
I prefer using lKeyword
directly:
lKeyword :: String -> Parser ()
lKeyword s = lexeme $ do
void $ chunk s
notFollowedBy $ satisfy isAlpha
pBool :: Parser Bool
pBool =
choice
[ lKeyword "true" >> pure True,
lKeyword "false" >> pure False
]
pBExp :: Parser BExp
pBExp =
choice
[ Lit <$> pBool,
Var <$> lVar,
do
lKeyword "not"
Not <$> pBExp,
do
x <- pBExp
lKeyword "and"
y <- pBExp
pure $ And x y,
do
x <- pBExp
lKeyword "or"
y <- pBExp
pure $ Or x y
]
> runParser pBExp "truexx"
Just (Var "truexx","")
Left recursion
We have now implemented tokenisation properly. Unfortunately, our parser still does not work:
> runParser pBExp "x and y"
Just (Var "x","and y")
The reason is the choice
in pBExp
. Our definition of choice
takes the first parser that succeeds, which is the one that produces
Var
, and so it never proceeds to the one for And
.
There are ways of implementing parser combinators
such that the ordering does not matter, which is largely by using a
list instead of a Maybe
in the definition of Parser
. However, this
will not solve the nontermination problem discussed below.
We can try to fix our parser by changing the order of parsers provided
to choice
:
pBExp :: Parser BExp
pBExp =
choice
[ do
x <- pBExp
lKeyword "and"
y <- pBExp
pure $ And x y,
do
x <- pBExp
lKeyword "or"
y <- pBExp
pure $ Or x y,
Lit <$> pBool,
Var <$> lVar,
do
lKeyword "not"
Not <$> pBExp
]
Unfortunately, now the parser goes into an infinite loop:
> runParser pBExp "x"
... waiting for a long time ...
The operational reason is that underneath all the monadic syntax
sugar, our parsers are just recursive Haskell functions. Looking at
pBExp
, we see that the very first thing it does is recursively
invoke pBExp
. If we look at the EBNF grammar for Boolean
expressions, we also see that some of the production rules for BExp
start with BExp
. In the nomenclature of parser theory, this is
called left recursion. The style of Parser combinator library we are
studying here is equivalent to so-called recursive descent parsers
with arbitrary lookahead, which are known to not support left
recursion. The solution to this problem is to rewrite the grammar to
eliminate left-recursion. If you need a refresher on how to do this,
see Grammars and parsing with Haskell using parser
combinators,
but the idea is to split the non-recursive cases into a separate
nonterminal (often called Atom
) Transforming the grammar (note that
we do not modify the Haskell AST definition) provides us with the
following:
var ::= ? one or more alphabetic characters ? ;
Atom ::= "true"
| "false"
| var
| "not" BExp ;
BExp' ::= "and" Atom BExp'
| "or" Atom BExp'
| ;
BExp ::= Atom BExp' ;
Note that we have decided that the and
operator is
left-associative - meaning that "x and y and z"
is parsed as "(x and y) and x"
(or would be if our syntax supported parentheses).
A grammar without left-recursion can be implemented fairly
straightforwardly. The idea is that parsing a BExp
consists of
initially parsing a BExp2
, followed by a chain of zero or more
and
/or
clauses.
pAtom :: Parser BExp
pAtom =
choice
[ Lit <$> pBool,
Var <$> lVar,
do
lKeyword "not"
Not <$> pBExp
]
pBExp :: Parser BExp
pBExp = do
x <- pAtom
chain x
where
chain x =
choice
[ do
lKeyword "and"
y <- pBExp
chain $ And x y,
do
lKeyword "or"
y <- pBExp
chain $ Or x y,
pure x
]
> runParser pBExp "x and y"
Just (And (Var "x") (Var "y"),"")
> runParser pBExp "x and y and z"
Just (And (And (Var "x") (Var "y")) (Var "z"),"")
Usually when constructing a parser, we do not expose the raw parser
functions (such as pBExp
), but instead define a convenient wrapper
function, such as the following:
parseBExp :: String -> Maybe BExp
parseBExp s = fst <$> runParser p s
where
p = do
space
x <- pBExp
eof
pure x
Note how this "top level parser" also takes care to skip leading whitespace (in contrast to lexer functions that skip trailing whitespace as their principle), and asserts that no input must remain unconsumed.
Operator precedence
We still have a final problem we must address. Consider parsing the
input "x or y and z"
:
> parseBExp "x or y and z"
Just (And (Or (Var "x") (Var "y")) (Var "z"))
Whether this is correct or not of course depends on how the grammar is
specified, but the usual convention in logical formulae is that
conjunction (and
) binds tighter than disjunction (or
). This is
similar to how mathematical notation assigns higher priority to
multiplication than addition. Generally, a grammar specification will
come with a set of side conditions specifyint an operator priority
(and associativity). The way to handle operator priority in a parser
build with combinators is to perform yet another a grammar
transformation. The idea is to split the grammar rules into multiple
levels, with one level per priority. For our Boolean expressions,
the transformed grammar looks like this:
var ::= ? one or more alphabetic characters ? ;
Atom ::= "true"
| "false"
| var
| "not" BExp ;
BExp1' ::= "and" Atom BExp1'
| ;
BExp1 ::= Atom BExp1' ;
BExp0' ::= "or" BExp1 BExp0'
| ;
BExp0 ::= BExp1 BExp0' ;
BExp ::= BExp0 ;
And performing the corresponding transformation on our parser (or simply rewriting it from scratch, given this new grammar) produces this:
pBExp1 :: Parser BExp
pBExp1 = do
x <- pAtom
chain x
where
chain x =
choice
[ do
lKeyword "and"
y <- pAtom
chain $ And x y,
pure x
]
pBExp0 :: Parser BExp
pBExp0 = do
x <- pBExp1
chain x
where
chain x =
choice
[ do
lKeyword "or"
y <- pBExp1
chain $ Or x y,
pure x
]
pBExp :: Parser BExp
pBExp = pBExp0
And now we observe the parser result that we desire:
> parseBExp "x or y and z"
Just (Or (Var "x") (And (Var "y") (Var "z")))
Megaparsec
While the parser library implemented above is fully operational, it has serious flaws that leave it unsuitable for production use:
-
It is rather inefficient. This is partially because of the use of
String
as the fundamental type, but mostly because of howchoice
is implemented, which has to keep track of the original input essentially forever, even if backtracking will never become relevant. -
It produces no error messages, instead simply returning
Nothing
.
While point 1 does not matter much for AP, point 2 makes it very difficult to debug your parsers - which is certainly going to have an impact. For the exercises and assignments, you will therefore be using a state-of-the-art industrial-strength parser combinator library: Megaparsec.
The downside of using an industrial parser library such as Megaparsec is that is is complicated. It has a rich programming interface and more complicated types than what we need in AP. However, a subset of Megaparsec's interface is identical to the interface presented above (this is not a coincidence), and this is what we will be using in AP.
The facilities we will need are from the
Text.Megaparsec
module. Megaparsec is quite well documented, so it may be worth your
time to skim the documentation, although the information provided here
is intended to be sufficient for our needs. In Megaparsec, parsers are
monadic functions in the Parsec e s
monad, which is itself a
specialisation of the ParsecT
monad transformer - a concept that
lies outside of the AP curriculum. The point of this flexibilty is to
be generic in the underlying stream type (e.g. the kind of "string"
we are parsing), the form that errors can take, and so on. We do not
need such flexibility, and the first thing we need to do when using
Megaparsec is to define the following type synonym:
import Data.Void (Void)
type Parser = Parsec Void String
This states that our Parser
s will have no special error component,
and the input will be a String
.
To run such a Parser
, we make use of the runParser
function, which
in simplified form has this type:
runParser :: Parser a
-> String
-> String
-> Either (ParseErrorBundle String Void) a
The first String
is the filename of the input, which is used for
error messages. The second String
is the actual input. The result is
either a special error value, or the result of parsing. Note that in
contrast to our runParser
, no remainder string is returned. The
error value can be turned into a human-readable string with the
function errorBundlePretty
.
For example, this is how we would define parseBExp
using Megaparsec.
The rest of the parser code is (for now) completely unchanged:
parseBExp :: FilePath -> String -> Either String BExp
parseBExp fname s = case runParser p fname s of
Left err -> Left $ errorBundlePretty err
Right x -> Right x
where
p = do
space
x <- pBExp
eof
pure x
We are using the FilePath = String
type synonym in the function type
to make it clearer which is the filename and which is the input
string.
It mostly works just as before:
> parseBExp "input" "x and y and z"
Right (And (And (Var "x") (Var "y")) (Var "z"))
But it can now also produce quite nice error messages:
> either putStrLn print $ parseBExp "input" "x and y and"
input:1:12:
|
1 | x and y and
| ^
unexpected end of input
expecting "false", "not", or "true"
> either putStrLn print $ parseBExp "input" "x true"
input:1:3:
|
1 | x true
| ^
unexpected 't'
expecting "and", "or", or end of input
However, some inputs now produce a rather unexpected error:
> either putStrLn print $ parseBExp "input" "truex"
input:1:5:
|
1 | truex
| ^
unexpected 'x'
The reason for this is that Megaparsec, for efficiency reasons, does
not automatically backtrack when a parser fails. Due to the way we
have ordered our choice
in pAtom
, we will initially try to parse
the literal true
with lKeyword
in pBool
, which will read the
input true
, and then fail due to notFollowedBy
. However, the
input remains read, which means Megaparsec's implementation of
choice
won't even try to the other possibilities in choice
. The
way to fix this is to use the try
combinator, which has this type
(specialising to our Parser
monad):
try :: Parser a -> Parser a
A parser try p
behaves like p
, but ensures the the parser state is
unmodified if p
fails. We must use it whenever a parser can fail
after successfully consuming input. In this case, we must use it in
lVar
and lKeyword
:
lVar :: Parser String
lVar = lexeme $ try $ do
v <- some $ satisfy isAlpha
if False && v `elem` keywords
then fail "keyword"
else pure v
lKeyword :: String -> Parser ()
lKeyword s = lexeme $ try $ do
void $ chunk s
notFollowedBy $ satisfy isAlpha
When to use try
is certainly rather un-intuitive at first, and
remains fairly subtle for ever. One possibility is to always use it
in the cases we pass to choice
- this will work, but is inefficient,
as it makes every choice
a potential backtracking point. Most
grammars are designed such that backtracking is only needed for the
lexical functions.
Applicative parsing
As you may remember, all Monad
s are also Applicative
s. For
parsing, we can exploit this to write our parsers in a particularly
concise form called applicative style. The technique inolves to use
the <$>
operator (from Functor
) and the <*>
operator (from
Applicative
) to directly intermix the constructors, that put
together data, with the parsers that produce it. For example, recall
our definition of many
:
many :: Parser a -> Parser [a]
many p =
choice
[ do
x <- p
xs <- many p
pure $ x : xs,
pure []
]
In the first choice
case, we are are essentially running the
computation p
, then many p
, then combining their results with the
list constructor. Using applicative style and the prefix form of the
list instructor (:)
, we can instead write many
as:
many :: Parser a -> Parser [a]
many p =
choice
[ (:) <$> p <*> many ps,
pure []
]
Another example is parseTwoDigits
, which we can rewrite as follows:
parseTwoDigits :: Parser (Integer, Integer)
parseTwoDigits = (,) <$> parseDigit <*> parseDigit
In fact, we can write many monadic computations in applicative style,
but parsing benefits significantly. Two other useful applicative
operators are (<*)
and (*>)
:
(<*) :: Applicative f => f a -> f b -> f a
(*>) :: Applicative f => f a -> f b -> f b
They accept two applicative (or monadic) computations as arguments,
and then return the value of respectively the first or the second
operand, discarding the other. This is quite useful for handling
grammars where syntactical constructs are surrounded by keywords,
which must be parsed, but do not produce any interesting values. For
example, recall our definition of pBool
:
pBool :: Parser Bool
pBool =
choice
[ do
lKeyword "true"
pure True,
do
lKeyword "false"
pure False
]
Using applicative style we might write this as:
pBool :: Parser Bool
pBool =
choice
[ lKeyword "true" *> pure True,
lKeyword "false" *> pure False
]
As another example, we might add support for parenthetical grouping like so:
parens :: Parser a -> Parser a
parens p = lexeme (chunk "(") *> p <* lexeme (chunk ")")
pAtom :: Parser BExp
pAtom =
choice
[ Lit <$> pBool,
Var <$> lVar,
do
lKeyword "not"
Not <$> pBExp,
parens pBExp
]
Applicative style goes beyond merely notational convenience. It is
possible to construct parser combinator libraries that are solely
applicative, and not monadic, which allows the parser to be inspected
and transformed in a deeper way, because there is no impenetrable
>>=
operator. Applicative parsers are however fundamentally less
powerful than monadic ones - specifically, they can handle only
context-free languages.
Parsing Comments
Most grammars allow some form of comment notation that is ignored when
parsing, but hopefully provides useful information to human readers.
In this section we will extend the parser described above to support
classic Unix-style line comments, where the character #
begins a
comment that continues to the next linebreak. For example, we want the
following string to parse:
x # here goes a comment
and # and
y # another
# one
It turns out that comments are quite easy to support in a
combinator-based parser. Specifically, we simply treat comments as a
kind of whitespace and implement them in our space
combinator.
Recall that it currently looks like this:
space :: Parser ()
space = do
_ <- many $ satisfy isSpace
pure ()
First we write a parser that parses a line comment.
comment :: Parser ()
comment = void $ do
void $ chunk "#"
many $ satisfy (/= '\n')
> parseTest comment "#foo"
()
> parseTest comment "#foo\n"
()
> parseTest comment "foo"
1:1:
|
1 | foo
| ^
unexpected 'f'
expecting '#'
As desired, it succeeds only when we parse a comment. Now we modify
the space
parser to parse whitespace, followed by a comment, and
if we parse a comment, then we recursively invoke space
. This
ensures space
will onsume arbitrary combinations of spaces and
comments.
> parseTest (space *> eof) "\n#foo\n#bar\n"
()
Because of the principled design of our parser, that is actually all we need to in order to support comments:
> putStrLn "x # comment\nand y"
x # comment
and y
λ> parseBExp "" "x # comment\nand y"
Right (And (Var "x") (Var "y"))