24 days of Hackage, 2015: day 14: Earley: a promising newer parser library for Haskell
Dec 14, 2015 · 7 minute read · CommentsHaskellHackageEarleyparsecparsersambiguitynonempty lists
Table of contents for the whole series
A table of contents is at the top of the article for day 1.
Day 14
On
day 10,
I showed how to use S-expressions to avoid having to write a custom
parser. But writing parsers isn’t too bad in Haskell, or is it? The
popular parsec
library
has many problems,
because it requires hand-hacked backtracking that causes weird error
messages and difficulty in reasoning about your grammar. There’s an
improved fork of parsec
called
megaparsec
, but
it’s still the same kind of technology. How about something completely
different.
The recent Earley
is
intriguing and I’ve begun using it for new projects where I don’t need
the monadic power of something like parsec
but are OK with an
applicative API instead and don’t need the performance of something
like attoparsec
. Apart from good error messages, it allows handles
online parsing and ambiguity.
Today I’ll give two small examples of using Earley
.
Installation
Since Stackage LTS is behind right now, and Earley
keeps moving,
I decided to use the latest version of Earley
by modifying our
stack.yaml
:
- Earley-0.10.1.0
(Update of 2016-01-06)
Stackage LTS 4 has caught up, no no more need for this modification.
Parsing into day 10’s AST
Let’s go back to the symbolic differentiation problem on day 10, and create a math-like infix syntax to parse.
Tests
Here are some HSpec/QuickCheck tests to illustrate what we want when
parsing a string into an Exp
.
Imports
Text.Earley
is the main module of the Earley
package; Report
is
used for return a report on the progress of the parse.
{-# LANGUAGE QuasiQuotes #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE LambdaCase #-}
module SymbolicDifferentiation.EarleySpec where
import SymbolicDifferentiation.AlphaSyntax (Exp(N, V, Plus, Times))
import qualified SymbolicDifferentiation.Earley as Earley
import Text.Earley (Report(..))
import Test.Hspec (Spec, hspec, describe, it, shouldBe, shouldSatisfy)
import Test.Hspec.QuickCheck (prop)
import Test.QuickCheck (NonNegative(..))
import Data.String.Here (i)
QuickCheck tests
Some QuickCheck tests that show that some sample expressions such as
x*a + y*b * (z+c)
parse into the expected ASTs:
spec :: Spec
spec =
describe "Custom syntax for expression parsed by Earley" $ do
-- For simplicity, don't support negative numeric literals now.
prop "x + a" $ \(NonNegative (a :: Int)) ->
fst (Earley.parses [i|x + ${a}|]) `shouldBe`
[Plus (V "x") (N a)]
prop "x*a + y*b * (z+c)" $
\(NonNegative (a :: Int))
(NonNegative (b :: Int))
(NonNegative (c :: Int)) ->
fst (Earley.parses [i|x*${a} + y*${b} * (z+${c})|]) `shouldBe`
[Plus (Times (V "x") (N a))
(Times (Times (V "y") (N b))
(Plus (V "z") (N c)))]
Expected parse errors
Finally, one example of how to check for expected parse errors. The error tokens are user-defined and attached to grammar productions, as we will see.
it "x + y * + 5" $
Earley.parses "x + y * + 5" `shouldSatisfy`
\case
([], Report { position = 8
, expected = ["number", "identifier", "("]
, unconsumed = "+ 5"
}) -> True
_ -> False
Implementation
The implementation involves Applicative
idioms that will be familiar
to you if you have used parsec
.
Imports
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE RecursiveDo #-}
{-# LANGUAGE FlexibleContexts #-}
module SymbolicDifferentiation.Earley where
import SymbolicDifferentiation.AlphaSyntax (Exp(N, V, Plus, Times))
import qualified Text.Earley as E
import Text.Earley ((<?>))
import Control.Applicative (many, some, (<|>))
import qualified Data.Char as Char
import Control.Monad.ST (ST)
import Data.ListLike (ListLike)
-- | What to report for something expected.
type Expected = String
The <?>
operator is used to attach an expectation (which we have
decided to specify as a string, with type synonym Expected
) to a
production.
Drivers
What we want for our particular problem is a parser that takes a string as input and expects to fully parse it. We construct it from a more generic parser that comes from processing our grammar.
-- | Return a list of all possible `Exp` parses, and also a status report
-- regardless of how many successes.
parses :: String -> ([Exp], E.Report Expected String)
parses = E.fullParses expParser
-- | Parser created from the grammar.
expParser :: ListLike input Char =>
ST state (input -> ST state (E.Result state Expected input Exp))
expParser = E.parser grammar
Grammar
Our grammar is straightforward. Earley
uses a monad to maintain its
internal state, and we use the RecursiveDo
GHC extension (covered in
a
2014 Day of GHC Extensions)
in order to be able to refer to a rule within the grammar
recursively. Note that left recursion in the grammar is just fine for
Earley
.
Prod
is the type constructor for a production, and you build up
productions using combinators such as satisfy
and symbol
.
-- | Basically taken from <https://github.com/ollef/Earley/blob/master/examples/Expr2.hs Earley example expression parser>
grammar :: forall r. E.Grammar r (E.Prod r Expected Char Exp)
grammar = mdo
whitespace <- E.rule $
many $ E.satisfy Char.isSpace
let token :: E.Prod r Expected Char a -> E.Prod r Expected Char a
token p = whitespace *> p
sym x = token $ E.symbol x <?> [x]
ident = token $ (:) <$> E.satisfy Char.isAlpha
<*> many (E.satisfy Char.isAlphaNum)
<?> "identifier"
num = token $ some (E.satisfy Char.isDigit) <?> "number"
-- For now, just handle unsigned numeric literals.
atom <- E.rule $
(N . read) <$> num
<|> V <$> ident
<|> sym '(' *> term <* sym ')'
factor <- E.rule $
Times <$> factor <* sym '*' <*> atom
<|> atom
term <- E.rule $
Plus <$> term <* sym '+' <*> factor
<|> factor
return $ term <* whitespace
For more examples of grammars, see the
examples directory in the Earley
GitHub repo.
(Update of 2015-12-17)
The inverse of parsing is pretty-printing, covered on day 17.
For fun: solving the “number word” problem
The ability to handle ambiguity and return all possible parses is a useful one in many situations. Here I show a solution to the “number word” problem. In the past, I have managed ambiguity using Happy’s GLR support, but I don’t like writing parsers using Happy.
The “number word” problem:
Given a positive integer, return all the ways that the integer can be
represented by letters using the mapping 1 -> A, 2 -> B, ..., 26 ->
Z. For instance, the number 1234 can be represented by the words ABCD,
AWD and LCD.
This is a toy version of an actually serious problem, that of segmentation in natural language.
Test
The test reflects the problem statement:
module EarleyExampleSpec where
import EarleyExample (grammar, NumberWord, Expected)
import qualified Text.Earley as E
import qualified Data.List.NonEmpty as NonEmpty
import Test.Hspec (Spec, hspec, describe, it, shouldMatchList)
spec :: Spec
spec =
describe "EarleyExample" $ do
it "returns all possible parses of number words" $ do
let (result, _) = parseNumberWord 1234
map NonEmpty.toList result `shouldMatchList`
["ABCD", "AWD", "LCD"]
parseNumberWord :: Integer -> ([NumberWord], E.Report Expected String)
parseNumberWord = E.fullParses (E.parser grammar) . show
Note that I am using a NonEmpty
list of Char
because an empty
string is not a valid solution to the “number word” problem. (I
covered NonEmpty
on day 7.)
Solution
The solution is just to write a grammar that tries to pick off valid
consecutive digits to make a letter. We create a production for each
possible letter that we care about, using numberLetterFor
, combine
those productions with alternation using asum
to get a composite
production numberLetter
, then use that for numberWord
which is the
grammar.
{-# LANGUAGE RecursiveDo #-}
module EarleyExample where
import qualified Text.Earley as E
import Text.Earley ((<?>))
import Control.Applicative ((<|>))
import qualified Data.Foldable as Foldable
import qualified Data.Char as Char
import qualified Data.List.NonEmpty as NonEmpty
import Data.List.NonEmpty (NonEmpty((:|)))
-- | Result wanted.
type NumberWord = NonEmpty NumberLetter
-- | 'A' to 'Z'.
type NumberLetter = Char
-- | What to report for something expected.
type Expected = String
grammar :: E.Grammar r (E.Prod r Expected Char NumberWord)
grammar = mdo
numberWord <- E.rule $
NonEmpty.cons <$> numberLetter <*> numberWord
<|> (:| []) <$> numberLetter
return numberWord
numberLetter :: E.Prod r Expected Char NumberLetter
numberLetter = (Foldable.asum . map numberLetterFor) ['A'..'Z'] <?> "number"
-- | Return a production for a given letter.
--
-- 1 is 'A', 2 is 'B', .. 26 is 'Z'.
numberLetterFor :: NumberLetter -> E.Prod r Expected Char NumberLetter
numberLetterFor c = c <$ E.word (show (toNumber c)) <?> [c]
-- | 'A' is 1, ... 'Z' is 26
toNumber :: NumberLetter -> Int
toNumber c = (Char.ord c - Char.ord 'A') + 1
Conclusion
I only recently discovered the Earley
parser library and started
using it. I’m pretty excited by its friendliness.
All the code
All my code for my article series are at this GitHub repo.