24 days of Hackage, 2015: day 14: Earley: a promising newer parser library for Haskell

Dec 14, 2015 · 7 minute read · Comments
Haskell Hackage Earley parsec parsers ambiguity nonempty lists

Table of contents for the whole series

A table of contents is at the top of the article for day 1.

Day 14

On day 10, I showed how to use S-expressions to avoid having to write a custom parser. But writing parsers isn’t too bad in Haskell, or is it? The popular parsec library has many problems, because it requires hand-hacked backtracking that causes weird error messages and difficulty in reasoning about your grammar. There’s an improved fork of parsec called megaparsec, but it’s still the same kind of technology. How about something completely different.

The recent Earley is intriguing and I’ve begun using it for new projects where I don’t need the monadic power of something like parsec but are OK with an applicative API instead and don’t need the performance of something like attoparsec. Apart from good error messages, it allows handles online parsing and ambiguity.

Today I’ll give two small examples of using Earley.

Installation

Since Stackage LTS is behind right now, and Earley keeps moving, I decided to use the latest version of Earley by modifying our stack.yaml:

- Earley-0.10.1.0

(Update of 2016-01-06)

Stackage LTS 4 has caught up, no no more need for this modification.

Parsing into day 10’s AST

Let’s go back to the symbolic differentiation problem on day 10, and create a math-like infix syntax to parse.

Tests

Here are some HSpec/QuickCheck tests to illustrate what we want when parsing a string into an Exp.

Imports

Text.Earley is the main module of the Earley package; Report is used for return a report on the progress of the parse.

{-# LANGUAGE QuasiQuotes #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE LambdaCase #-}

module SymbolicDifferentiation.EarleySpec where

import SymbolicDifferentiation.AlphaSyntax (Exp(N, V, Plus, Times))
import qualified SymbolicDifferentiation.Earley as Earley
import Text.Earley (Report(..))

import Test.Hspec (Spec, hspec, describe, it, shouldBe, shouldSatisfy)
import Test.Hspec.QuickCheck (prop)
import Test.QuickCheck (NonNegative(..))

import Data.String.Here (i)

QuickCheck tests

Some QuickCheck tests that show that some sample expressions such as x*a + y*b * (z+c) parse into the expected ASTs:

spec :: Spec
spec =
  describe "Custom syntax for expression parsed by Earley" $ do
    -- For simplicity, don't support negative numeric literals now.
    prop "x + a" $ \(NonNegative (a :: Int)) ->
      fst (Earley.parses [i|x + ${a}|]) `shouldBe`
        [Plus (V "x") (N a)]

    prop "x*a + y*b * (z+c)" $
      \(NonNegative (a :: Int))
       (NonNegative (b :: Int))
       (NonNegative (c :: Int)) ->
      fst (Earley.parses [i|x*${a} + y*${b} * (z+${c})|]) `shouldBe`
        [Plus (Times (V "x") (N a))
              (Times (Times (V "y") (N b))
                     (Plus (V "z") (N c)))]

Expected parse errors

Finally, one example of how to check for expected parse errors. The error tokens are user-defined and attached to grammar productions, as we will see.

    it "x + y * + 5" $
      Earley.parses "x + y * + 5" `shouldSatisfy`
        \case
          ([], Report { position = 8
                      , expected = ["number", "identifier", "("]
                      , unconsumed = "+ 5"
                      }) -> True
          _ -> False

Implementation

The implementation involves Applicative idioms that will be familiar to you if you have used parsec.

Imports

{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE RecursiveDo #-}
{-# LANGUAGE FlexibleContexts #-}

module SymbolicDifferentiation.Earley where

import SymbolicDifferentiation.AlphaSyntax (Exp(N, V, Plus, Times))

import qualified Text.Earley as E
import Text.Earley ((<?>))
import Control.Applicative (many, some, (<|>))
import qualified Data.Char as Char
import Control.Monad.ST (ST)
import Data.ListLike (ListLike)

-- | What to report for something expected.
type Expected = String

The <?> operator is used to attach an expectation (which we have decided to specify as a string, with type synonym Expected) to a production.

Drivers

What we want for our particular problem is a parser that takes a string as input and expects to fully parse it. We construct it from a more generic parser that comes from processing our grammar.

-- | Return a list of all possible `Exp` parses, and also a status report
-- regardless of how many successes.
parses :: String -> ([Exp], E.Report Expected String)
parses = E.fullParses expParser

-- | Parser created from the grammar.
expParser :: ListLike input Char =>
             ST state (input -> ST state (E.Result state Expected input Exp))
expParser = E.parser grammar

Grammar

Our grammar is straightforward. Earley uses a monad to maintain its internal state, and we use the RecursiveDo GHC extension (covered in a 2014 Day of GHC Extensions) in order to be able to refer to a rule within the grammar recursively. Note that left recursion in the grammar is just fine for Earley.

Prod is the type constructor for a production, and you build up productions using combinators such as satisfy and symbol.

-- | Basically taken from <https://github.com/ollef/Earley/blob/master/examples/Expr2.hs Earley example expression parser>
grammar :: forall r. E.Grammar r (E.Prod r Expected Char Exp)
grammar = mdo
  whitespace <- E.rule $
    many $ E.satisfy Char.isSpace

  let token :: E.Prod r Expected Char a -> E.Prod r Expected Char a
      token p = whitespace *> p

      sym x   = token $ E.symbol x <?> [x]

      ident   = token $ (:) <$> E.satisfy Char.isAlpha
                            <*> many (E.satisfy Char.isAlphaNum)
                            <?> "identifier"
      num     = token $ some (E.satisfy Char.isDigit) <?> "number"
      -- For now, just handle unsigned numeric literals.

  atom <- E.rule $
    (N . read) <$> num
    <|> V <$> ident
    <|> sym '(' *> term <* sym ')'

  factor <- E.rule $
    Times <$> factor <* sym '*' <*> atom
    <|> atom

  term <- E.rule $
    Plus <$> term <* sym '+' <*> factor
    <|> factor

  return $ term <* whitespace

For more examples of grammars, see the examples directory in the Earley GitHub repo.

(Update of 2015-12-17)

The inverse of parsing is pretty-printing, covered on day 17.

For fun: solving the “number word” problem

The ability to handle ambiguity and return all possible parses is a useful one in many situations. Here I show a solution to the “number word” problem. In the past, I have managed ambiguity using Happy’s GLR support, but I don’t like writing parsers using Happy.

The “number word” problem:

Given a positive integer, return all the ways that the integer can be
represented by letters using the mapping 1 -> A, 2 -> B, ..., 26 ->
Z. For instance, the number 1234 can be represented by the words ABCD,
AWD and LCD.

This is a toy version of an actually serious problem, that of segmentation in natural language.

Test

The test reflects the problem statement:

module EarleyExampleSpec where

import EarleyExample (grammar, NumberWord, Expected)
import qualified Text.Earley as E
import qualified Data.List.NonEmpty as NonEmpty

import Test.Hspec (Spec, hspec, describe, it, shouldMatchList)

spec :: Spec
spec =
  describe "EarleyExample" $ do
    it "returns all possible parses of number words" $ do
      let (result, _) = parseNumberWord 1234
      map NonEmpty.toList result `shouldMatchList`
        ["ABCD", "AWD", "LCD"]

parseNumberWord :: Integer -> ([NumberWord], E.Report Expected String)
parseNumberWord = E.fullParses (E.parser grammar) . show

Note that I am using a NonEmpty list of Char because an empty string is not a valid solution to the “number word” problem. (I covered NonEmpty on day 7.)

Solution

The solution is just to write a grammar that tries to pick off valid consecutive digits to make a letter. We create a production for each possible letter that we care about, using numberLetterFor, combine those productions with alternation using asum to get a composite production numberLetter, then use that for numberWord which is the grammar.

{-# LANGUAGE RecursiveDo #-}

module EarleyExample where

import qualified Text.Earley as E
import Text.Earley ((<?>))
import Control.Applicative ((<|>))
import qualified Data.Foldable as Foldable
import qualified Data.Char as Char
import qualified Data.List.NonEmpty as NonEmpty
import Data.List.NonEmpty (NonEmpty((:|)))

-- | Result wanted.
type NumberWord = NonEmpty NumberLetter

-- | 'A' to 'Z'.
type NumberLetter = Char

-- | What to report for something expected.
type Expected = String

grammar :: E.Grammar r (E.Prod r Expected Char NumberWord)
grammar = mdo
  numberWord <- E.rule $
    NonEmpty.cons <$> numberLetter <*> numberWord
    <|> (:| []) <$> numberLetter
  return numberWord

numberLetter :: E.Prod r Expected Char NumberLetter
numberLetter = (Foldable.asum . map numberLetterFor) ['A'..'Z'] <?> "number"

-- | Return a production for a given letter.
--
-- 1 is 'A', 2 is 'B', .. 26 is 'Z'.
numberLetterFor :: NumberLetter -> E.Prod r Expected Char NumberLetter
numberLetterFor c = c <$ E.word (show (toNumber c)) <?> [c]

-- | 'A' is 1, ... 'Z' is 26
toNumber :: NumberLetter -> Int
toNumber c = (Char.ord c - Char.ord 'A') + 1

Conclusion

I only recently discovered the Earley parser library and started using it. I’m pretty excited by its friendliness.

All the code

All my code for my article series are at this GitHub repo.

The Conscientious Programmer

24 days of Hackage, 2015: day 14: Earley: a promising newer parser library for Haskell

Table of contents for the whole series

Day 14

Installation

(Update of 2016-01-06)

Parsing into day 10’s AST

Tests

Imports

QuickCheck tests

Expected parse errors

Implementation

Imports

Drivers

Grammar

(Update of 2015-12-17)

For fun: solving the “number word” problem

Test

Solution

Conclusion

All the code