24 days of Hackage, 2015: day 2: Regexes with pcre-heavy; standalone Haskell scripts using Stack
Dec 2, 2015 · 11 minute read · CommentsHaskellHackagepcre-heavyregexesparsersparsecPerlPCREPittsburgh TechFestTemplate Haskell
Table of contents for the whole series
A table of contents is at the top of the article for day 1.
Day 2
Don’t laugh, but once upon a time, I made Perl my main programming language of choice (between around 1999 and 2010). There were many reasons for this, but one reason was that Perl made it very easy to do text processing using regexes.
If you are a seasoned Haskeller, you might be thinking, “Why not use a real parser instead?“, such as the venerable parsec, which was covered in a 2012 day of Hackage? (Or, today, one could consider one of several other newer alternative libraries for parsing. A later day of Hackage will say more about this!)
After all, Jamie Zawinski famously once wrote, “Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.” I even gave a talk at Pittsburgh Tech Fest in 2013, “Stop overusing regular expressions!”, in which I promoted writing parsers rather than writing regexes.
But, sometimes I do want to use a regex. In that case, I have been
using an obscure but useful package, pcre-heavy
.
Today I’ll show how to use pcre-heavy
, and while at it, also show
how to ship one-file standalone Haskell scripts that only require
Stack.
Why use regexes at all?
Before going into pcre-heavy
, I thought I should explain when I use
regexes.
Back when I was doing a lot of text extraction, cleaning, including correction, restructuring of messy data, regexes seemed the only choice really. I had to not lose any “intended” information even if it was obscured by garbage or misspellings or the like. I therefore could not use some kind of approximate statistical technique, but had to iteratively do do a lot exploratory work with some interactive prompting in order to gradually clean up the data. Super-powerful regex constructs of the Perl variety seemed perfect for this task.
But even outside of such use cases, there’s no hiding from the fact that regexes can be very convenient for simple tasks. Also, because regexes are used so much in our programming world in general, if we are migrating to Haskell some already-working regexes from already-written code in some other language, it’s convenient to just stick with regexes.
Which Haskell regex library to use?!
A newcomer to Haskell must be overwhelmed by the lack of a single standard library and syntax for regexes. I mean, take a look at this wiki page.
Today, I’m presenting
pcre-heavy
, a
regex library that I’ve been using when I want regexes at all (I try
not to want them). It’s pretty new and not even mentioned on that wiki
page.
Some of my criteria for choosing a regex library:
- I want Perl-style regexes. That’s what I’m used to and are a kind of standard across regex support in many programming languages.
- Nice syntax is a plus. One of the selling points of using regexes is that the conciseness of writing patterns, binding matches, etc. Without such conciseness, I just think “Why not just write a real parser? It only takes a couple of lines in Haskell anyway.”
- High performance is a perfectly legitimate reason to use regexes.
Given these criteria, using a PCRE-based library seemed the way to go. OK, the wiki page lists a bunch of PCRE-based libraries.
pcre-light
is a
good way to go.
It does require installation of the C library for PCRE.
I’m mainly on Mac OS X, so I have PCRE installed through
Homebrew with $ brew install pcre
. I have PCRE working on
Linux. Unfortunately, I don’t use Windows, so if someone can verify
that pcre-light
installs OK on Windows, that would be great. I would
feel sad if I picked a library that is problematic for Windows users.
Recently, out came
pcre-heavy
, a
wrapper around pcre-light
that uses
Template Haskell, the GHC
extension that is “macros for Haskell”, enabling compile-time
metaprogramming (see the
2014 Day of Hackage article about Template Haskell.
I liked it, so I use it.
Example program using pcre-heavy
pcre-heavy
has decent documentation on
its Hackage page, so
I recommend reading that for the full details on how to use it. I’ll
give just a simple example here in the context of a complete program
that does something.
Specification and some test cases
Say we have a file of lines of text that are supposed to have a comma-separated format of
- a fixed header
- a text transcript’s file path
- an “audio” or “video” field indicating the type of associated media
- an optional annotation about whether the associated media is missing or not yet linked into the transcript
(I made up this example based on the structured text specification called CHAT that happens to include a single line of this format, e.g. this coded Supreme Court oral argument transcript for “Citizens United v. Federal Election Commission”.)
Examples that should match:
@Media: has-audio, audio
@Media: has-video,video
@Media: has-audio-but-missing, audio, missing
@Media: has-video-but-unlinked , video, unlinked
Examples that should fail to match:
@Media: no-audio-or-video
@Media: missing-media-field, unlinked
Creating a regex
Here is a pcre-heavy
regex, using the
re
Template Haskell
quasiquoter
that builds a PCRE-compiled
Regex
:
mediaRegex :: Regex
mediaRegex = [re|^@Media:\t([^ ,]+)\ *,\ *(audio|video)(\ *,\ *(?:missing|unlinked))?|]
Regex string validated at Haskell compile-time
One selling point of pcre-heavy
for me is that because it uses
Template Haskell, a bad regex string results in a Haskell-level
compile-time error rather than a runtime error.
Example of a compile-time error:
-- This Haskell code fails to compile!
mediaRegex :: Regex
mediaRegex = [re|^@Media:\t([^ ,]+)\ *,\ *(audio|video)(\ *,\ *(?:missing|unlinked)?|]
Loading this in GHCi or compiling with GHC results in
Exception when trying to run compile-time code:
Text.Regex.PCRE.Light: Error in regex: missing )
Code: template-haskell-2.10.0.0:Language.Haskell.TH.Quote.quoteExp
re
"^@Media:\\t([^ ,]+)\\ *,\\ *(audio|video)(\\ *,\\ *(?:missing|unlinked)?"
Using the regex
We’ll use
scan
to extract the matches (if any) against our regex on a string.
scan
returns a lazy list of all possible matches:
-- Simplified type signature for our purposes.
scan :: Regex -> String -> [(String, [String])]
Each match is a pair (String, [String])
, where the first component
is the whole string that matched, and the second is an ordered list of
parenthesized groupings in the regex. In our regex, we had three
parenthesized groupings, so a match could result in a three-element
grouping list:
*Main> scan mediaRegex "@Media:\tfoo, audio, unlinked"
[("@Media:\tfoo, audio, unlinked",["foo","audio",", unlinked"])]
Since we only want
the first match (if any), we can just compose it with
listToMaybe
from Data.Maybe
,
which has type
listToMaybe :: [a] -> Maybe a
so listToMaybe . scan mediaRegex
has type String -> Maybe (String, [String])
.
*Main> (listToMaybe . scan mediaRegex) "@Media:\tfoo, audio, unlinked"
Just ("@Media:\tfoo, audio, unlinked",["foo","audio",", unlinked"])
Extracting useful information
Finally, what we really wanted to do after matching is apply additional business logic and get stuff into a real type as soon as possible, rather than engage in “stringly-typed” programming and context-dependent list lengths.
Let’s say that for our task, we only care about matched lines that are not missing or unlinked, and skip those that are missing or unlinked. We define a data type and use pattern matching to get out of the untyped world into the typed world of our data model.
data Info =
Skip
| Audio FilePath
| Video FilePath
deriving (Eq, Show)
-- | Extract information about a media file if it is present.
extractIfPresent :: (String, [String]) -> Info
extractIfPresent (_, [name, "audio"]) = Audio name
extractIfPresent (_, [name, "video"]) = Video name
extractIfPresent (_, _) = Skip
Presentation as a report
Finally, now that we are done with the regex world, and have a data model, all that is left is a driver to complete an example command-line program.
We have all the information needed to print out a report for each line.
-- | Output a report.
reportOnInfo :: Maybe Info -> IO ()
reportOnInfo Nothing = putStrLn "no match"
reportOnInfo (Just Skip) = putStrLn "match, but missing or unlinked"
reportOnInfo (Just (Audio path)) = printf "audio at %s\n" path
reportOnInfo (Just (Video path)) = printf "video at %s\n" path
And the final driver, piping everything through from standard input:
main :: IO ()
main = do
s <- getContents
mapM_ (reportOnInfo
. fmap extractIfPresent
. listToMaybe
. scan mediaRegex
) (lines s)
Using Stack to ship standalone scripts
We can try our program from within the GHCi REPL by just typing main
or :main
at the REPL prompt and typing in lines of text. We can also
do stack build
to native-compile into a shippable binary.
But another option is to ship the source code as a standalone one-file script. This can be very convenient in some circumstances, when you can rely on the recipient simply installing Stack.
Here’s how we can turn our program into such a standalone script: just add the following two lines and make the file executable:
#!/usr/bin/env stack
-- stack --resolver lts-6.9 --install-ghc runghc --package pcre-heavy
Stack will read the embedded command in order to install GHC, if
needed, and first download and install the packages listed (here
pcre-heavy
), if needed. We have pinned down the exact version of
LTS in order to guarantee what versions of everything will be used by
Stack. (Note: in this case, because of FFI with a
C library, the recipient has to install PCRE first.)
So if you have short programs that don’t need to be organized into full-scale Cabal projects, you can treat Haskell as a “scripting language” with full access to the libraries of Hackage!
$ app/PCREHeavyExampleMain.hs < input.txt > output.txt
A warning
Although this Stack-as-Haskell-interpreter feature is kind of cool, I
prefer to write modular, separately testable libraries, while having
the main
driver of the Main
module of a program just use library
modules that do most of the real work. Furthermore, I prefer to build
and use native-compiled libraries and binaries because they’re just
much faster to start up and also run: runghc
is a Haskell
interpreter rather than a native optimizing compiler. But the beauty
of the GHC Haskell world is you can run in either mode, and flip from
one to the other seamlessly.
Here’s our complete example standalone program
#!/usr/bin/env stack
-- stack --resolver lts-6.9 --install-ghc runghc --package pcre-heavy
{-# LANGUAGE QuasiQuotes #-}
module Main where
import Text.Regex.PCRE.Heavy (Regex, re, scan)
import Data.Maybe (listToMaybe)
import Text.Printf (printf)
-- | Match a media name, audio/video, and optional missing/unlinked.
mediaRegex :: Regex
mediaRegex = [re|^@Media:\t([^ ,]+)\ *,\ *(audio|video)(\ *,\ *(?:missing|unlinked))?|]
data Info =
Skip
| Audio FilePath
| Video FilePath
deriving (Eq, Show)
-- | Extract information about a media file if it is present.
extractIfPresent :: (String, [String]) -> Info
extractIfPresent (_, [name, "audio"]) = Audio name
extractIfPresent (_, [name, "video"]) = Video name
extractIfPresent (_, _) = Skip
-- | Output a report.
reportOnInfo :: Maybe Info -> IO ()
reportOnInfo Nothing = putStrLn "no match"
reportOnInfo (Just Skip) = putStrLn "match, but missing or unlinked"
reportOnInfo (Just (Audio path)) = printf "audio at %s\n" path
reportOnInfo (Just (Video path)) = printf "video at %s\n" path
-- | Driver, in traditional right-to-left syntax.
main :: IO ()
main = do
s <- getContents
mapM_ (reportOnInfo
. fmap extractIfPresent
. listToMaybe
. scan mediaRegex
) (lines s)
Some additional notes
One limitation faced by a short expository article with example code is that we don’t like to waste space and attention, and therefore tend to present quick-and-dirty code, rather than production-level code (which is efficient, has sensible error recovery, well-commented). I’ve been thinking about the dilemma of how not to give the wrong impression and set a bad example by showing simplistic example code. There’s no easy answer, but I felt it might be useful to provide optional “advanced” notes sometimes, on how to write real code.
pcre-heavy
allows matching not only of String
, but also of
ByteString
and Text
types. In practice, for efficiency, we
want to use
bytestring
and
text
as much as possible,
rather than the inefficient String
type. (A 2012 day of hackage
article talks about text
.)
Since the underlying PCRE C library uses bytes, I generally hand
bytestrings to pcre-heavy
.
The sample driver code uses lazy I/O to get the lines from input. This
is superficially elegant and concise for pedagogical purposes, but in
real life is a source of resource leaks and other problems and even
causes people to think “Haskell is inefficient”. For real work, I like
to use pipes
, which was
covered in another
2012 day of Hackage
and also has an
extensive, beautiful tutorial
by its author, Gabriel Gonzalez, who also has a fantastic,
long-running, active blog
“Haskell for all” that every
Haskeller should follow.
Finally, was a regex the right choice here? It was simple enough for this problem, but you can see from the ad hoc pattern matching and hardcoded strings and fragile positional ordering and number of groups that things could get error-prone really quickly if the regex got any more complex or we wanted to do proper error handling in case of a failed match.
Conclusion
Regex support is not a strong point of the Haskell ecosystem, which is
geared to more structured parsing, but there are options if you really
want to use regexes, and I like the Perl-style pcre-light
family of
libraries that now includes pcre-heavy
.
Also, I showed how to add two lines to the top of a Haskell program to turn it into a Stack script.
(Update from day 9)
Day 9 covers more libraries based on Template Haskell.
All the code
All my code for my article series are at this GitHub repo.