24 days of Hackage, 2015: day 19: ghc-core-html, list-fusion-probe; checking GHC's fusion rewrite rules for erasing intermediate data from existence

Table of contents for the whole series

A table of contents is at the top of the article for day 1.

Day 19

(Reddit discussion)

The single coolest feature of using Haskell, for me, has to be fusion. The GHC compiler performs this remarkable optimization that can erase entire intermediate data structures from existence.

What does that mean, and how can we know it happened? Today I’ll show how ghc-core-html and list-fusion-probe can sort of help in determining what the compiler actually did to your intermediate data structures.

I’ll give examples with lists as the intermediate data, but also mention vectors because yesterday, day 18, I briefly mentioned fusion in the context of vectors.

Imports for code

First, let’s get some imports out of the way before showing some HSpec and QuickCheck tests:

{-# LANGUAGE ScopedTypeVariables #-}

module ListFusionProbeSpec where

import Data.List.Fusion.Probe (fuseThis)

import Test.Hspec ( Spec, hspec, describe, it
                  , shouldBe, shouldSatisfy
                  , shouldThrow, errorCall
import Test.Hspec.QuickCheck (prop)
import Test.QuickCheck.Function (Fun(..), apply)
import Control.Exception (evaluate)
import Control.Arrow ((>>>))
import Data.Function ((&))

What is intermediate data in a pipeline?

When programming in a compositional way, often we create pipelines of data flow, in which data from one stage gets transformed into data for the next stage, and so on until the final output. The problem is that a naive implementation of a pipeline will result in construction of some data structure that exists only for the purpose of being consumed by the next stage.

Here’s an HSpec test illustrating a pipeline in which we name each intermediate list:

spec :: Spec
spec =
  describe "list-fusion-probe" $ do
    it "runs a chain of maps, filters" $
      let list1 = ["Hello", "my", "world!"]

          -- ["Hello", "world!"]
          list2 = filter ((> 2) . length) list1

          -- [5, 6]
          list3 = map length list2

          list4 = map (*3) list3
      in list4 `shouldBe` [15, 18]

In a typical implementation of a typical programming language, code looking like this will result in allocating and creating four lists (or arrays, or whatever collection type is idiomatic and desired), one after another, and traversing three lists using a filter and two maps. To do something cleverer and avoid creating intermediate data, one uses a special “stream” collection type, or more radically, “transducers”. These techniques enable collapsing the computation of the final data into a single traversal and no allocation of intermediate data.

The Haskell ecosystem contains many libraries for performing these kinds of optimizations, such as foldl, but they are outside the scope of this article. Instead, for simple examples such as the one above, GHC already automatically applies fusion, through special rewrite rules that are included in the standard libraries.

A note on pipeline syntax and composition

We can write the pipeline in many different ways. Here’s an OO-style way using & (reverse function application) and >>> (left-to-right composition):

    it "runs a chain of maps, filters (OO-style)" $
      let list4 = ["Hello", "my", "world!"] &
                     (filter (length >>> (> 2))
                      >>> map length
                      >>> map (*3)
      in list4 `shouldBe` [15, 18]

Here is the same thing but with the compositionality factored out. This is my preferred way of writing pipelines as first-class values:

    it "runs a chain of maps, filters, written compositionally with >>>" $
      let pipeline = filter (length >>> (> 2))
                     >>> map length
                     >>> map (*3)
      in pipeline ["Hello", "my", "world!"] `shouldBe` [15, 18]

Using traditional right-left composition .:

    it "runs a chain of maps, filters, written compositionally with ." $
      let pipeline = map (*3)
                     . map length
                     . filter ((> 2) . length)
      in pipeline ["Hello", "my", "world!"] `shouldBe` [15, 18]

What does rewriting mean?

The thing that rewriting does that the other approaches do not: the rewriting is a compiler preprocessing pass that rewrites your source code in order to optimize it. We’re not going into exactly how it works. There are rewrite rules that match source code constructs and replace them with semantically equivalent constructs, and basically fusion happens when during the course of repeatedly rewriting, certain constructs cancel themselves out, and “poof” goes your intermediate data.

If you look at the source code for map you will see scary-looking comments and compiler directives that look like

map :: (a -> b) -> [a] -> [b]
{-# NOINLINE [1] map #-}    -- We want the RULE to fire first.
                            -- It's recursive, so won't inline anyway,
                            -- but saying so is more explicit
map _ []     = []
map f (x:xs) = f x : map f xs

-- Note eta expanded
mapFB ::  (elt -> lst -> lst) -> (a -> elt) -> a -> lst -> lst
{-# INLINE [0] mapFB #-}
mapFB c f = \x ys -> c (f x) ys

-- The rules for map work like this.
-- Up to (but not including) phase 1, we use the "map" rule to
-- rewrite all saturated applications of map with its build/fold
-- form, hoping for fusion to happen.
-- In phase 1 and 0, we switch off that rule, inline build, and
-- switch on the "mapList" rule, which rewrites the foldr/mapFB
-- thing back into plain map.
-- It's important that these two rules aren't both active at once
-- (along with build's unfolding) else we'd get an infinite loop
-- in the rules.  Hence the activation control below.
-- The "mapFB" rule optimises compositions of map.
-- This same pattern is followed by many other functions:
-- e.g. append, filter, iterate, repeat, etc.

"map"       [~1] forall f xs.   map f xs                = build (\c n -> foldr (mapFB c f) n xs)
"mapList"   [1]  forall f.      foldr (mapFB (:) f) []  = map f
"mapFB"     forall c f g.       mapFB (mapFB c f) g     = mapFB c (f.g)

Some resources on fusion

Here are some articles illustrating more of the low-level workings of fusion:

But how do we know whether fusion worked?

The thing that has always bothered me is that there has not been an easy way to know whether fusion actually worked or tell the compiler, “I want the result of this expression fused, so report an error if you can’t fuse it for me because I care about maximum efficiency here”. For example, here is an open bug report about something that someone expected to be fused.

When writing high-abstraction code and expecting the compiler to do important optimizations for me, I consider it important to be notified if they are not happening as expected.

So it was interesting when I came across the list-fusion-probe library that does its own rewrite trickery to detect whether a list has fused or not. The way you use it with an expression of list type is you just wrap it with fuseThis, and the library will rewrite the call such that if fusion is happening, it just returns the list, but if not, it generates code that at runtime throws an exception. This is extremely primitive (I would never want to ship code that threw an exception “cannot fuse”), but is a useful proof of concept for testing.

So here are a few tests illustrating how insertion of fuseThis into pipelines verifies that fusion does happen (because there is no run time exception):

    it "fuses a chain of maps, filters" $
      let list1 = ["Hello", "my", "world!"]

          -- ["Hello", "world!"]
          list2 = fuseThis $ filter ((> 2) . length) list1

          -- [5, 6]
          list3 = fuseThis $ map length list2

          list4 = map (*3) list3
      in list4 `shouldBe` [15, 18]
    it "fuses a chain of maps, filters" $
      let list4 = ["Hello", "my", "world!"]
          pipeline = filter (length >>> (> 2)) >>> fuseThis
                     >>> map length >>> fuseThis
                     >>> map (*3)
      in pipeline list4 `shouldBe` [15, 18]
    it "Prelude foldl fuses" $
      let list = fuseThis [0..1001] :: [Int]
      in foldl (+) 0 list `shouldBe` 501501

We’ve verified here that list2 and list3 never exist as materialized data.

Note, for example, that the foldl example’s fusion amounts to the list [0..1001] never being created; instead, the generated machine code is basically a loop with index from 0 to 1001 adding to an accumulator.

Here’s a fusion failure:

    it "handwritten myFoldl fails to fuse" $
      let list = fuseThis [0..1001] :: [Int]
      in evaluate (myFoldl (+) 0 list) `shouldThrow`
           errorCall "fuseThis: List did not fuse"

where we provided myFoldl written in such a way that it does not match the fusion rewriting rules:

-- | This example taken straight from `list-fusion-probe` tests directory.
myFoldl :: (b -> a -> b) -> b -> [a] -> b
myFoldl f = go
  where go a [] = a
        go a (x:xs) = go (f a x) xs

Finally, for fun, let’s write a QuickCheck test that tests arbitrary predicates and transformation functions in a filter and map pipeline:

    prop "Prelude foldl fuses the result of a filter, map pipeline" $
      \(list :: [Int]) (predicate :: Fun Int Bool) (f :: Fun Int Int) ->
      let pipeline = fuseThis . filter (apply predicate)
                     . fuseThis . map (apply f)
         -- Just to force evaluation.
        foldl (+) 0 (pipeline list) `shouldSatisfy` (<= maxBound)

Why not report a fusion failure at compile time instead?

There’s a user interface issue when it comes to optimizations such as fusion. Ideally, I’d like to be able to declare desired fusion and have a failure be reported as a compile-time error.

However, the situation is tricky, because then the code can become brittle in the face of different optimization settings. For example, if you run the tests above within GHCi, you will find that GHCi never reports fusion failure. This is because it’s just not doing fusion at all: it’s ignoring rewrite rules, both in the standard library and in list-fusion-probe.

Maybe the solution is to have conditional directives. I do think that there has to be an option to treat a known performance bug as a compile-time error. It is not practical to live in fear and defensiveness when writing performance-oriented code and deciding, in the absence of clear compiler feedback, to simply manually write low-level ugly code because of lack of confidence that the elegant code will not perform properly.

What do you think? What kind of feedback would you like to get from compilation about optimizations you expect to be performed?

Checking generated code in yesterday’s vector example

Yesterday, I mentioned that fusion for vectors happens, but I wasn’t confident about how much actually happens. I found a useful tool ghc-core-html that outputs in somewhat prettified fashion the GHC Core code generated within GHC.

To use it, install it globally:

$ stack install ghc-core-html

I ran it on a sample modified source file to get HTML output:

$ stack exec ghc-core-html src/VectorFusionExample.hs > VectorFusionExample.html

The sample file:

-- | Extracted from VectorExample for minimal GHC Core output file.
module VectorFusionExample (makeCountsPurely) where

import qualified Data.Word as Word
import qualified Data.Vector.Unboxed as V

-- | Assume 8-bit unsigned integer.
type Value = Word.Word8

-- | Number of occurrences of an 8-bit unsigned value.
-- We assume no overflow beyond 'Int' range.
type CountOfValue = Int

-- | NOT the real thing!! With fake constants 123, 789
-- to help in reading the GHC Core.
makeCountsPurely :: V.Vector Value -> V.Vector CountOfValue
makeCountsPurely =
  V.unsafeAccumulate (+) (V.replicate numPossibleValues 123)
  . V.map (\v -> (fromIntegral v, 789))

-- | 256, in our case.
numPossibleValues :: Int
numPossibleValues = fromIntegral (maxBound :: Value) + 1

I’m not going to show the output, but just note that looking at it, I saw two calls to newByteArray#. Since one of the created arrays is the final array, but there is another allocation, we didn’t manage to get down to creating just a single array without allocating any other, unlike the implementation yesterday using an internal MVector.

This is good to know, but I shouldn’t have to generate low-level code and inspect it in order to obtain high-level information about what is fused where.

For more information

A tutorial on understanding the GHC compilation process is here. Here’s another.

A video by Simon Peyton Jones on Core.

Read anything and everything by Johan Tibell on Haskell performance.


Today I mentioned the importance of fusion and my frustration at not being able to get the kind of detailed information I would like about whether GHC performed fusion where I want it. However, list-fusion-probe helps with writing tests to detect list fusion failure, and ghc-core-html is a useful general-purpose tools for examining generated code for any purpose.

All the code

All my code for my article series are at this GitHub repo.

comments powered by Disqus