24 days of Hackage, 2015: day 18: vector, vector-algorithms: unleash your inner C programmer!

Table of contents for the whole series

A table of contents is at the top of the article for day 1.

Day 18

(Reddit discussion)

Often in programming, the right data structure to use is a C-style array, whether treated as immutable or mutable.

Ideally, I didn’t need to write this article, but I’ve gotten the impression that many newcomers to Haskell don’t know about vector, the library for zero-indexed C-style arrays, and use lists where arrays are better suited to the problem. Arrays are efficient for indexing and are memory and cache efficient also, being allocated in a block of memory rather than spread out over pointer chasing.

The terminology regarding arrays in the Haskell ecosystem is confusing because Haskell in the 1990s originally came with a data structure called an Array, and there’s even a supporting array package, but in practice I never use it because it’s more generic and weird than the simple data structure later provided called “vectors” (for lack of a better name).

There are good tutorials on vector I recommend:

I thought it would be useful to give a fully worked out example of using both immutable and mutable vectors. In particular, we’ll use unboxed vectors that are specialized and therefore contain no indirection and are basically equivalent to C-style arrays. We’ll do some low-level C-style programming.

The problem to solve: finding the median of an array of 8-bit integers

Here is a problem found at Programming Praxis:

“The median of an array is the value in the middle if the array was sorted; if the array has an odd number of items n, the median is the (n+1)/2’th largest item in the array (which is also the (n+1)/2’th smallest item in the array), and if the array has an even number of items n, the median is the arithmetic average of the n/2’th smallest item in the array and the n/2’th largest item in the array. For instance, the median of the array [2,4,5,7,3,6,1] is 4 and the median of the array [5,2,1,6,3,4] is 3.5.

Your task is to write a program that takes an array of 8-bit integers (possibly but not necessarily in sort) and finds the median value in the array; you should find an algorithm that takes linear time and constant space.”

vector installation

Stackage LTS 3 is still on vector-0.10, but I wanted to use version 0.11 because the API changed slightly, so I added a bound as appropriate in the Cabal file.

(Update of 2016-01-06)

Stackage LTS 4 now uses this updated version of vector.

The importance of going unboxed

Because we are given that we should write code specific to an array of 8-bit integers, we will be using unboxed rather than boxed (generic) vectors: boxed vectors contain elements that are pointers to values, while unboxed vectors have elements are the actual values themselves. This is pretty important for efficiency if you know up front the type of your elements.

Some tests

Let’s write some tests first, assuming we will later write a module VectorExample.

Some imports first. We will be using the constant-time vector indexing operator !.

module VectorExampleSpec where

import VectorExample
       ( Value, MedianValue, averageValue
       , spaceInefficientMedian, constantSpaceMedian
       )

import Data.Vector.Unboxed ((!))
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Algorithms.Radix as Radix

We will be using HSpec and QuickCheck, and writing our own random vector generators:

import Test.Hspec (Spec, hspec, describe, it, shouldBe)
import Test.Hspec.QuickCheck (prop)
import Test.QuickCheck (Arbitrary(arbitrary, shrink), choose, sized, shrinkList)

We will be implementing both a simple reference implementation and the desired clever one, so we will test both:

spec :: Spec
spec = do
  describe "compute median of vector of 8-bit unsigned integers" $ do
    medianSpec "inefficientSpaceMedian" inefficientSpaceMedian
    medianSpec "constantSpaceMedian" constantSpaceMedian

We test the examples given, as well as general properties. For sorting, we conveniently use the radix sort from the vector-algorithms package:

medianSpec :: String
            -> (V.Vector Value -> Maybe MedianValue)
            -> Spec
medianSpec description findMedian =
  describe description $ do
    describe "some examples" $ do
      it "handles odd number of elements" $ do
        findMedian (V.fromList [2, 4, 5, 7, 3, 6, 1]) `shouldBe` Just 4.0
      it "handles nonzero even number of elements" $ do
        findMedian (V.fromList [5, 2, 1, 6, 3, 4]) `shouldBe` Just 3.5
    describe "properties" $ do
      it "handles no elements" $ do
        findMedian V.empty `shouldBe` Nothing
      prop "handles one element" $ \v ->
        findMedian (V.singleton v) == Just (fromIntegral v)
      prop "handles odd number of elements" $
        \(VectorWithOdd values) ->
          let len = V.length values
              midIndex = pred (succ len `div` 2)
              sorted = V.modify Radix.sort values
          in findMedian values == Just (fromIntegral (sorted ! midIndex))
      prop "handles positive even number of elements" $
        \(VectorWithPositiveEven values) ->
          let len = V.length values
              midIndex = pred (succ len `div` 2)
              sorted = V.modify Radix.sort values
          in findMedian values ==
            Just (averageValue (fromIntegral (sorted ! midIndex))
                               (sorted ! succ midIndex))

The property tests specify what a median is supposed to be, for all possible cases.

A note on mutation

The radix sort operates on a mutable vector, but we are using an immutable vector, so we use the modify function that has a fancy type

modify :: Unbox a => (forall s. MVector s a -> ST s ()) -> Vector a -> Vector a

which takes a callback taking a mutable vector, and transforms an immutable vector into another immutable vector (possibly the same one “secretly” mutated underneath). It uses higher-rank types and the ST state transformer monad, but all you have to know is that Radix.sort has the compatible callback type and therefore can be plugged into modify:

sort :: forall e m v. (PrimMonad m, MVector v e, Radix e) => v (PrimState m) e -> m ()

Custom QuickCheck generators

Here are some quick and dirty generators of vectors of the kinds we want for our tests. If we were doing this for real, we’d probably want to write custom shrinkers rather than go through lists.

-- | Generate a vector with an odd number of elements.
newtype VectorWithOdd a =
  VectorWithOdd { getVectorWithOdd :: V.Vector a }
  deriving (Show)

instance (Arbitrary a, V.Unbox a) => Arbitrary (VectorWithOdd a) where
  arbitrary = sized $ \n -> do
    k <- choose (0, n `div` 2)
    VectorWithOdd <$> V.replicateM (2*k+1) arbitrary
  shrink = map (VectorWithOdd . V.fromList)
           . shrinkList shrink
           . V.toList
           . getVectorWithOdd

-- | Generate a vector with a nonzero even number of elements.
newtype VectorWithPositiveEven a =
  VectorWithPositiveEven { getVectorWithPositiveEven :: V.Vector a }
  deriving (Show)

instance (Arbitrary a, V.Unbox a) => Arbitrary (VectorWithPositiveEven a) where
  arbitrary = sized $ \n -> do
    k <- choose (1, 1 `max` (n `div` 2))
    VectorWithPositiveEven <$> V.replicateM (2*k) arbitrary
  shrink = map (VectorWithPositiveEven . V.fromList)
           . shrinkList shrink
           . V.toList
           . getVectorWithPositiveEven

An inefficient implementation that violates our space requirement

For reference (we could read it off from the QuickCheck tests almost):

{-# LANGUAGE BangPatterns #-}

module VectorExample where

import qualified Data.Word as Word
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as M
import qualified Data.Vector.Algorithms.Radix as Radix

-- | Assume 8-bit unsigned integer.
type Value = Word.Word8

-- | What to return as median when there is one.
type MedianValue = Double

-- | This is an obvious implementation, basically the specification.
-- The use of radix sort makes this linear in time, but a copy of
-- the original array is required.
inefficientSpaceMedian :: V.Vector Value -> Maybe MedianValue
inefficientSpaceMedian values
  | V.null values = Nothing
  | odd len = Just (fromIntegral (atSorted midIndex))
  | otherwise = Just (averageValue (atSorted midIndex)
                                   (atSorted (succ midIndex)))
  where
    len = V.length values
    midIndex = pred (succ len `div` 2)

    -- Make a local mutable copy to sort.
    atSorted = V.unsafeIndex (V.modify Radix.sort values)

-- | Average of two values.
averageValue :: Value -> Value -> MedianValue
averageValue a b = (fromIntegral a + fromIntegral b) / 2

A note on unsafe operations

I promised we were going to do C-style programming. We could use bounds-checked indexing (which throws a runtime exception if accessing outside the bounds of a vector), but let’s have some “fun”, and use unsafeIndex. Note that this is truly unsafe: let’s get the element at index 100 of a 1-element Int vector:

Data.Vector.Unboxed> unsafeIndex (singleton (5 :: Int)) 100
4931475852

Why use unsafe operations? For illustration and also because it happens that we can manually prove (outside the type system) that we will never go out of bounds. In real life, I’d hesitate to use the unsafe operations unless maximum performance was critical though! But we can opt in the C world of potential disaster if we want to.

I do not want to promote using unsafe operations!! Globally search and replace all unsafeXyz functions in today’s sample code with xyz instead for peace of mind.

The clever solution

The clever solution allocates a table of every possible count (because we know our values are 8-bit unsigned integers) and fills it, then iterates through the table to find out where the median must lie, and when it hits the right place, returns the middle value (if an odd number of values) or the middle two values (if an even number of values).

-- | Number of occurrences of an 8-bit unsigned value.
-- We assume no overflow beyond 'Int' range.
type CountOfValue = Int

-- | Create a table of counts for each possible value, since we know
-- the number of values is small and finite.
constantSpaceMedian :: V.Vector Value -> Maybe MedianValue
constantSpaceMedian values
  | V.null values = Nothing
  | odd len = Just (findMid 0 (numToCount - countOf 0))
  | otherwise = Just (findMid2 0 (numToCount - countOf 0))
  where
    len = V.length values

    -- How many sorted elements to count off, to reach first median value.
    numToCount = succ len `div` 2

    -- Make efficient table of counts for each possible value.
    countOf = V.unsafeIndex (makeCountsMutably values)

    findMid !i !numRemaining
      | numRemaining <= 0 = fromIntegral i
      | otherwise = findMid (succ i) (numRemaining - countOf (succ i))

    findMid2 !i !numRemaining =
      case numRemaining `compare` 0 of
        LT -> fromIntegral i -- median is duplicated, don't need average
        GT -> findMid2 (succ i) (numRemaining - countOf (succ i))
        EQ -> midAverage (succ i) (countOf (succ i))
          where
            midAverage j 0 = midAverage (succ j) (countOf (succ j))
            midAverage j _ = averageValue (fromIntegral i) (fromIntegral j)

Note the use of the GHC extension BangPatterns, covered in a 2014 Day of GHC Extensions, to ensure strict evaluation in the tail-recursive loops.

Mutable versus immutable ways to create the counts table

What is makeCountsMutably?

Mutable

The version of the function that uses mutation does it using the create function:

-- | Use local mutation for efficiency in creating a table of counts,
-- looping through to update it, and freezing the result to return.
makeCountsMutably
  :: V.Vector Value        -- ^ values seen
  -> V.Vector CountOfValue -- ^ value => count
makeCountsMutably values = V.create $ do
  counts <- M.replicate numPossibleValues 0
  V.forM_ values $
     M.unsafeModify counts succ . fromIntegral
  return counts

-- | 256, in our case.
numPossibleValues :: Int
numPossibleValues = fromIntegral (maxBound :: Value) + 1

create has type

create :: Unbox a => (forall s. ST s (MVector s a)) -> Vector a

and cleverly “freezes”, in place, the mutable vector returned within ST monad context, into an immutable vector, avoiding any copying because the mutation happening inside the callback is never visible outside it.

Immutable

But wait, there’s also a way to create the table using only immmutable Vectors, no intermediate MVector needed!

-- | Make table of counts without using 'MVector'.
makeCountsPurely
  :: V.Vector Value
  -> V.Vector CountOfValue
makeCountsPurely =
  V.unsafeAccumulate (+) (V.replicate numPossibleValues 0)
  . V.map (\v -> (fromIntegral v, 1))

This is straightforward and elegant but seems to create an intermediate vector using map, which is not what we want to do given that the whole point is to be space efficient.

Fusion

But vector comes with sophisticated support for fusion.

A later Day of Hackage (it’s now here at day 19) will explain what is going on, but GHC actually rewrites the calls to unsafeAccumulate and map in order to avoid creating an intermediate vector!

Conclusion

Fixed-size arrays are a very important low-level building block for many computations. I use vector when doing (sequential) array operations. Hopefully this little example shows how to write efficient vector-based imperative code in Haskell while hiding as much of the state as possible, through selected type-oriented APIs that operate between the two worlds of mutable and immutable. It is even possible to write code relying on fusion that does not explicitly build a mutable vector.

(These are sequential arrays. For parallel arrays, see a 2013 Day of Hackage which covered repa, a library for “high performance, regular, multi-dimensional, shape polymorphic parallel arrays”.)

All the code

All my code for my article series are at this GitHub repo.

comments powered by Disqus