#3116: missed opportunity for call-pattern specialisation

With strict ByteString and a simple tail-recursive iteration pattern we get perfect code.

First the definitions:

data ByteString
   = BS  {-# UNPACK #-} !(ForeignPtr Word8)  -- payload
         {-# UNPACK #-} !Int                 -- offset
         {-# UNPACK #-} !Int                 -- length

null :: ByteString -> Bool
null (BS _ _ l) = l <= 0

tail :: ByteString -> ByteString
tail (BS p s l)
  | l <= 0     = error "ByteString.tail: empty ByteString"
  | otherwise  = BS p (s+1) (l-1)

Now a trivial iteration pattern:

length :: ByteString -> Int
length = go 0
  where
    go !n bs  | null bs    = n
              | otherwise  = go (n+1) (tail bs)

Perfect core code (edited for clarity):

go :: Int# -> Addr# -> ForeignPtrContents
   -> Int# -> Int# -> Int#
go = \n p fpc o l ->
  case l <=# 0 of
    False  -> go (n +# 1) p fpc (o +# 1) (l -# 1)
    True   -> n

length :: ByteString -> GHC.Base.Int
length = \bs ->
  case bs of
    BS p fpc o l ->
      case go 0 p fpc o l of
        n -> I# n

This worked because strict ByteString is a single-constructor data type, which allows it to be unpacked into separate parameters in the recursive call.
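
Roughly, that worker/wrapper split looks like this when written by hand at the source level (lengthW and goW are made-up names for this sketch; the real Core, shown just above, further unpacks the ForeignPtr into an Addr# and a ForeignPtrContents):

lengthW :: ByteString -> Int
lengthW (BS fp o0 l0) = goW 0 fp o0 l0
  where
    -- all three fields of BS become separate loop parameters, so no
    -- ByteString is rebuilt on each iteration
    goW :: Int -> ForeignPtr Word8 -> Int -> Int -> Int
    goW !n fp' !o !l
      | l <= 0     = n
      | otherwise  = goW (n+1) fp' (o+1) (l-1)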

Now, let's try the same with lazy ByteStrings:

data ByteString
   =  Empty
   |  Chunk {-# UNPACK #-} !StrictBS.ByteString ByteString

This of course has two constructors. It's built for call-pattern specialisation.

Now the ops:

null :: ByteString -> Bool
null Empty = True
null _     = False

tail :: ByteString -> ByteString
tail Empty = error "empty tail"
tail (Chunk (S.BS fp s 1) cs) = cs
tail (Chunk (S.BS fp s l) cs) = Chunk (S.BS fp (s+1) (l-1)) cs

We can use exactly the same code for length, now at the lazy ByteString type:

length :: ByteString -> Int
length = go 0
  where
    go !n bs  | null bs    = n
              | otherwise  = go (n+1) (tail bs)

But, oh noes!! The optimisation does not work:

go :: Int# -> ByteString -> Int#
go = \n bs ->
  case bs of
    Empty                 -> n
    Chunk p fpc o l bs'  ->
      go (n +# 1)
         (case l of
            1 -> bs'
            _ -> Chunk p fpc (o +# 1) (l -# 1) bs')

length :: ByteString -> Int
length = \bs ->
  case go 0 bs of
    n -> I# n

However, this is not because call-pattern specialisation itself failed to do what we wanted. We know this for several reasons. First, if we remove the equation

tail (Chunk (S.BS fp s 1) cs) = cs

then call-pattern specialisation works and the code ends up as a perfect loop. Of course, we need that equation for correctness.
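
For concreteness, this is the simplified (and of course incorrect) tail that the previous sentence refers to: with the length-1 equation gone, tail of a Chunk is always just another Chunk, so after inlining the recursive call is go (n+1) (Chunk ...) and the call pattern is visible to SpecConstr.

tail :: ByteString -> ByteString
tail Empty                    = error "empty tail"
tail (Chunk (S.BS fp s l) cs) = Chunk (S.BS fp (s+1) (l-1)) cs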

Also, if we change the definition of lazy ByteString to a single-constructor type and represent the end by an empty chunk, then we still get essentially the same code.

Also, if we use uncons instead of null and tail, then we effectively perform the missing optimisation transformation by hand and get perfect code (and call-pattern specialisation happens exactly as expected).

length :: ByteString -> Int
length = go 0
  where
    go !n bs = case uncons bs of
      Nothing        ->  n
      Just (_, bs')  ->  go (n+1) bs'

uncons :: ByteString -> Maybe (Word8, ByteString)
uncons Empty = Nothing
uncons (Chunk c cs)
  | StrictBS.length c == 1
  = Just (S.unsafeHead c, cs)

  | otherwise
  = Just (S.unsafeHead c, Chunk (S.unsafeTail c) cs)

This version with uncons gives us perfect code:

go_chunk  :: ByteString -> Int# -> Int#
          -> ForeignPtrContents -> Addr# -> Int# -> Int#
go_chunk = \bs' l o fpc p n ->
  case l of
    1 -> go (n +# 1) bs'
    _ -> go_chunk bs' (l -# 1) (o +# 1) fpc p (n +# 1)

go :: Int# -> ByteString -> Int#
go = \n bs ->
  case bs of
    Empty -> n
    Chunk p fpc o l bs' ->
      case l of
        1 -> go (n +# 1) bs'
        _ -> go_chunk bs' (l -# 1) (o +# 1) fpc p (n +# 1)

length :: ByteString -> Int
length = \bs ->
  case go 0 bs of
    n -> I# n

and we can see the specialisation rule that GHC invented for us:

forall bs l o fpc p n.
     go n (Chunk p fpc o l bs)
  =  go_chunk bs l o fpc p n

Aside: looks like there's an extra/missing reverse in there somewhere.

The problem with the null / tail version is that the following transformation is never performed, and so the opportunity for call-pattern specialisation (or even simple worker/wrapper unpacking) is never exposed:

    go (n+1)
       (case l of
          1  -> bs'
          _  -> Chunk p fpc (o+1) (l-1) bs')
=
    case l of
      1  -> go (n+1) bs'
      _  -> go (n+1) (Chunk p fpc (o+1) (l-1) bs')

which is the fragment inside the inlined definition of the go worker function:

go !n bs = case bs of
  Empty                -> n
  Chunk p fpc o l bs'  ->
    go (n+1)
       (case l of
          1  -> bs'
          _  -> Chunk p fpc (o+1) (l-1) bs')

So go is tail-recursive and strict in both arguments. This situation will arise whenever we have something like

go (tail xs)

and tail involves a case analysis. Since go is just a function call, there is no problem in duplicating it into the two branches of tail. The fact that this enables call-pattern specialisation makes it a huge win.
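
As a minimal self-contained illustration of that general shape (Chunks, isDone, step and count are invented names for this sketch, not from bytestring): after inlining, the argument of the recursive call is a case expression, exactly as in the lazy ByteString code above.

-- a stand-in for the null/tail structure in this ticket
data Chunks = Done | Piece !Int Chunks

isDone :: Chunks -> Bool
isDone Done = True
isDone _    = False

-- like tail: involves a case analysis, with a constructor in one branch
step :: Chunks -> Chunks
step Done         = error "step: Done"
step (Piece 1 cs) = cs
step (Piece m cs) = Piece (m-1) cs

count :: Chunks -> Int
count = go 0
  where
    go !n cs
      | isDone cs  = n
      | otherwise  = go (n+1) (step cs)
    -- after inlining, the recursive call's argument is a case expression:
    --   go (n+1) (case m of { 1 -> cs'; _ -> Piece (m-1) cs' })
    -- pushing the call into the two branches would expose
    --   go (n+1) (Piece (m-1) cs')
    -- as a call pattern for SpecConstr, just as described above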

This will happen with lazy ByteStrings or any other chunked representation using trees. We really need the per-chunk inner loop to be good, with just a single test in the fast path to see if we're at the end of the chunk. When we get to the end of the chunk we can move into the slow path and do more cunning things with chunks, trees, and laziness. But the overall speed of these kinds of operations is determined by the quality of the inner loop. If that inner loop has to allocate a fresh constructor each time round then we lose.

In principle, if we can optimise perfectly then there is no reason for strict ByteString to have faster code than lazy ByteString, because their inner loops should look exactly the same.

Trac metadata

Trac field     Value
Version        6.10.1
Type           Bug
TypeOfFailure  OtherFailure
Priority       normal
Resolution     Unresolved
Component      Compiler