Skip to content

Define reducing operations in Data.Foldable in terms of foldl' instead of foldMap or foldl1

Summary

This is a performance issue for the base library. The following operations in Data.Foldable take O(n) amount of space and non-optimal amount of time, which makes them of little use in practice. They can lead to performance issues that makes people waste a lot of time in debugging.

Type class operations:

  • minimum
  • maximum
  • sum
  • product

Non type class operations:

  • minimumBy
  • maximumBy

Also, not having an INLINABLE or INLINE on these and other operations in Data.Foldable makes global optimization involving these operations impossible and therefore requires re-definition to achieve that.

Steps to reproduce

Run the benchmarks at this commit (https://github.com/composewell/streamly/tree/b50e1a52850f8fdd1e70797815047cc7333f743d) in the streamly repository using the following command:

cabal run linear -- serially/foldable

If you increase the size of the stream, the amount of memory required would increase linearly, the following command demonstrates that:

cabal run linear -- --stream-size 100000000 serially/foldable

Expected behavior

It should run in constant memory and with much better timings for these operations. When these operations are implemented using foldl' we can see the difference in performance by running the above commands on this commit https://github.com/composewell/streamly/tree/007df998b99bcbea168763fccd79d49164ab04c9 .

The comparison of benchmarks can be seen in this pull request - https://github.com/composewell/streamly/pull/416 . In this PR we could fix the performance of those operations which are part of the Foldable type class, but we cannot fix minimumBy and maximumBy because they are not in the type class. However, they can also be fixed generically if we use foldl' instead of foldl1.

Our definitions of the type class operations can be found here: https://github.com/composewell/streamly/blob/04f84d34598b8d3f125b4919f51afb4b233f281e/src/Streamly/Internal/Data/Stream/Instances.hs#L138

Related Issues

Similar issues seem to have been discussed in ghc ticket #10830 (closed) specifically in this comment https://gitlab.haskell.org//ghc/ghc/issues/10830#note_129843.

Is there a good reason to use foldMap, foldl1 for these operations instead of using foldl'?

I understand that some of these operations can terminate early in some cases e.g. product can terminate if it becomes 0. However, those cases cannot be used as an excuse to penalize all other cases heavily. I think, in practice, it makes better sense to implement these using a strict left fold by default and if early termination cases are important to someone they can use custom definitions.

At the very least, the foldl1 in maximumBy and minimumBy can be replaced with foldl'.

Environment

  • GHC version used: 8.8.2

Optional:

  • Operating System: Mac OS
  • System Architecture: x86_64
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information