List `break` and `span` should provide stricter versions

GHC.List currently provides strict versions of several standard functions (foldl, etc), but this is not the case for break and span.

The provided implementation of break is

break _ xs@[]           =  (xs, xs)
break p xs@(x:xs')
           | p x        =  ([],xs)
           | otherwise  =  let (ys,zs) = break p xs' in (x:ys,zs)

which could be made more strict on the let-bound tuple

break' _ xs@[]           =  (xs, xs)
break' p xs@(x:xs')
            | p x        =  ([],xs)
            | otherwise  =  let !(ys,zs) = break p xs' in (x:ys,zs)

and similarly for span

span'                    :: (a -> Bool) -> [a] -> ([a],[a])
span' _ xs@[]            =  (xs, xs)
span' p xs@(x:xs')
          | p x          =  let !(ys,zs) = span p xs' in (x:ys,zs)
          | otherwise    =  ([],xs)

STG

When we compare the STG from each version, the strict version seems to (as expected) have fewer allocations per level of recursion.

break':

Rec {
Main.$wbreak' [InlPrag=[2], Occ=LoopBreaker]
  :: forall {t}. (t -> GHC.Types.Bool) -> [t] -> (# [t], [t] #)
[GblId[StrictWorker([~, !])],
 Arity=2,
 Str=<LC(S,L)><1L>,
 Unf=OtherCon []] =
    {} \r [ds_s4KF xs_s4KG]
        case xs_s4KG<TagProper> of wild_s4KH [Occ=Once1] {
          [] -> (#,#) [GHC.Types.[] GHC.Types.[]];
          : x_s4KI xs1_s4KJ [Occ=Once1] ->
              case ds_s4KF x_s4KI of {
                GHC.Types.False ->
                    case
                        case xs1_s4KJ of xs1_t4MU {
                        __DEFAULT -> Main.$wbreak' ds_s4KF xs1_t4MU;
                        }
                    of
                    {
                    (#,#) ww_s4KM [Occ=Once1] ww1_s4KN [Occ=Once1] ->
                    let {
                      sat_s4KO [Occ=Once1] :: [t_s4F6]
                      [LclId] =
                          :! [x_s4KI ww_s4KM];
                    } in  (#,#) [sat_s4KO ww1_s4KN];
                    };
                GHC.Types.True -> (#,#) [GHC.Types.[] wild_s4KH];
              };
        };
end Rec }

break:

Rec {
GHC.List.$wbreak [InlPrag=[2], Occ=LoopBreaker]
  :: forall {a}. (a -> GHC.Types.Bool) -> [a] -> (# [a], [a] #)
[GblId[StrictWorker([~, !])],
 Arity=2,
 Str=<LC(S,L)><1L>,
 Unf=OtherCon []] =
    {} \r [ds_s35A xs_s35B]
        case xs_s35B<TagProper> of wild_s35C [Occ=Once1] {
          [] -> (#,#) [GHC.Types.[] GHC.Types.[]];
          : x_s35D xs'_s35E [Occ=Once1] ->
              case ds_s35A x_s35D of {
                GHC.Types.False ->
                    let {
                      ds1_s35G [Dmd=LP(ML,ML)] :: ([a_s2Jt], [a_s2Jt])
                      [LclId] =
                          {ds_s35A, xs'_s35E} \u []
                              case
                                  case xs'_s35E of xs'_t3in [Occ=Once1] {
                                  __DEFAULT -> GHC.List.$wbreak ds_s35A xs'_t3in;
                                  }
                              of
                              {
                              (#,#) ww_s35I [Occ=Once1] ww1_s35J [Occ=Once1] ->
                              (,) [ww_s35I ww1_s35J];
                              }; } in
                    let {
                      sat_s35S [Occ=Once1] :: [a_s2Jt]
                      [LclId] =
                          {ds1_s35G} \u []
                              case ds1_s35G of {
                              (,) _ [Occ=Dead] zs_s35R [Occ=Once1] -> zs_s35R;
                              }; } in
                    let {
                      sat_s35N [Occ=Once1] :: [a_s2Jt]
                      [LclId] =
                          {ds1_s35G} \u []
                              case ds1_s35G of {
                              (,) ys_s35L [Occ=Once1] _ [Occ=Dead] -> ys_s35L;
                              }; } in
                    let {
                      sat_s35O [Occ=Once1] :: [a_s2Jt]
                      [LclId] =
                          :! [x_s35D sat_s35N];
                    } in  (#,#) [sat_s35O sat_s35S];
                GHC.Types.True -> (#,#) [GHC.Types.[] wild_s35C];
              };
        };
end Rec }

Microbenchmarks

For simple predicates, in this case just comparing the value of Int list elements, microbenchmarks are noticeably in favour of the stricter version of break. Since Criterion will fully evaluate the output lists, this does assume that all elements are consumed. Because of this, I don't expect it to be a universal optimisation, but I do think it would be selectively useful.

My microbenchmarks test breaking at the start, middle, and end of a 100,000 element list of Ints, and all 3 cases appear in favour of the stricter version. Start:

benchmarking 100000/zero/break
time                 349.1 μs   (346.1 μs .. 352.6 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 349.1 μs   (347.0 μs .. 351.1 μs)
std dev              7.036 μs   (5.702 μs .. 8.583 μs)
variance introduced by outliers: 12% (moderately inflated)

benchmarking 100000/zero/break'
time                 302.2 μs   (298.7 μs .. 305.3 μs)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 298.3 μs   (296.1 μs .. 301.0 μs)
std dev              7.848 μs   (6.737 μs .. 9.845 μs)
variance introduced by outliers: 20% (moderately inflated)

Middle:

benchmarking 100000/half/break
time                 9.153 ms   (8.632 ms .. 9.503 ms)
                     0.988 R²   (0.964 R² .. 0.999 R²)
mean                 9.331 ms   (9.082 ms .. 9.453 ms)
std dev              466.1 μs   (235.1 μs .. 750.4 μs)
variance introduced by outliers: 24% (moderately inflated)

benchmarking 100000/half/break'
time                 4.496 ms   (4.457 ms .. 4.532 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 4.469 ms   (4.447 ms .. 4.491 ms)
std dev              69.81 μs   (53.07 μs .. 93.55 μs)

End:

benchmarking 100000/full/break
time                 19.32 ms   (18.49 ms .. 20.32 ms)
                     0.989 R²   (0.979 R² .. 0.996 R²)
mean                 16.99 ms   (15.56 ms .. 18.08 ms)
std dev              2.941 ms   (1.791 ms .. 3.642 ms)
variance introduced by outliers: 73% (severely inflated)

benchmarking 100000/full/break'
time                 9.800 ms   (9.611 ms .. 10.04 ms)
                     0.995 R²   (0.992 R² .. 0.998 R²)
mean                 8.634 ms   (8.159 ms .. 8.947 ms)
std dev              1.069 ms   (772.0 μs .. 1.440 ms)
variance introduced by outliers: 67% (severely inflated)

Building Cabal

By building the Cabal library (https://gitlab.haskell.org/ghc/ghc/-/wikis/debugging/cabal-test), I don't find any evidence that break' should be the default in GHC. My tests compared universally replacing the implementation of break with the stricter version.

Lazy:

  18,821,715,760 bytes allocated in the heap
   1,595,627,384 bytes copied during GC
     126,869,048 bytes maximum residency (11 sample(s))
       5,161,416 bytes maximum slop
             336 MiB total memory in use (0 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       534 colls,     0 par    4.418s   4.421s     0.0083s    0.0736s
  Gen  1        11 colls,     0 par    1.609s   1.610s     0.1463s    0.3249s

  TASKS: 5 (1 bound, 4 peak workers (4 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time   27.753s  ( 32.008s elapsed)
  GC      time    6.027s  (  6.031s elapsed)
  EXIT    time    0.001s  (  0.001s elapsed)
  Total   time   33.783s  ( 38.040s elapsed)

  Alloc rate    678,181,401 bytes per MUT second

  Productivity  82.2% of total user, 84.1% of total elapsed

Stricter:

  18,844,871,488 bytes allocated in the heap
   1,476,461,184 bytes copied during GC
     134,575,864 bytes maximum residency (11 sample(s))
       5,196,688 bytes maximum slop
             366 MiB total memory in use (0 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       452 colls,     0 par    4.010s   4.013s     0.0089s    0.0764s
  Gen  1        11 colls,     0 par    1.691s   1.692s     0.1538s    0.3884s

  TASKS: 5 (1 bound, 4 peak workers (4 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time   27.654s  ( 32.686s elapsed)
  GC      time    5.701s  (  5.704s elapsed)
  EXIT    time    0.001s  (  0.001s elapsed)
  Total   time   33.358s  ( 38.392s elapsed)

  Alloc rate    681,448,553 bytes per MUT second

  Productivity  82.9% of total user, 85.1% of total elapsed

This test also shows a possible memory usage disadvantage on real-world cases.

Conclusion

Providing stricter versions of break and span is consistent with the rest of the GHC.List module. The generated STG and microbenchmarks are in favour of the stricter version of break, but universally replacing the implementation doesn't have a measurable effect on time performance in a real world test, with a possible negative effect on memory performance in the same real world test. So, these versions should be added for possible usage in future, especially if profiling finds any pathological uses of the lazy version in GHC.

Edited Feb 13, 2023 by Josh Meredith

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information