List `break` and `span` should provide stricter versions
GHC.List
currently provides strict versions of several standard functions (foldl
, etc), but this is not the case for break
and span
.
The provided implementation of break
is
break _ xs@[] = (xs, xs)
break p xs@(x:xs')
| p x = ([],xs)
| otherwise = let (ys,zs) = break p xs' in (x:ys,zs)
which could be made more strict on the let-bound tuple
break' _ xs@[] = (xs, xs)
break' p xs@(x:xs')
| p x = ([],xs)
| otherwise = let !(ys,zs) = break p xs' in (x:ys,zs)
and similarly for span
span' :: (a -> Bool) -> [a] -> ([a],[a])
span' _ xs@[] = (xs, xs)
span' p xs@(x:xs')
| p x = let !(ys,zs) = span p xs' in (x:ys,zs)
| otherwise = ([],xs)
STG
When we compare the STG from each version, the strict version seems to (as expected) have fewer allocations per level of recursion.
break'
:
Rec {
Main.$wbreak' [InlPrag=[2], Occ=LoopBreaker]
:: forall {t}. (t -> GHC.Types.Bool) -> [t] -> (# [t], [t] #)
[GblId[StrictWorker([~, !])],
Arity=2,
Str=<LC(S,L)><1L>,
Unf=OtherCon []] =
{} \r [ds_s4KF xs_s4KG]
case xs_s4KG<TagProper> of wild_s4KH [Occ=Once1] {
[] -> (#,#) [GHC.Types.[] GHC.Types.[]];
: x_s4KI xs1_s4KJ [Occ=Once1] ->
case ds_s4KF x_s4KI of {
GHC.Types.False ->
case
case xs1_s4KJ of xs1_t4MU {
__DEFAULT -> Main.$wbreak' ds_s4KF xs1_t4MU;
}
of
{
(#,#) ww_s4KM [Occ=Once1] ww1_s4KN [Occ=Once1] ->
let {
sat_s4KO [Occ=Once1] :: [t_s4F6]
[LclId] =
:! [x_s4KI ww_s4KM];
} in (#,#) [sat_s4KO ww1_s4KN];
};
GHC.Types.True -> (#,#) [GHC.Types.[] wild_s4KH];
};
};
end Rec }
break
:
Rec {
GHC.List.$wbreak [InlPrag=[2], Occ=LoopBreaker]
:: forall {a}. (a -> GHC.Types.Bool) -> [a] -> (# [a], [a] #)
[GblId[StrictWorker([~, !])],
Arity=2,
Str=<LC(S,L)><1L>,
Unf=OtherCon []] =
{} \r [ds_s35A xs_s35B]
case xs_s35B<TagProper> of wild_s35C [Occ=Once1] {
[] -> (#,#) [GHC.Types.[] GHC.Types.[]];
: x_s35D xs'_s35E [Occ=Once1] ->
case ds_s35A x_s35D of {
GHC.Types.False ->
let {
ds1_s35G [Dmd=LP(ML,ML)] :: ([a_s2Jt], [a_s2Jt])
[LclId] =
{ds_s35A, xs'_s35E} \u []
case
case xs'_s35E of xs'_t3in [Occ=Once1] {
__DEFAULT -> GHC.List.$wbreak ds_s35A xs'_t3in;
}
of
{
(#,#) ww_s35I [Occ=Once1] ww1_s35J [Occ=Once1] ->
(,) [ww_s35I ww1_s35J];
}; } in
let {
sat_s35S [Occ=Once1] :: [a_s2Jt]
[LclId] =
{ds1_s35G} \u []
case ds1_s35G of {
(,) _ [Occ=Dead] zs_s35R [Occ=Once1] -> zs_s35R;
}; } in
let {
sat_s35N [Occ=Once1] :: [a_s2Jt]
[LclId] =
{ds1_s35G} \u []
case ds1_s35G of {
(,) ys_s35L [Occ=Once1] _ [Occ=Dead] -> ys_s35L;
}; } in
let {
sat_s35O [Occ=Once1] :: [a_s2Jt]
[LclId] =
:! [x_s35D sat_s35N];
} in (#,#) [sat_s35O sat_s35S];
GHC.Types.True -> (#,#) [GHC.Types.[] wild_s35C];
};
};
end Rec }
Microbenchmarks
For simple predicates, in this case just comparing the value of Int
list elements, microbenchmarks are noticeably in favour of the stricter version of break. Since Criterion will fully evaluate the output lists, this does assume that all elements are consumed. Because of this, I don't expect it to be a universal optimisation, but I do think it would be selectively useful.
My microbenchmarks test breaking at the start, middle, and end of a 100,000 element list of Int
s, and all 3 cases appear in favour of the stricter version.
Start:
benchmarking 100000/zero/break
time 349.1 μs (346.1 μs .. 352.6 μs)
0.999 R² (0.999 R² .. 1.000 R²)
mean 349.1 μs (347.0 μs .. 351.1 μs)
std dev 7.036 μs (5.702 μs .. 8.583 μs)
variance introduced by outliers: 12% (moderately inflated)
benchmarking 100000/zero/break'
time 302.2 μs (298.7 μs .. 305.3 μs)
0.999 R² (0.998 R² .. 1.000 R²)
mean 298.3 μs (296.1 μs .. 301.0 μs)
std dev 7.848 μs (6.737 μs .. 9.845 μs)
variance introduced by outliers: 20% (moderately inflated)
Middle:
benchmarking 100000/half/break
time 9.153 ms (8.632 ms .. 9.503 ms)
0.988 R² (0.964 R² .. 0.999 R²)
mean 9.331 ms (9.082 ms .. 9.453 ms)
std dev 466.1 μs (235.1 μs .. 750.4 μs)
variance introduced by outliers: 24% (moderately inflated)
benchmarking 100000/half/break'
time 4.496 ms (4.457 ms .. 4.532 ms)
1.000 R² (0.999 R² .. 1.000 R²)
mean 4.469 ms (4.447 ms .. 4.491 ms)
std dev 69.81 μs (53.07 μs .. 93.55 μs)
End:
benchmarking 100000/full/break
time 19.32 ms (18.49 ms .. 20.32 ms)
0.989 R² (0.979 R² .. 0.996 R²)
mean 16.99 ms (15.56 ms .. 18.08 ms)
std dev 2.941 ms (1.791 ms .. 3.642 ms)
variance introduced by outliers: 73% (severely inflated)
benchmarking 100000/full/break'
time 9.800 ms (9.611 ms .. 10.04 ms)
0.995 R² (0.992 R² .. 0.998 R²)
mean 8.634 ms (8.159 ms .. 8.947 ms)
std dev 1.069 ms (772.0 μs .. 1.440 ms)
variance introduced by outliers: 67% (severely inflated)
Building Cabal
By building the Cabal library (https://gitlab.haskell.org/ghc/ghc/-/wikis/debugging/cabal-test), I don't find any evidence that break'
should be the default in GHC. My tests compared universally replacing the implementation of break with the stricter version.
Lazy:
18,821,715,760 bytes allocated in the heap
1,595,627,384 bytes copied during GC
126,869,048 bytes maximum residency (11 sample(s))
5,161,416 bytes maximum slop
336 MiB total memory in use (0 MiB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 534 colls, 0 par 4.418s 4.421s 0.0083s 0.0736s
Gen 1 11 colls, 0 par 1.609s 1.610s 0.1463s 0.3249s
TASKS: 5 (1 bound, 4 peak workers (4 total), using -N1)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.001s ( 0.001s elapsed)
MUT time 27.753s ( 32.008s elapsed)
GC time 6.027s ( 6.031s elapsed)
EXIT time 0.001s ( 0.001s elapsed)
Total time 33.783s ( 38.040s elapsed)
Alloc rate 678,181,401 bytes per MUT second
Productivity 82.2% of total user, 84.1% of total elapsed
Stricter:
18,844,871,488 bytes allocated in the heap
1,476,461,184 bytes copied during GC
134,575,864 bytes maximum residency (11 sample(s))
5,196,688 bytes maximum slop
366 MiB total memory in use (0 MiB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 452 colls, 0 par 4.010s 4.013s 0.0089s 0.0764s
Gen 1 11 colls, 0 par 1.691s 1.692s 0.1538s 0.3884s
TASKS: 5 (1 bound, 4 peak workers (4 total), using -N1)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.001s ( 0.001s elapsed)
MUT time 27.654s ( 32.686s elapsed)
GC time 5.701s ( 5.704s elapsed)
EXIT time 0.001s ( 0.001s elapsed)
Total time 33.358s ( 38.392s elapsed)
Alloc rate 681,448,553 bytes per MUT second
Productivity 82.9% of total user, 85.1% of total elapsed
This test also shows a possible memory usage disadvantage on real-world cases.
Conclusion
Providing stricter versions of break
and span
is consistent with the rest of the GHC.List
module. The generated STG and microbenchmarks are in favour of the stricter version of break
, but universally replacing the implementation doesn't have a measurable effect on time performance in a real world test, with a possible negative effect on memory performance in the same real world test. So, these versions should be added for possible usage in future, especially if profiling finds any pathological uses of the lazy version in GHC.