Poor performance for "sequential" forkIOs

The ghcide developers have noticed that overall performance is quite poor with the default Shake scheduler (see #18221, https://github.com/ndmitchell/shake/pull/751, https://github.com/digital-asset/ghcide/issues/503). My understanding of the default scheduler is needing a target results in a new thread being forkIO'd and then the caller blocking on an MVar (or some other sort of blocking synchronization primitive).

Intriguingly, they have noticed that rather using forkOn instead of forkIO improves the situation markedly. @wz1000 provided two eventlogs of similar workloads under the forkIO and forkOn schemes. Comparing the two, the problem becomes pretty clear:

`ghcide` with `forkOn`

Here we see a single hover event when running with the forkOn scheme, where all threads are forked on capability 0:

On the whole it's clear that there is little parallelism available here; we only ever have one core working at any given moment.

`ghcide` with `forkIO`

Here we see a single hover event with the much less efficient forkIO scheme:

Zooming in further reveals what is happening:

Here the blue lines here denote thread creations; the red lines denote thread migrations. It seems that every time a Shake action needs a rule, we kick off a new Haskell thread, push it to another capability's run queue, and (nearly) immediately park the capability we are running on, resulting a "stair-stepping" effect in the eventlog. Between each pair of "stairs" in this eventlog there is an idle period of anywhere from 10 to 500 microseconds. In text, the events above look like this:

...
41577543192: cap 0: running thread 15902
41577622290: cap 0: creating thread 15903
41577631265: cap 0: stopping thread 15902 (thread finished)
41577632790: cap 0: running thread 15903
41577637059: cap 0: stopping thread 15903 (thread yielding)
41577637660: cap 0: running thread 15903
41577671853: cap 0: creating thread 15904
41577674842: cap 0: stopping thread 15903 (thread finished)
41577675399: cap 0: running thread 15904
41577681143: cap 0: stopping thread 15904 (thread yielding)
41577681726: cap 0: running thread 15904
41577710211: cap 0: creating thread 15905
41577713041: cap 0: stopping thread 15904 (thread yielding)
41577714797: cap 0: migrating thread 15905 to cap 1
41577730170: cap 0: running thread 15904
41577731107: cap 0: stopping thread 15904 (thread finished)
41577788360: cap 1: running thread 15905
...

To service this one hover request there are likely over 400 such context switches. Assuming an average of 100us per context switch, this would result in nearly 40ms of lost time. Terrible.

Edited May 23, 2020 by Ben Gamari

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information

Poor performance for "sequential" forkIOs

ghcide with forkOn

ghcide with forkIO

`ghcide` with `forkOn`

`ghcide` with `forkIO`