winio: Initial performance investigation
Now that all the correctness issues are taken care of in order to make progress on #20255, I am looking at the performance of WinIO.
To do so I'm using a rather pathological case. As my benchmark I'm using:

```haskell
import System.Environment

main :: IO ()
main = do
  args <- getArgs
  buf  <- readFile $ head args
  print (show $ length buf)
```

which very inefficiently reads an entire file into memory and prints out the length of the file.
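For contrast, a less pathological variant (my own sketch, not part of the benchmark above) would read the file as a strict `ByteString`, avoiding one lazy cons cell per byte; I keep the `String` version above precisely because it stresses the I/O path so hard:

```haskell
import System.Environment
import qualified Data.ByteString as BS

-- Strict read: one contiguous buffer for the whole file,
-- instead of a lazy String built cell by cell.
fileLength :: FilePath -> IO Int
fileLength path = BS.length <$> BS.readFile path

main :: IO ()
main = do
  args <- getArgs
  n    <- fileLength (head args)
  print n
```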
As my test files I use dd to create files of various sizes from 1mb to 1000mb:

```
dd if=/dev/zero of=test-big.img bs=1024 count=0 seek=1024000
```
Note that this benchmark is very synthetic, because the application does nothing but I/O. In a real application there is other work to do, which flattens the difference; GHCi, for instance, feels more responsive than before.
On the smaller files WinIO and MIO are pretty much on par: the I/O itself takes very little time and most of the time is spent in Haskell. So for now the interesting sizes are the 300mb-and-up cases, where the difference starts showing. Let's start with the 1GB file.
Threaded RTS
Let's start with the threaded runtime, as I think that's easier to explain.
```
Phyx@Rage /c/U/x/s/r/ghc (wip/T20955)> time ./testImg-thr.exe test-big.img +RTS --io-manager=posix
"1048576000"

________________________________________________________
Executed in    6.29 secs      fish           external
   usr time   15.00 millis    0.00 millis   15.00 millis
   sys time   16.00 millis   16.00 millis    0.00 millis

Phyx@Rage /c/U/x/s/r/ghc (wip/T20955)> time ./testImg-thr.exe test-big.img +RTS --io-manager=native
"1048576000"

________________________________________________________
Executed in    8.42 secs      fish           external
   usr time   15.00 millis    0.00 micros   15.00 millis
   sys time   46.00 millis    0.00 micros   46.00 millis
```
On a 1000mb image MIO is beating WinIO; the question is why.
Results show something interesting:
It looks like the majority of the time is spent in the Haskell RTS sleeping on a condition variable, while the actual I/O operations are pretty much at 100% efficiency. Drilling down a bit more, it looks like this is coming from the scheduler waiting to yield:
My interpretation of this is that, since not a lot of time is spent inside `GetQueuedCompletionStatusEx`, the remaining time is spent inside the `processCompletion` routine, during which the Haskell RTS is waiting for the I/O to be processed. In fact it looks like nearly 50% of the CPU time is spent sleeping in the scheduler.
My initial conclusion is that this can be fixed by finishing the implementation of the I/O threadpool manager. This would allow more worker threads to concurrently process the I/O requests so the Haskell side has more to do.
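As a rough sketch of what I mean by a threadpool (the `Completion` type, the queue, and `startWorkers` are all hypothetical names, not the actual WinIO types): fan the completion processing out over several forked workers so that it can overlap with mutator work instead of serialising behind one manager thread:

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.STM
import Control.Monad (forever, replicateM_)

-- Hypothetical stand-in for a dequeued OVERLAPPED completion entry.
newtype Completion = Completion Int

-- Start n workers that each block on a shared queue and process
-- completions concurrently, rather than a single manager thread
-- running all completion processing serially.
startWorkers :: Int -> TQueue Completion -> (Completion -> IO ()) -> IO ()
startWorkers n q process =
  replicateM_ n . forkIO . forever $ do
    c <- atomically (readTQueue q)  -- wait for the next completion
    process c
```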
Secondly, it needs to be checked whether `processCompletion` can't complete quicker. Right now there's a bunch of code that may not need to be executed on every completion. Failing that, the core parts should maybe be moved into STG, but that's an extreme hammer.
Non-Threaded
By far the biggest difference is in the non-threaded runtime. This is not unexpected: the previous I/O manager would complete the I/O entirely in C code, whereas now, being in Haskell, we're subjected to GC and scheduling. This is however a problem, as the non-threaded runtime is still the default, so most people will experience this mode.
To show the problem I have to drop down to using the smaller test file of 300mb:
```
Phyx@Rage /c/U/x/s/r/ghc (wip/T20955)> time ./testImg.exe test.img +RTS --io-manager=native
"309657600"

________________________________________________________
Executed in   64.70 secs      fish           external
   usr time    0.00 millis    0.00 micros    0.00 millis
   sys time   15.00 millis    0.00 micros   15.00 millis

Phyx@Rage /c/U/x/s/r/ghc (wip/T20955)> time ./testImg.exe test.img +RTS --io-manager=posix
"309657600"

________________________________________________________
Executed in    2.88 secs      fish           external
   usr time   46.00 millis    0.00 micros   46.00 millis
   sys time   15.00 millis    0.00 micros   15.00 millis
```
So where does this immense galaxy of difference come from?
Initial investigations show this is entirely due to being punished by the scheduler/GC:
Drilling down a bit deeper shows the following flame graph
What looks to be happening is that after every single I/O action we enter `schedule`, which, because we ran out of work again (since we didn't check with the I/O manager), triggers a GC. That means that after reading every 4k chunk of data we trigger a full GC/evac, which eats away at the runtime.
An example call stack sort of confirms this (as far as I can tell).
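A quick back-of-the-envelope check also makes this plausible (assuming one read per 4 KiB chunk, as described above): the 300mb file amounts to tens of thousands of reads, which is the same order of magnitude as the 37,219 Gen 1 collections the `-s` output further down reports for the native manager:

```haskell
-- Number of 4 KiB reads needed to consume the 309,657,600-byte
-- test file; same ballpark as the reported Gen 1 collection count.
reads4k :: Int
reads4k = 309657600 `div` 4096  -- 75,600 reads
```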
To fix this colossal difference in performance we need to prevent a GC after every I/O action. Looking at the RTS statistics confirms this:
```
Phyx@Rage /c/U/x/s/r/ghc (wip/T20955) [1]> time ./testImg.exe test.img +RTS --io-manager=posix -T -s
"309657600"
  12,517,013,392 bytes allocated in the heap
          95,888 bytes copied during GC
          53,288 bytes maximum residency (2 sample(s))
          32,728 bytes maximum slop
               6 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      2995 colls,     0 par    0.297s   0.259s     0.0001s    0.0003s
  Gen  1         2 colls,     0 par    0.000s   0.001s     0.0004s    0.0005s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    1.609s  (  2.530s elapsed)
  GC      time    0.297s  (  0.260s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    1.906s  (  2.790s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    7,777,561,719 bytes per MUT second

  Productivity  84.4% of total user, 90.7% of total elapsed

________________________________________________________
Executed in    2.89 secs      fish           external
   usr time   15.00 millis    0.00 micros   15.00 millis
   sys time   46.00 millis    0.00 micros   46.00 millis

Phyx@Rage /c/U/x/s/r/ghc (wip/T20955)> time ./testImg.exe test.img +RTS --io-manager=native -T -s
"309657600"
  12,624,620,184 bytes allocated in the heap
       2,382,032 bytes copied during GC
       1,871,008 bytes maximum residency (37219 sample(s))
          62,304 bytes maximum slop
               9 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1     37219 colls,     0 par   61.484s  61.498s     0.0017s    0.0123s

  INIT    time    0.000s  (  0.001s elapsed)
  MUT     time    2.531s  (  2.682s elapsed)
  GC      time   61.484s  ( 61.498s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time   64.016s  ( 64.181s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    4,987,504,270 bytes per MUT second

  Productivity   4.0% of total user, 4.2% of total elapsed

________________________________________________________
Executed in   64.29 secs      fish           external
   usr time   15.00 millis    0.00 micros   15.00 millis
   sys time   46.00 millis    0.00 micros   46.00 millis
```
The entire time is spent in idle GC.
Perhaps we can skip the GC when the work queue is empty but we still have I/O outstanding?
GC is however not my area of expertise.
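The condition I have in mind would look something like the following (a sketch only; the real decision lives in the C scheduler, and these names are made up for illustration):

```haskell
-- Sketch: only fall back to an idle/major GC when there is truly
-- nothing left to do, i.e. the run queue is empty *and* no I/O
-- requests are still outstanding with the I/O manager. Today the
-- second condition is effectively ignored on the non-threaded RTS.
shouldIdleGC :: Int  -- ^ length of the scheduler's run queue
             -> Int  -- ^ outstanding I/O requests
             -> Bool
shouldIdleGC runQueueLen outstandingIO =
  runQueueLen == 0 && outstandingIO == 0
```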