winio: Initial performance investigation
Now that all the correctness issues are taken care of in order to make progress on #20255, I am looking at the performance of WinIO.
To do so I'm using a rather pathological case. As my benchmark I'm using:

```haskell
import System.Environment

main :: IO ()
main = do
  args <- getArgs
  buf  <- readFile $ head args
  print (show $ length buf)
```

which very inefficiently reads an entire file into memory and prints out the length of the file.
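For contrast, a less pathological variant (my own sketch, not part of the benchmark above) would read the file as a strict `ByteString`, avoiding one lazy cons cell per byte; I keep the `String` version above precisely because it stresses the I/O path so hard:

```haskell
import System.Environment
import qualified Data.ByteString as BS

-- Strict read: one contiguous buffer for the whole file,
-- instead of a lazy String built cell by cell.
fileLength :: FilePath -> IO Int
fileLength path = BS.length <$> BS.readFile path

main :: IO ()
main = do
  args <- getArgs
  n    <- fileLength (head args)
  print n
```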
As my test files I use dd to create files of various sizes from 1mb to 1000mb:

```
dd if=/dev/zero of=test-big.img bs=1024 count=0 seek=1024000
```
Note that this benchmark is very synthetic, because the application does nothing but I/O. In a real application there is other work to do, which flattens the difference; GHCi, for instance, feels more responsive than before.
On the smaller files WinIO and MIO are pretty much on par: the I/O itself takes very little time and most of the time is spent in Haskell. So for now the interesting sizes are the 300mb-and-up cases, where the difference starts showing. Let's start with the 1GB file.
Threaded RTS
Let's start with the threaded runtime, as I think that's easier to explain.
```
Phyx@Rage /c/U/x/s/r/ghc (wip/T20955)> time ./testImg-thr.exe test-big.img +RTS --io-manager=posix
"1048576000"

________________________________________________________
Executed in    6.29 secs      fish           external
   usr time   15.00 millis    0.00 millis   15.00 millis
   sys time   16.00 millis   16.00 millis    0.00 millis

Phyx@Rage /c/U/x/s/r/ghc (wip/T20955)> time ./testImg-thr.exe test-big.img +RTS --io-manager=native
"1048576000"

________________________________________________________
Executed in    8.42 secs      fish           external
   usr time   15.00 millis    0.00 micros   15.00 millis
   sys time   46.00 millis    0.00 micros   46.00 millis
```
On a 1000mb image MIO is beating WinIO; the question is why.
Results show something interesting:
It looks like the majority of the time is spent in the Haskell RTS sleeping on a condition variable, while the actual I/O operations are pretty much at 100% efficiency. Drilling down a bit more, it looks like this is coming from the scheduler waiting to yield:
My interpretation of this is that, since not a lot of time is spent inside `GetQueuedCompletionStatusEx`, the remaining time is spent inside the `processCompletion` routine, during which the Haskell RTS is waiting for the I/O to be processed. In fact it looks like nearly 50% of the CPU time is spent sleeping in the scheduler.
My initial conclusion is that this can be fixed by finishing the implementation of the I/O threadpool manager. This would allow more worker threads to concurrently process the I/O requests so the Haskell side has more to do.
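As a rough sketch of what I mean by a threadpool (the `Completion` type, the queue, and `startWorkers` are all hypothetical names, not the actual WinIO types): fan the completion processing out over several forked workers so that it can overlap with mutator work instead of serialising behind one manager thread:

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.STM
import Control.Monad (forever, replicateM_)

-- Hypothetical stand-in for a dequeued OVERLAPPED completion entry.
newtype Completion = Completion Int

-- Start n workers that each block on a shared queue and process
-- completions concurrently, rather than a single manager thread
-- running all completion processing serially.
startWorkers :: Int -> TQueue Completion -> (Completion -> IO ()) -> IO ()
startWorkers n q process =
  replicateM_ n . forkIO . forever $ do
    c <- atomically (readTQueue q)  -- wait for the next completion
    process c
```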
Secondly, it needs to be checked whether `processCompletion` can't complete quicker. Right now there's a bunch of code that may not need to be executed on every completion. Failing that, the core parts should maybe be moved into STG, but that's an extreme hammer.
Non-Threaded
By far the biggest difference is in the non-threaded runtime. This is not unexpected: the previous I/O manager would complete the I/O entirely in C code, whereas now, being in Haskell, we're subjected to GC and scheduling. This is however a problem, as the non-threaded runtime is still the default, so most people will experience this mode.
To show the problem I have to drop down to using the smaller test file of 300mb:
```
Phyx@Rage /c/U/x/s/r/ghc (wip/T20955)> time ./testImg.exe test.img +RTS --io-manager=native
"309657600"

________________________________________________________
Executed in   64.70 secs      fish           external
   usr time    0.00 millis    0.00 micros    0.00 millis
   sys time   15.00 millis    0.00 micros   15.00 millis

Phyx@Rage /c/U/x/s/r/ghc (wip/T20955)> time ./testImg.exe test.img +RTS --io-manager=posix
"309657600"

________________________________________________________
Executed in    2.88 secs      fish           external
   usr time   46.00 millis    0.00 micros   46.00 millis
   sys time   15.00 millis    0.00 micros   15.00 millis
```
So where does this immense galaxy of difference come from?
Initial investigations show this is entirely due to being punished by the scheduler/GC:
Drilling down a bit deeper shows the following flame graph
What looks to be happening is that after every single I/O action we enter `schedule`, which, because we ran out of work again (since we didn't check with the I/O manager), triggers a GC. That means that after reading every 4k chunk of data we trigger a full GC/evac, which eats away at the runtime.
An example call stack sort of confirms this (as far as I can tell).
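A quick back-of-the-envelope check also makes this plausible (assuming one read per 4 KiB chunk, as described above): the 300mb file amounts to tens of thousands of reads, which is the same order of magnitude as the 37,219 Gen 1 collections the `-s` output further down reports for the native manager:

```haskell
-- Number of 4 KiB reads needed to consume the 309,657,600-byte
-- test file; same ballpark as the reported Gen 1 collection count.
reads4k :: Int
reads4k = 309657600 `div` 4096  -- 75,600 reads
```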
To fix this colossal difference in performance we need to prevent a GC after every I/O action. Looking at the RTS statistics confirms this:
```
Phyx@Rage /c/U/x/s/r/ghc (wip/T20955) [1]> time ./testImg.exe test.img +RTS --io-manager=posix -T -s
"309657600"
  12,517,013,392 bytes allocated in the heap
          95,888 bytes copied during GC
          53,288 bytes maximum residency (2 sample(s))
          32,728 bytes maximum slop
               6 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      2995 colls,     0 par    0.297s   0.259s     0.0001s    0.0003s
  Gen  1         2 colls,     0 par    0.000s   0.001s     0.0004s    0.0005s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    1.609s  (  2.530s elapsed)
  GC      time    0.297s  (  0.260s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    1.906s  (  2.790s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    7,777,561,719 bytes per MUT second

  Productivity  84.4% of total user, 90.7% of total elapsed

________________________________________________________
Executed in    2.89 secs      fish           external
   usr time   15.00 millis    0.00 micros   15.00 millis
   sys time   46.00 millis    0.00 micros   46.00 millis

Phyx@Rage /c/U/x/s/r/ghc (wip/T20955)> time ./testImg.exe test.img +RTS --io-manager=native -T -s
"309657600"
  12,624,620,184 bytes allocated in the heap
       2,382,032 bytes copied during GC
       1,871,008 bytes maximum residency (37219 sample(s))
          62,304 bytes maximum slop
               9 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1     37219 colls,     0 par   61.484s  61.498s     0.0017s    0.0123s

  INIT    time    0.000s  (  0.001s elapsed)
  MUT     time    2.531s  (  2.682s elapsed)
  GC      time   61.484s  ( 61.498s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time   64.016s  ( 64.181s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    4,987,504,270 bytes per MUT second

  Productivity   4.0% of total user, 4.2% of total elapsed

________________________________________________________
Executed in   64.29 secs      fish           external
   usr time   15.00 millis    0.00 micros   15.00 millis
   sys time   46.00 millis    0.00 micros   46.00 millis
```
The entire time is spent in idle GC.
Perhaps we can skip the GC when the work queue is empty but we still have I/O outstanding?
GC is however not my area of expertise.
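The condition I have in mind would look something like the following (a sketch only; the real decision lives in the C scheduler, and these names are made up for illustration):

```haskell
-- Sketch: only fall back to an idle/major GC when there is truly
-- nothing left to do, i.e. the run queue is empty *and* no I/O
-- requests are still outstanding with the I/O manager. Today the
-- second condition is effectively ignored on the non-threaded RTS.
shouldIdleGC :: Int  -- ^ length of the scheduler's run queue
             -> Int  -- ^ outstanding I/O requests
             -> Bool
shouldIdleGC runQueueLen outstandingIO =
  runQueueLen == 0 && outstandingIO == 0
```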