40% to 100% slowdown from -threaded
Unfortunately, I can't reproduce this with GHC HEAD (ghc-9.3.20220316) and head.hackage, because I get a segfault (after applying a PR to sdl2-ttf that makes it compile), so the tests use GHC 9.2.2. To reproduce:
git clone git@github.com:LambdaHack/LambdaHack.git
cd LambdaHack
git checkout v0.11.0.0
cabal build
make bench
then change LambdaHack.cabal by adding -threaded, as in
common exe-options
ghc-options: -rtsopts -threaded
and then run again
cabal build
make bench
Depending on your version of the C libsdl2 libraries, this may or may not compile or run. If it doesn't, try the master branch instead of the v0.11.0.0 tag.
My results without -threaded:
~/r/LambdaHack$ make bench
$(cabal list-bin exe:LambdaHack) --dbgMsgSer --logPriority 4 --newGame 3 --noAnim --maxFps 100000 --frontendNull --benchmark --benchMessages --stopAfterFrames 1500 --automateAll --keepAutomated --gameMode battle --setDungeonRng "SMGen 127 123" --setMainRng "SMGen 127 125"
Session time: 0.927905003s; frames: 1500. Average clips per second: 6509.287028814522. Average FPS: 1616.5447919241362.
$(cabal list-bin exe:LambdaHack) --dbgMsgSer --logPriority 4 --newGame 3 --maxFps 100000 --frontendLazy --benchmark --benchMessages --stopAfterFrames 7000 --automateAll --keepAutomated --gameMode battle --setDungeonRng "SMGen 127 123" --setMainRng "SMGen 127 125"
Session time: 1.424040237s; frames: 7009. Average clips per second: 4766.719242638928. Average FPS: 4921.911486690667.
$(cabal list-bin exe:LambdaHack) --dbgMsgSer --logPriority 4 --newGame 3 --noAnim --maxFps 100000 --benchmark --benchMessages --stopAfterFrames 2000 --automateAll --keepAutomated --gameMode battle --setDungeonRng "SMGen 127 123" --setMainRng "SMGen 127 125"
Session time: 3.882431124s; frames: 2012. Average clips per second: 1706.147459784273. Average FPS: 518.2319880866481.
$(cabal list-bin exe:LambdaHack) --dbgMsgSer --logPriority 4 --newGame 1 --noAnim --maxFps 100000 --frontendNull --benchmark --benchMessages --stopAfterFrames 7000 --automateAll --keepAutomated --gameMode crawl --setDungeonRng "SMGen 123 123" --setMainRng "SMGen 123 125"
Session time: 3.159201467s; frames: 7010. Average clips per second: 7755.440815006434. Average FPS: 2218.915150940578.
$(cabal list-bin exe:LambdaHack) --dbgMsgSer --logPriority 4 --newGame 1 --noAnim --maxFps 100000 --benchmark --benchMessages --stopAfterFrames 7000 --automateAll --keepAutomated --gameMode crawl --setDungeonRng "SMGen 123 123" --setMainRng "SMGen 123 125"
Session time: 12.973855358s; frames: 7010. Average clips per second: 1888.4903002168958. Average FPS: 540.3174158001893.
and then with -threaded:
~/r/LambdaHack$ make bench
$(cabal list-bin exe:LambdaHack) --dbgMsgSer --logPriority 4 --newGame 3 --noAnim --maxFps 100000 --frontendNull --benchmark --benchMessages --stopAfterFrames 1500 --automateAll --keepAutomated --gameMode battle --setDungeonRng "SMGen 127 123" --setMainRng "SMGen 127 125"
Session time: 1.338236622s; frames: 1500. Average clips per second: 4513.402114921348. Average FPS: 1120.87800867252.
$(cabal list-bin exe:LambdaHack) --dbgMsgSer --logPriority 4 --newGame 3 --maxFps 100000 --frontendLazy --benchmark --benchMessages --stopAfterFrames 7000 --automateAll --keepAutomated --gameMode battle --setDungeonRng "SMGen 127 123" --setMainRng "SMGen 127 125"
Session time: 3.543814058s; frames: 7009. Average clips per second: 1915.4503845020865. Average FPS: 1977.8125729191404.
$(cabal list-bin exe:LambdaHack) --dbgMsgSer --logPriority 4 --newGame 3 --noAnim --maxFps 100000 --benchmark --benchMessages --stopAfterFrames 2000 --automateAll --keepAutomated --gameMode battle --setDungeonRng "SMGen 127 123" --setMainRng "SMGen 127 125"
Session time: 4.990851022s; frames: 2012. Average clips per second: 1327.2285569737448. Average FPS: 403.1376595155759.
$(cabal list-bin exe:LambdaHack) --dbgMsgSer --logPriority 4 --newGame 1 --noAnim --maxFps 100000 --frontendNull --benchmark --benchMessages --stopAfterFrames 7000 --automateAll --keepAutomated --gameMode crawl --setDungeonRng "SMGen 123 123" --setMainRng "SMGen 123 125"
Session time: 4.56074518s; frames: 7010. Average clips per second: 5372.1484172022965. Average FPS: 1537.0295255127585.
$(cabal list-bin exe:LambdaHack) --dbgMsgSer --logPriority 4 --newGame 1 --noAnim --maxFps 100000 --benchmark --benchMessages --stopAfterFrames 7000 --automateAll --keepAutomated --gameMode crawl --setDungeonRng "SMGen 123 123" --setMainRng "SMGen 123 125"
Session time: 17.029107743s; frames: 7010. Average clips per second: 1438.7718000123289. Average FPS: 411.6481089786713.
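For reference, the relative slowdown of each benchmark can be computed directly from the session times above. The following is a minimal Python sketch, assuming "slowdown" means the extra wall-clock time of the -threaded run relative to the non-threaded run. The labels are ad hoc names chosen here for readability, not Makefile targets.

```python
# Session times in seconds, copied from the two `make bench` runs above,
# in the order the benchmarks were printed: (without -threaded, with -threaded).
# The labels are ad hoc names for this sketch, not Makefile targets.
session_times = {
    "battle, --frontendNull": (0.927905003, 1.338236622),
    "battle, --frontendLazy": (1.424040237, 3.543814058),
    "battle, no frontend flag": (3.882431124, 4.990851022),
    "crawl, --frontendNull": (3.159201467, 4.56074518),
    "crawl, no frontend flag": (12.973855358, 17.029107743),
}


def slowdown_percent(baseline: float, threaded: float) -> float:
    """Extra wall-clock time of the -threaded run, as a percentage of baseline."""
    return (threaded / baseline - 1.0) * 100.0


slowdowns = {
    name: slowdown_percent(baseline, threaded)
    for name, (baseline, threaded) in session_times.items()
}

for name, pct in slowdowns.items():
    print(f"{name:26s} {pct:6.1f}% slower with -threaded")
```

These are single runs, so the percentages are only as stable as the individual timings.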
Here is some additional info from an investigation by @duog with a month or two older version of the codebase:
- marking all imports in sdl2 and sdl2-ttf as unsafe does not improve the discrepancy
- results of perf stat on binaries WITH unsafe foreign calls in sdl2 and sdl2-ttf
with -threaded:
perf stat -dd make benchFrontendCrawl
$(cabal list-bin exe:LambdaHack) --dbgMsgSer --logPriority 4 --newGame 1 --noAnim --maxFps 100000 --benchmark --benchMessages --stopAfterFrames 7000 --automateAll --keepAutomated --gameMode crawl --frontendNull --setDungeonRng "SMGen 123 123" --setMainRng "SMGen 123 125"
Session time: 7.796642283s; frames: 7005. Average clips per second: 2809.28625489948. Average FPS: 898.463690616393.
Performance counter stats for 'make benchFrontendCrawl':
8,162.78 msec task-clock:u # 0.907 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
53,284 page-faults:u # 6.528 K/sec
23,013,213,417 cycles:u # 2.819 GHz (42.90%)
136,921,195 stalled-cycles-frontend:u # 0.59% frontend cycles idle (43.29%)
448,778,570 stalled-cycles-backend:u # 1.95% backend cycles idle (43.21%)
18,944,979,353 instructions:u # 0.82 insn per cycle
# 0.02 stalled cycles per insn (43.19%)
3,802,875,276 branches:u # 465.880 M/sec (43.12%)
202,717,840 branch-misses:u # 5.33% of all branches (43.10%)
7,928,213,661 L1-dcache-loads:u # 971.264 M/sec (42.98%)
285,102,363 L1-dcache-load-misses:u # 3.60% of all L1-dcache accesses (42.85%)
<not supported> LLC-loads:u
<not supported> LLC-load-misses:u
1,961,103,523 L1-icache-loads:u # 240.249 M/sec (43.03%)
19,840,051 L1-icache-load-misses:u # 1.01% of all L1-icache accesses (42.83%)
55,739,515 dTLB-loads:u # 6.828 M/sec (42.96%)
5,303,284 dTLB-load-misses:u # 9.51% of all dTLB cache accesses (42.99%)
46,210,060 iTLB-loads:u # 5.661 M/sec (42.98%)
4,717,836 iTLB-load-misses:u # 10.21% of all iTLB cache accesses (43.07%)
9.001519486 seconds time elapsed
7.657531000 seconds user
0.748577000 seconds sys
without -threaded:
perf stat -dd make benchFrontendCrawl
$(cabal list-bin exe:LambdaHack) --dbgMsgSer --logPriority 4 --newGame 1 --noAnim --maxFps 100000 --benchmark --benchMessages --stopAfterFrames 7000 --automateAll --keepAutomated --gameMode crawl --frontendNull --setDungeonRng "SMGen 123 123" --setMainRng "SMGen 123 125"
Session time: 5.799896649s; frames: 7005. Average clips per second: 3776.4466033677422. Average FPS: 1207.7801422906011.
Performance counter stats for 'make benchFrontendCrawl':
6,236.18 msec task-clock:u # 0.896 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
53,332 page-faults:u # 8.552 K/sec
18,172,577,519 cycles:u # 2.914 GHz (42.88%)
173,871,640 stalled-cycles-frontend:u # 0.96% frontend cycles idle (42.89%)
384,575,183 stalled-cycles-backend:u # 2.12% backend cycles idle (42.90%)
18,797,086,192 instructions:u # 1.03 insn per cycle
# 0.02 stalled cycles per insn (42.83%)
3,777,918,464 branches:u # 605.807 M/sec (42.79%)
192,408,440 branch-misses:u # 5.09% of all branches (42.79%)
7,620,726,761 L1-dcache-loads:u # 1.222 G/sec (42.95%)
273,000,424 L1-dcache-load-misses:u # 3.58% of all L1-dcache accesses (43.04%)
<not supported> LLC-loads:u
<not supported> LLC-load-misses:u
1,880,798,031 L1-icache-loads:u # 301.595 M/sec (43.08%)
17,176,000 L1-icache-load-misses:u # 0.91% of all L1-icache accesses (43.09%)
54,063,881 dTLB-loads:u # 8.669 M/sec (43.15%)
5,052,959 dTLB-load-misses:u # 9.35% of all dTLB cache accesses (43.13%)
32,697,768 iTLB-loads:u # 5.243 M/sec (43.05%)
2,887,900 iTLB-load-misses:u # 8.83% of all iTLB cache accesses (42.95%)
- going from vanilla to -threaded costs about 20% in instructions per cycle (1.03 vs 0.82 insn per cycle). I think that's quite bad. Unfortunately my AMD CPU doesn't support the LLC-loads (i.e. level-3 cache) counters.
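The 20% figure can be checked against the raw counters in the two perf stat outputs. Here is a small Python sketch of that arithmetic. It uses only the cycles:u and instructions:u values quoted above, and the variable names are illustrative.

```python
# Counter values copied from the two `perf stat -dd` runs above.
# perf multiplexed these counters (each was active only ~43% of the time),
# so the values are scaled estimates rather than exact counts.
threaded = {"cycles": 23_013_213_417, "instructions": 18_944_979_353}
vanilla = {"cycles": 18_172_577_519, "instructions": 18_797_086_192}


def ipc(counters: dict) -> float:
    """Instructions per cycle for one perf stat run."""
    return counters["instructions"] / counters["cycles"]


ipc_threaded = ipc(threaded)
ipc_vanilla = ipc(vanilla)

# Relative loss in instructions per cycle when going from vanilla to -threaded.
ipc_drop = 1.0 - ipc_threaded / ipc_vanilla

# How much more work (instructions) and time (cycles) the -threaded run used.
extra_instructions = threaded["instructions"] / vanilla["instructions"] - 1.0
extra_cycles = threaded["cycles"] / vanilla["cycles"] - 1.0

print(f"IPC with -threaded:    {ipc_threaded:.2f}")
print(f"IPC without -threaded: {ipc_vanilla:.2f}")
print(f"IPC drop:              {ipc_drop:.1%}")
print(f"extra instructions:    {extra_instructions:.1%}")
print(f"extra cycles:          {extra_cycles:.1%}")
```

Comparing the two ratios separates "more instructions executed" from "the same instructions taking more cycles". In these counters the instruction counts differ by less than 1%, so nearly all of the difference is in cycles.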