- 03 Apr, 2009 2 commits
-
-
Simon Marlow authored
Benchmarks show that using TLS instead of stealing a register is better by a few percent on x86, due to the lack of registers. This only affects -threaded; without -threaded we're (now) using static storage for the GC data.
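The trade-off can be pictured with a minimal sketch, assuming a cut-down `gc_thread` struct and hypothetical helper names (`gc_thread_enter`, `record_copy`) that are not taken from the RTS. The per-thread GC state pointer lives in C11 thread-local storage rather than in a reserved machine register, so the compiler keeps every general-purpose register available:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for the RTS's per-thread GC state. */
typedef struct gc_thread {
    int    thread_index;
    size_t copied;          /* words copied by this GC thread */
} gc_thread;

/* One pointer per OS thread, held in thread-local storage instead of
 * being pinned in a machine register for the whole of the GC code. */
static _Thread_local gc_thread *gct = NULL;

/* Each GC thread installs its own state once, on entry to the GC. */
static void gc_thread_enter(gc_thread *me) {
    gct = me;
}

/* GC code reaches its state through `gct` without passing it around. */
static void record_copy(size_t words) {
    assert(gct != NULL);
    gct->copied += words;
}
```

In a non-threaded build the same `gct` name could simply refer to a single statically allocated structure, which is the case the message describes as using static storage.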
-
Simon Marlow authored
-
- 13 Mar, 2009 1 commit
-
-
Simon Marlow authored
New flag: "+RTS -qb" disables load-balancing in the parallel GC (though this is subject to change; I think we will probably want to do something more automatic before releasing this). To get the "PARGC3" configuration described in the "Runtime support for Multicore Haskell" paper, use "+RTS -qg0 -qb -RTS". The main advantage of this is that it allows us to easily disable load-balancing altogether, which turns out to be important in parallel programs. Maintaining locality is sometimes more important than spreading the work out in parallel GC. There is a side benefit in that the parallel GC should have improved locality even when load-balancing, because each processor prefers to take work from its own queue before stealing from others.
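The "own queue first, steal only if allowed" policy might look roughly like the sketch below. It is a single-threaded illustration with no synchronisation, and `work_queue`, `push_work`, `get_work`, and the `load_balance` parameter are all hypothetical names standing in for whatever the RTS actually uses; work items are assumed to be non-negative integers so that -1 can signal "nothing available".

```c
#include <assert.h>
#include <stdbool.h>

#define N_GC_THREADS 2
#define QUEUE_CAP    8

/* Hypothetical per-GC-thread work queue (a plain stack for brevity). */
typedef struct {
    int items[QUEUE_CAP];
    int count;
} work_queue;

static work_queue queues[N_GC_THREADS];

/* Add an item to the queue owned by `owner`; false if the queue is full. */
static bool push_work(int owner, int item) {
    work_queue *q = &queues[owner];
    if (q->count == QUEUE_CAP) return false;
    q->items[q->count++] = item;
    return true;
}

/* Return a work item for thread `me`, or -1 if none is available to it.
 * The thread always tries its own queue first. It looks at other queues
 * only when load balancing is enabled, i.e. when the sketch's stand-in
 * for -qb has not been given. */
static int get_work(int me, bool load_balance) {
    if (queues[me].count > 0)
        return queues[me].items[--queues[me].count];
    if (!load_balance)
        return -1;
    for (int i = 1; i < N_GC_THREADS; i++) {
        work_queue *victim = &queues[(me + i) % N_GC_THREADS];
        if (victim->count > 0)
            return victim->items[--victim->count];
    }
    return -1;
}
```

With load balancing off, an idle thread stays idle instead of pulling another CPU's data into its cache, which is the locality argument made above.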
-
- 12 Jan, 2009 1 commit
-
-
Simon Marlow authored
This turns out to be quite vital for parallel programs:
  - The way we discover which threads to traverse is by finding dirty threads via the remembered sets (aka mutable lists).
  - A dirty thread will be on the remembered set of the capability that was running it, and we really want to traverse that thread's stack using the GC thread for the capability, because it is in that CPU's cache. If we get this wrong, we get penalised badly by the memory system.
Previously we had per-capability mutable lists but they were aggregated before GC and traversed by just one of the GC threads. This resulted in very poor performance particularly for parallel programs with deep stacks. Now we keep per-capability remembered sets throughout GC, which also removes a lock (recordMutableGen_sync).
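As a rough sketch of the idea, assuming hypothetical names (`mut_list`, `record_mutable`, `scavenge_own_mut_list`) and fixed-size arrays in place of the RTS's block-based lists: each capability appends to its own remembered set, and during GC the thread for that capability drains only that list. Because a list is only ever touched by its owner, the append needs no lock.

```c
#include <assert.h>
#include <stddef.h>

#define N_CAPS  2
#define MUT_CAP 16

/* Hypothetical per-capability remembered set. */
typedef struct {
    const void *entries[MUT_CAP];
    size_t      n;
} mut_list;

static mut_list cap_mut_lists[N_CAPS];

/* Called only by the mutator running on capability `cap`, so no lock is
 * taken. Returns 0 if the (fixed-size, illustrative) list is full. */
static int record_mutable(int cap, const void *obj) {
    mut_list *ml = &cap_mut_lists[cap];
    if (ml->n == MUT_CAP) return 0;
    ml->entries[ml->n++] = obj;
    return 1;
}

/* Helper used to observe the traversal in this sketch. */
static size_t visits;
static void count_visit(const void *obj) {
    (void)obj;
    visits++;
}

/* The GC thread for capability `cap` traverses only its own list, then
 * empties it. Returns the number of entries visited. */
static size_t scavenge_own_mut_list(int cap, void (*visit)(const void *)) {
    mut_list *ml = &cap_mut_lists[cap];
    size_t visited = ml->n;
    for (size_t i = 0; i < ml->n; i++)
        visit(ml->entries[i]);
    ml->n = 0;
    return visited;
}
```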
-
- 05 Jan, 2009 1 commit
-
-
Ben.Lippmeier@anu.edu.au authored
This makes the build work again.
-
- 21 Nov, 2008 1 commit
-
-
Simon Marlow authored
Previously, the GC had its own pool of threads to use as workers when doing parallel GC. There was a "leader", which was the mutator thread that initiated the GC, and the other threads were taken from the pool.

This was simple and worked fine for sequential programs, where we did most of the benchmarking for the parallel GC, but falls down for parallel programs. When we have N mutator threads and N cores, at GC time we would have to stop N-1 mutator threads and start up N-1 GC threads, and hope that the OS schedules them all onto separate cores. In practice it doesn't, as you might expect.

Now we use the mutator threads to do GC. This works quite nicely, particularly for parallel programs, where each mutator thread scans its own spark pool, which is probably in its cache anyway.

There are some flag changes:

  -g<n> is removed (-g1 is still accepted for backwards compat). There's no way to have a different number of GC threads than mutator threads now.

  -q1       Use one OS thread for GC (turns off parallel GC)
  -qg<n>    Use parallel GC for generations >= <n> (default: 1)

Using parallel GC only for generations >=1 works well for sequential programs. Compiling an ordinary sequential program with -threaded and running it with -N2 or more should help if you do a lot of GC. I've found that adding -qg0 (do parallel GC for generation 0 too) speeds up some parallel programs, but slows down some sequential programs. Being conservative, I left the threshold at 1.

ToDo: document the new options.
-
- 25 Jul, 2008 1 commit
-
-
Simon Marlow authored
-
- 03 Jun, 2008 1 commit
-
-
Simon Marlow authored
-
- 17 Apr, 2008 1 commit
-
-
simonmarhaskell@gmail.com authored
-
- 16 Apr, 2008 13 commits
-
-
simonmarhaskell@gmail.com authored
-
simonmarhaskell@gmail.com authored
-
simonmarhaskell@gmail.com authored
  - GCAux.c contains code not compiled with the gct register enabled; it is callable from outside the GC
  - marking functions are moved to their relevant subsystems, outside the GC
  - mark_root needs to save the gct register, as it is called from outside the GC
-
simonmarhaskell@gmail.com authored
-
simonmarhaskell@gmail.com authored
  - count and report number of parallel collections
  - calculate bytes scanned in addition to bytes copied per thread
  - calculate "work balance factor"
  - tidy up the formatting a bit
-
simonmarhaskell@gmail.com authored
This means we can calculate slop easily, and also improve predictability of GC.
-
simonmarhaskell@gmail.com authored
-
simonmarhaskell@gmail.com authored
-
simonmarhaskell@gmail.com authored
So we can parallelise minor collections too. Sometimes it's worth it.
-
simonmarhaskell@gmail.com authored
-
simonmarhaskell@gmail.com authored
-
simonmarhaskell@gmail.com authored
DEBUG imposes a significant performance hit in the GC, yet we often want some of the debugging output, so -vg gives us the cheap trace messages without the sanity checking of DEBUG, just like -vs for the scheduler.
-
simonmarhaskell@gmail.com authored
-
- 11 Jan, 2008 1 commit
-
-
simonmar@microsoft.com authored
-
- 21 Nov, 2007 1 commit
-
-
Simon Marlow authored
avoids cache contention: bd->todo_bd->free may clash with any cache line, so we localise it.
-
- 31 Oct, 2007 5 commits
-
-
Simon Marlow authored
-
Simon Marlow authored
Some objects don't need to be scavenged, in particular if they have no pointers. This seems like an obvious optimisation, but in fact it only accounts for about 1% of objects (in GHC, for example), and the extra complication means it probably isn't worth doing.
-
Simon Marlow authored
By establishing an ordering on step pointers, we can simplify the test (stp->gen_no < evac_gen) to (stp < evac_step) which is common in evacuate().
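A minimal sketch of why the pointer comparison can replace the generation comparison, assuming (this is not stated in the message) that all steps live in one array ordered by generation and that `evac_step` points at the first step of `evac_gen`. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define N_GENS        2
#define STEPS_PER_GEN 2

/* Hypothetical, cut-down step descriptor. */
typedef struct step {
    int gen_no;
    int step_no;
} step;

/* All steps in one array, ordered by generation and then by step, so
 * that address order agrees with generation order. */
static step all_steps[N_GENS * STEPS_PER_GEN];

static void init_steps(void) {
    for (int g = 0; g < N_GENS; g++) {
        for (int s = 0; s < STEPS_PER_GEN; s++) {
            step *stp = &all_steps[g * STEPS_PER_GEN + s];
            stp->gen_no  = g;
            stp->step_no = s;
        }
    }
}

/* Assumed meaning of evac_step: the first step of generation evac_gen. */
static step *first_step_of_gen(int g) {
    return &all_steps[g * STEPS_PER_GEN];
}

/* The original test: one memory load plus a compare. */
static bool too_young_by_gen(const step *stp, int evac_gen) {
    return stp->gen_no < evac_gen;
}

/* The simplified test: a pointer compare with no load from *stp. */
static bool too_young_by_ptr(const step *stp, const step *evac_step) {
    return stp < evac_step;
}
```

Under those assumptions the two predicates agree for every step and every target generation, and the pointer form avoids dereferencing `stp` on the hot path.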
-
Simon Marlow authored
eg. use +RTS -g2 -RTS for 2 threads. Only major GCs are parallelised, minor GCs are still sequential. Don't use more threads than you have CPUs. It works most of the time, although you won't see much speedup yet. Tuning and more work on stability still required.
-
Simon Marlow authored
This patch localises the state of the GC into a gc_thread structure, and reorganises the inner loop of the GC to scavenge one block at a time from global work lists in each "step". The gc_thread structure has a "workspace" for each step, in which it collects evacuated objects until it has a full block to push out to the step's global list. Details of the algorithm will be on the wiki in due course. At the moment, THREADED_RTS does not compile, but the single-threaded GC works (and is 10-20% slower than before).
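The workspace idea can be sketched as follows, under heavy simplification: blocks hold four words, come from a small static pool, and the names `step_global`, `step_workspace`, `evacuate_word`, and `push_to_step` are hypothetical. A GC thread fills a private block and only touches the step's shared list when that block is full.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_WORDS 4
#define POOL_BLOCKS 8

/* Hypothetical, tiny stand-in for a heap block. */
typedef struct block {
    size_t        words[BLOCK_WORDS];
    size_t        used;
    struct block *link;
} block;

/* Global per-step list of full blocks waiting to be scavenged. */
typedef struct {
    block *todo_head;
    size_t n_blocks;
} step_global;

/* Per-GC-thread, per-step workspace: the block currently being filled. */
typedef struct {
    block *todo_bd;
} step_workspace;

static block  pool[POOL_BLOCKS];
static size_t pool_next;

/* Hand out a fresh, empty block from the illustrative static pool. */
static block *alloc_block(void) {
    assert(pool_next < POOL_BLOCKS);
    block *bd = &pool[pool_next++];
    bd->used = 0;
    bd->link = NULL;
    return bd;
}

/* Publish a full block on the step's global list. A threaded version
 * would need synchronisation here; this sketch is single-threaded. */
static void push_to_step(step_global *stp, block *bd) {
    bd->link = stp->todo_head;
    stp->todo_head = bd;
    stp->n_blocks++;
}

/* Copy one word into the workspace, pushing the block out once full. */
static void evacuate_word(step_global *stp, step_workspace *ws, size_t w) {
    if (ws->todo_bd == NULL)
        ws->todo_bd = alloc_block();
    ws->todo_bd->words[ws->todo_bd->used++] = w;
    if (ws->todo_bd->used == BLOCK_WORDS) {
        push_to_step(stp, ws->todo_bd);
        ws->todo_bd = NULL;
    }
}
```

Batching by whole blocks means the shared list is touched once per block rather than once per object.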
-
- 11 Oct, 2007 1 commit
-
-
Simon Marlow authored
Previously MVars were always on the mutable list of the old generation, which meant every MVar was visited during every minor GC. With lots of MVars hanging around, this gets expensive. We addressed this problem for MUT_VARs (aka IORefs) a while ago; the solution is to use a traditional GC write-barrier when the object is modified. This patch does the same thing for MVars. TVars are still done the old way; they could probably benefit from the same treatment too.
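A traditional write barrier of this kind can be sketched as below. The `mvar` struct, its `dirty` flag, and the functions `write_mvar` and `minor_gc_scan` are hypothetical simplifications, not the RTS's actual representation: the point is only that an old-generation object is added to the mutable list the first time it is written after a GC, and untouched objects are never visited.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MUT_CAP 16

/* Hypothetical, cut-down MVar: a clean/dirty flag plus a payload. */
typedef struct mvar {
    bool dirty;
    int  generation;   /* 0 = young, > 0 = old */
    int  value;
} mvar;

static mvar  *mut_list[MUT_CAP];
static size_t mut_list_len;

/* Write barrier: on the first write since the last GC, mark the MVar
 * dirty and, if it lives in an old generation, remember it. */
static void dirty_mvar(mvar *mv) {
    if (!mv->dirty) {
        mv->dirty = true;
        if (mv->generation > 0) {
            assert(mut_list_len < MUT_CAP);
            mut_list[mut_list_len++] = mv;
        }
    }
}

/* Every mutation goes through the barrier before updating the payload. */
static void write_mvar(mvar *mv, int v) {
    dirty_mvar(mv);
    mv->value = v;
}

/* A minor GC visits only the remembered MVars, marks them clean again,
 * and empties the list. Returns how many it visited. */
static size_t minor_gc_scan(void) {
    size_t visited = mut_list_len;
    for (size_t i = 0; i < mut_list_len; i++)
        mut_list[i]->dirty = false;
    mut_list_len = 0;
    return visited;
}
```

Repeated writes to the same MVar between collections cost one flag check each, and an MVar that is never written costs nothing at minor GC time.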
-
- 26 Oct, 2006 1 commit
-
-
Simon Marlow authored
-
- 24 Oct, 2006 1 commit
-
-
Simon Marlow authored
In preparation for parallel GC, split up the monolithic GC.c file into smaller parts. Also in this patch (and difficult to separate, unfortunately):
  - Don't include Stable.h in Rts.h; instead just include it where necessary.
  - Consistently use STATIC_INLINE in source files, and INLINE_HEADER in header files. STATIC_INLINE is now turned off when DEBUG is on, to make debugging easier.
  - The GC no longer takes the get_roots function as an argument. We weren't making use of this generalisation.
-