- 07 Apr, 2015 1 commit
-
-
Simon Marlow authored
Summary: Hooks rely on static linking semantics, and are broken by -Bsymbolic which we need when using dynamic linking. Test Plan: Built it Reviewers: austin, hvr, tibbe Differential Revision: https://phabricator.haskell.org/D8
-
- 29 Sep, 2014 1 commit
-
-
Simon Marlow authored
This reverts commit 39b5c1cb.
-
- 28 Jul, 2014 1 commit
-
-
Austin Seipp authored
This will hopefully help ensure some basic consistency in the forward by overriding buffer variables. In particular, it sets the wrap length, the offset to 4, and turns off tabs. Signed-off-by:
Austin Seipp <austin@well-typed.com>
-
- 10 Jul, 2014 1 commit
-
-
brbr authored
Summary: Avoid unnecessary clock_gettime() syscalls in GC stats. Test Plan: Use strace. Reviewers: simonmar, austin Reviewed By: simonmar, austin Subscribers: simonmar, relrod, carter Differential Revision: https://phabricator.haskell.org/D39
-
- 03 Oct, 2013 1 commit
-
-
Simon Peyton Jones authored
c.f. commit 0b0fec53
-
- 01 Oct, 2013 2 commits
-
-
Simon Marlow authored
-
Simon Marlow authored
volatile StgWord8 is not guaranteed to be atomic.
-
- 21 Jun, 2013 1 commit
-
-
thoughtpolice authored
rtsBool is defined to only have two inhabitants, which are true (1) and false (0) But the wakeup flag is set to 4 possible values, outside the range of rtsBool. This leads Clang to warn about tautological comparisons. Signed-off-by:
Austin Seipp <aseipp@pobox.com>
-
- 14 Feb, 2013 1 commit
-
-
Simon Marlow authored
We were doing it in two different ways and asserting that the results were the same. In most cases they were, but I found one case where they weren't: the GC itself allocates some memory for running finalizers, and this memory was accounted for one way but not the other. It was simpler to remove the old way of counting allocation that to try to fix it up, so I did that.
-
- 17 Jan, 2013 1 commit
-
-
Simon Marlow authored
Reordering of includes in GC.c broke on OS X because gctKey is declared in Task.h and is needed in the storage manager. This is really the wrong place for it anyway, so I've moved the gctKey pieces to where they should be.
-
- 07 Sep, 2012 1 commit
-
-
Simon Marlow authored
lnat was originally "long unsigned int" but we were using it when we wanted a 64-bit type on a 64-bit machine. This broke on Windows x64, where long == int == 32 bits. Using types of unspecified size is bad, but what we really wanted was a type with N bits on an N-bit machine. StgWord is exactly that. lnat was mentioned in some APIs that clients might be using (e.g. StackOverflowHook()), so we leave it defined but with a comment to say that it's deprecated.
-
- 10 Jul, 2012 1 commit
-
-
Simon Marlow authored
The clearNurseries() operation resets the free pointer in each nursery block to the start of the block, emptying the nursery. In the parallel GC this was done on the main GC thread, but that's bad because it accesses the bdescr of every nursery block, and move all those cache lines onto the CPU of the main GC thread. With large nurseries, this can be especially bad. So instead we want to clear each nursery in its local GC thread. Thanks to Andreas Voellmy <andreas.voellmy@gmail.com> for idenitfying the issue. After this change and the previous patch to make the last GC a major one, I see these results for nofib/parallel on 8 cores: blackscholes +0.0% +0.0% -3.7% -3.3% +0.3% coins +0.0% +0.0% -5.1% -5.0% +0.4% gray +0.0% +0.0% -4.5% -2.1% +0.8% mandel +0.0% -0.0% -7.6% -5.1% -2.3% matmult +0.0% +5.5% -2.8% -1.9% -5.8% minimax +0.0% +0.0% -10.6% -10.5% +0.0% nbody +0.0% -4.4% +0.0% 0.07 +0.0% parfib +0.0% +1.0% +0.5% +0.9% +0.0% partree +0.0% +0.0% -2.4% -2.5% +1.7% prsa +0.0% -0.2% +1.8% +4.2% +0.0% queens +0.0% -0.0% -1.8% -1.4% -4.8% ray +0.0% -0.6% -18.5% -17.8% +0.0% sumeuler +0.0% -0.0% -3.7% -3.7% +0.0% transclos +0.0% -0.0% -25.7% -26.6% +0.0% -------------------------------------------------------------------------------- Min +0.0% -4.4% -25.7% -26.6% -5.8% Max +0.0% +5.5% +1.8% +4.2% +1.7% Geometric Mean +0.0% +0.1% -6.3% -6.1% -0.7%
-
- 13 Dec, 2011 1 commit
-
-
Simon Marlow authored
This is an experimental tweak to the parallel GC that avoids waking up a Capability to do parallel GC if we know that the capability has been idle for a (tunable) number of GC cycles. The idea is that if you're only using a few Capabilities, there's no point waking up the ones that aren't busy. e.g. +RTS -qi3 says "A Capability will participate in parallel GC if it was running at all since the last 3 GC cycles." Results are a bit hit and miss, and I don't completely understand why yet. Hence, for now it is turned off by default, and also not documented except in the +RTS -? output.
-
- 25 Nov, 2011 1 commit
-
-
Simon Marlow authored
Terminology cleanup: the type "Ticks" has been renamed "Time", which is an StgWord64 in units of TIME_RESOLUTION (currently nanoseconds). The terminology "tick" is now used consistently to mean the interval between timer signals. The ticker now always ticks in realtime (actually CLOCK_MONOTONIC if we have it). Before it used CPU time in the non-threaded RTS and realtime in the threaded RTS, but I've discovered that the CPU timer has terrible resolution (at least on Linux) and isn't much use for profiling. So now we always use realtime. This should also fix The default tick interval is now 10ms, except when profiling where we drop it to 1ms. This gives more accurate profiles without affecting runtime too much (<1%). Lots of cleanups - the resolution of Time is now in one place only (Rts.h) rather than having calculations that depend on the resolution scattered all over the RTS. I hope I found them all.
-
- 11 Apr, 2011 1 commit
-
-
Simon Marlow authored
This is a port of some of the changes from my private local-GC branch (which is still in darcs, I haven't converted it to git yet). There are a couple of small functional differences in the GC stats: first, per-thread GC timings should now be more accurate, and secondly we now report average and maximum pause times. e.g. from minimax +RTS -N8 -s: Tot time (elapsed) Avg pause Max pause Gen 0 2755 colls, 2754 par 13.16s 0.93s 0.0003s 0.0150s Gen 1 769 colls, 769 par 3.71s 0.26s 0.0003s 0.0059s
-
- 02 Feb, 2011 1 commit
-
-
Simon Marlow authored
Store the *number* of the destination generation in the Bdescr struct, so that in evacuate() we don't have to deref gen to get it. This is another improvement ported over from my GC branch.
-
- 13 Jul, 2010 1 commit
-
-
Ian Lynagh authored
Which was being used seemed to be random
-
- 17 Jun, 2010 1 commit
-
-
Simon Marlow authored
-
- 03 Dec, 2009 1 commit
-
-
Simon Marlow authored
The GC had a two-level structure, G generations each of T steps. Steps are for aging within a generation, mostly to avoid premature promotion. Measurements show that more than 2 steps is almost never worthwhile, and 1 step is usually worse than 2. In theory fractional steps are possible, so the ideal number of steps is somewhere between 1 and 3. GHC's default has always been 2. We can implement 2 steps quite straightforwardly by having each block point to the generation to which objects in that block should be promoted, so blocks in the nursery point to generation 0, and blocks in gen 0 point to gen 1, and so on. This commit removes the explicit step structures, merging generations with steps, thus simplifying a lot of code. Performance is unaffected. The tunable number of steps is now gone, although it may be replaced in the future by a way to tune the aging in generation 0.
-
- 10 Sep, 2009 1 commit
-
-
Simon Marlow authored
-
- 09 Sep, 2009 1 commit
-
-
Simon Marlow authored
-
- 05 Aug, 2009 1 commit
-
-
Simon Marlow authored
This has no effect with static libraries, but when the RTS is in a shared library it does two things: - it prevents the function from being exposed by the shared library - internal calls to the function can use the faster non-PLT calls, because the function cannot be overriden at link time.
-
- 02 Aug, 2009 1 commit
-
-
Simon Marlow authored
The first phase of this tidyup is focussed on the header files, and in particular making sure we are exposinng publicly exactly what we need to, and no more. - Rts.h now includes everything that the RTS exposes publicly, rather than a random subset of it. - Most of the public header files have moved into subdirectories, and many of them have been renamed. But clients should not need to include any of the other headers directly, just #include the main public headers: Rts.h, HsFFI.h, RtsAPI.h. - All the headers needed for via-C compilation have moved into the stg subdirectory, which is self-contained. Most of the headers for the rest of the RTS APIs have moved into the rts subdirectory. - I left MachDeps.h where it is, because it is so widely used in Haskell code. - I left a deprecated stub for RtsFlags.h in place. The flag structures are now exposed by Rts.h. - Various internal APIs are no longer exposed by public header files. - Various bits of dead code and declarations have been removed - More gcc warnings are turned on, and the RTS code is more warning-clean. - More source files #include "PosixSource.h", and hence only use standard POSIX (1003.1c-1995) interfaces. There is a lot more tidying up still to do, this is just the first pass. I also intend to standardise the names for external RTS APIs (e.g use the rts_ prefix consistently), and declare the internal APIs as hidden for shared libraries.
-
- 20 Apr, 2009 1 commit
-
-
Ben.Lippmeier@anu.edu.au authored
Can't use windowed regs because the window moves during a function call. Can't use the global regs because they're reserved for other purposes.
-
- 04 Apr, 2009 1 commit
-
-
Ian Lynagh authored
With the On x86, use thread-local storage instead of stealing a reg for gct patch, on Windows and OS X: error: thread-local storage not supported for this target
-
- 03 Apr, 2009 2 commits
-
-
Simon Marlow authored
Benchmarks show that using TLS instead of stealing a register is better by a few percent on x86, due to the lack of registers. This only affects -threaded; without -threaded we're (now) using static storage for the GC data.
-
Simon Marlow authored
-
- 13 Mar, 2009 1 commit
-
-
Simon Marlow authored
New flag: "+RTS -qb" disables load-balancing in the parallel GC (though this is subject to change, I think we will probably want to do something more automatic before releasing this). To get the "PARGC3" configuration described in the "Runtime support for Multicore Haskell" paper, use "+RTS -qg0 -qb -RTS". The main advantage of this is that it allows us to easily disable load-balancing altogether, which turns out to be important in parallel programs. Maintaining locality is sometimes more important that spreading the work out in parallel GC. There is a side benefit in that the parallel GC should have improved locality even when load-balancing, because each processor prefers to take work from its own queue before stealing from others.
-
- 12 Jan, 2009 1 commit
-
-
Simon Marlow authored
This turns out to be quite vital for parallel programs: - The way we discover which threads to traverse is by finding dirty threads via the remembered sets (aka mutable lists). - A dirty thread will be on the remembered set of the capability that was running it, and we really want to traverse that thread's stack using the GC thread for the capability, because it is in that CPU's cache. If we get this wrong, we get penalised badly by the memory system. Previously we had per-capability mutable lists but they were aggregated before GC and traversed by just one of the GC threads. This resulted in very poor performance particularly for parallel programs with deep stacks. Now we keep per-capability remembered sets throughout GC, which also removes a lock (recordMutableGen_sync).
-
- 05 Jan, 2009 1 commit
-
-
Ben.Lippmeier@anu.edu.au authored
This makes the build work again.
-
- 21 Nov, 2008 1 commit
-
-
Simon Marlow authored
Previously, the GC had its own pool of threads to use as workers when doing parallel GC. There was a "leader", which was the mutator thread that initiated the GC, and the other threads were taken from the pool. This was simple and worked fine for sequential programs, where we did most of the benchmarking for the parallel GC, but falls down for parallel programs. When we have N mutator threads and N cores, at GC time we would have to stop N-1 mutator threads and start up N-1 GC threads, and hope that the OS schedules them all onto separate cores. It practice it doesn't, as you might expect. Now we use the mutator threads to do GC. This works quite nicely, particularly for parallel programs, where each mutator thread scans its own spark pool, which is probably in its cache anyway. There are some flag changes: -g<n> is removed (-g1 is still accepted for backwards compat). There's no way to have a different number of GC threads than mutator threads now. -q1 Use one OS thread for GC (turns off parallel GC) -qg<n> Use parallel GC for generations >= <n> (default: 1) Using parallel GC only for generations >=1 works well for sequential programs. Compiling an ordinary sequential program with -threaded and running it with -N2 or more should help if you do a lot of GC. I've found that adding -qg0 (do parallel GC for generation 0 too) speeds up some parallel programs, but slows down some sequential programs. Being conservative, I left the threshold at 1. ToDo: document the new options.
-
- 25 Jul, 2008 1 commit
-
-
Simon Marlow authored
-
- 03 Jun, 2008 1 commit
-
-
Simon Marlow authored
-
- 17 Apr, 2008 1 commit
-
-
simonmarhaskell@gmail.com authored
-
- 16 Apr, 2008 6 commits
-
-
simonmarhaskell@gmail.com authored
-
simonmarhaskell@gmail.com authored
-
simonmarhaskell@gmail.com authored
- GCAux.c contains code not compiled with the gct register enabled, it is callable from outside the GC - marking functions are moved to their relevant subsystems, outside the GC - mark_root needs to save the gct register, as it is called from outside the GC
-
simonmarhaskell@gmail.com authored
-
simonmarhaskell@gmail.com authored
- count and report number of parallel collections - calculate bytes scanned in addition to bytes copied per thread - calculate "work balance factor" - tidy up the formatting a bit
-
simonmarhaskell@gmail.com authored
This means we can calculate slop easily, and also improve predictability of GC.
-