- 08 Sep, 2013 2 commits
-
-
thoughtpolice authored
This reverts commit d85044f6.
-
thoughtpolice authored
When servicing a stack overflow, only throw an exception to the given thread if the user explicitly set a max stack size, using +RTS -K. Otherwise just service it normally and grow the stack. In case we actually run out of *heap* (stack chunks are allocated on the heap), then we need to bail by calling the stackOverflow() hook and exiting immediately. Authored-by:
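Below is a minimal sketch of the decision described above, using toy names standing in for the RTS's own types and hooks (the real logic lives in the scheduler; everything here is illustrative only):

    /* Illustrative sketch only: toy stand-ins for the RTS's types and hooks. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { size_t stack_size_w; } Thread;   /* stand-in for a TSO */

    static size_t max_stack_size_w = 0;   /* 0 means: no +RTS -K limit was given */

    static bool heapExhausted(void)           { return false; }  /* stack chunks live on the heap */
    static void throwStackOverflow(Thread *t) { (void)t; puts("exception thrown to thread"); }
    static void growStack(Thread *t)          { t->stack_size_w *= 2; puts("stack grown"); }

    static void serviceStackOverflow(Thread *t)
    {
        if (heapExhausted()) {
            /* Out of *heap* while allocating a stack chunk: run the
             * stack-overflow hook and bail out immediately. */
            fputs("stack overflow hook\n", stderr);
            exit(2);
        } else if (max_stack_size_w != 0 && t->stack_size_w >= max_stack_size_w) {
            /* The user asked for a hard limit with +RTS -K<size>:
             * deliver an exception to the offending thread. */
            throwStackOverflow(t);
        } else {
            /* No explicit limit: just grow the stack and keep running. */
            growStack(t);
        }
    }

    int main(void)
    {
        Thread t = { .stack_size_w = 1024 };
        serviceStackOverflow(&t);
        return 0;
    }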
Ben Gamari <bgamari.foss@gmail.com> Signed-off-by:
Austin Seipp <aseipp@pobox.com>
-
- 04 Sep, 2013 1 commit
-
-
Simon Marlow authored
We have various problems with reallocating the array of Capabilities, due to threads in waitForReturnCapability that are already holding a pointer to a Capability. Rather than add more locking to make this safer, I decided it would be easier to ensure that we never move the Capabilities at all. The capabilities array is now an array of pointers to Capability. There are extra indirections, but it rarely matters - we don't often access Capabilities via the array, normally we already have a pointer to one. I ran the parallel benchmarks and didn't see any difference.
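A sketch of the data-structure change, with toy types standing in for the real Capability: growing a contiguous array can move every element and invalidate pointers already held by waiting threads, whereas growing a table of pointers leaves each Capability at a stable address.

    #include <stdlib.h>

    typedef struct { unsigned no; /* ... run queue, nursery, and so on ... */ } Capability;

    /* Before (schematically): one contiguous array.  Growing it with realloc
     * may move every Capability, breaking any Capability* a thread holds. */
    /*   Capability *capabilities; */

    /* After: a table of pointers.  Growing the table moves only the pointers;
     * each Capability itself never moves. */
    static Capability **capabilities;
    static unsigned n_capabilities;

    static void moreCapabilities(unsigned to)
    {
        Capability **tab = realloc(capabilities, to * sizeof(Capability *));
        if (tab == NULL) abort();
        capabilities = tab;
        for (unsigned i = n_capabilities; i < to; i++) {
            capabilities[i] = calloc(1, sizeof(Capability));
            if (capabilities[i] == NULL) abort();
            capabilities[i]->no = i;
        }
        n_capabilities = to;
    }

    int main(void)
    {
        moreCapabilities(2);
        Capability *held = capabilities[0];   /* e.g. held across waitForReturnCapability */
        moreCapabilities(4);                  /* growing no longer invalidates 'held' */
        return held == capabilities[0] ? 0 : 1;
    }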
-
- 15 Jun, 2013 1 commit
-
-
aljee@hyper.cx authored
-
- 08 Jun, 2013 1 commit
-
-
ian@well-typed.com authored
Based on a patch from Stephen Blackheath.
-
- 29 Mar, 2013 1 commit
-
-
nfrisby authored
* the new StgCmmArgRep module breaks a dependency cycle; I also untabified it, but made no real changes
* updated the documentation in the wiki and changed the user guide to point there
* moved the allocation enters for ticky and CCS to after the heap check
  * I left LDV where it was, which was before the heap check at least once, since I have no idea what it is
* standardized all (active?) ticky alloc totals to bytes
  * in order to avoid double counting, StgCmmLayout.adjustHpBackwards no longer bumps ALLOC_HEAP_ctr
* I resurrected the SLOW_CALL counters
  * the new module StgCmmArgRep breaks the cyclic dependency between Layout and Ticky (which the SLOW_CALL counters cause)
  * renamed them SLOW_CALL_fast_<pattern> and VERY_SLOW_CALL
* added ALLOC_RTS_ctr and _tot ticky counters
  * eg allocation by Storage.c:allocate or a BUILD_PAP in stg_ap_*_info
* resurrected ticky counters for ALLOC_THK, ALLOC_PAP, and ALLOC_PRIM
  * added -ticky and -DTICKY_TICKY in ways.mk for debug ways
* added a ticky counter for total LNE entries
* new flags for ticky: -ticky-allocd -ticky-dyn-thunk -ticky-LNE
  * all off by default
  * -ticky-allocd: tracks allocation *of* a closure in addition to allocation *by* that closure
  * -ticky-dyn-thunk: tracks dynamic thunks as if they were functions
  * -ticky-LNE: tracks LNEs as if they were functions
* updated the ticky report format, including making the argument categories (more?) accurate again
* the printed name for things in the report includes the unique of their ticky parent as well as whether they are not top-level
-
- 14 Feb, 2013 1 commit
-
-
Simon Marlow authored
We were doing it in two different ways and asserting that the results were the same. In most cases they were, but I found one case where they weren't: the GC itself allocates some memory for running finalizers, and this memory was accounted for one way but not the other. It was simpler to remove the old way of counting allocation than to try to fix it up, so I did that.
-
- 17 Jan, 2013 1 commit
-
-
Simon Marlow authored
Reordering of includes in GC.c broke on OS X because gctKey is declared in Task.h and is needed in the storage manager. This is really the wrong place for it anyway, so I've moved the gctKey pieces to where they should be.
-
- 16 Nov, 2012 1 commit
-
-
Simon Marlow authored
This improves GC performance when there are a lot of TVars in the heap. For instance, a TChan with a lot of elements causes a massive GC drag without this patch. There's more to do - several other STM closure types don't have write barriers, so GC performance when there are a lot of threads blocked on STM isn't great. But fixing the problem for TVar is a good start.
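A toy sketch of a generational write barrier of the kind described, with hypothetical names (the real barrier lives in the RTS and works on actual TVar closures): an old-generation TVar is put on the mutable list the first time it is written, so a minor GC scans only dirty TVars instead of the whole heap.

    #include <stdbool.h>
    #include <stddef.h>

    /* Toy model of a mutable STM cell (e.g. a TVar) and of the per-generation
     * mutable list that the write barrier feeds. */
    typedef struct Cell { unsigned gen; bool dirty; struct Cell *mut_link; } Cell;

    static Cell *mut_list;   /* old objects that may now point into the young generation */

    static void recordMutable(Cell *c) { c->mut_link = mut_list; mut_list = c; }

    /* Write barrier: the first time an old-generation cell is written, mark it
     * dirty and put it on the mutable list.  A minor GC then scans only these
     * dirty cells. */
    static void dirtyCell(Cell *c)
    {
        if (c->gen > 0 && !c->dirty) {
            c->dirty = true;
            recordMutable(c);
        }
    }

    int main(void)
    {
        Cell tvar = { .gen = 1, .dirty = false, .mut_link = NULL };
        dirtyCell(&tvar);                 /* e.g. triggered by a writeTVar */
        return mut_list == &tvar ? 0 : 1;
    }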
-
- 21 Sep, 2012 3 commits
-
-
Simon Marlow authored
This broke with the changes to the pinned object handling in 67f4ab7e.
-
Simon Marlow authored
The program in #7257 was spending 90% of its time counting the live data in gen->large_objects. We already avoid doing this for small objects, but in this example the old generation was full of large objects (actually pinned ByteStrings).
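A sketch of the general optimisation pattern, with hypothetical names: rather than walking gen->large_objects at every GC to total up live data, keep a running counter that is updated as large objects come and go.

    #include <stddef.h>

    typedef struct LargeObject { size_t words; struct LargeObject *next; } LargeObject;

    typedef struct {
        LargeObject *large_objects;
        size_t       n_large_words;   /* kept up to date incrementally */
    } Generation;

    static void addLargeObject(Generation *g, LargeObject *o)
    {
        o->next = g->large_objects;
        g->large_objects = o;
        g->n_large_words += o->words;   /* pay the cost once, at allocation time */
    }

    /* The live-data figure now reads a counter instead of walking the
     * (possibly huge) large_objects list on every collection. */
    static size_t liveLargeWords(const Generation *g) { return g->n_large_words; }

    int main(void)
    {
        Generation g = { NULL, 0 };
        LargeObject a = { 4096, NULL }, b = { 8192, NULL };
        addLargeObject(&g, &a);
        addLargeObject(&g, &b);
        return liveLargeWords(&g) == 12288 ? 0 : 1;
    }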
-
Simon Marlow authored
Forcing large allocations here can create serious fragmentation in some cases, and since the large allocations are only a small optimisation we should allow the nursery to hoover up small blocks before allocating large chunks.
-
- 07 Sep, 2012 3 commits
-
-
Simon Marlow authored
-
Simon Marlow authored
lnat was originally "long unsigned int" but we were using it when we wanted a 64-bit type on a 64-bit machine. This broke on Windows x64, where long == int == 32 bits. Using types of unspecified size is bad, but what we really wanted was a type with N bits on an N-bit machine. StgWord is exactly that. lnat was mentioned in some APIs that clients might be using (e.g. StackOverflowHook()), so we leave it defined but with a comment to say that it's deprecated.
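A rough sketch of the idea, not the RTS's actual definitions: a machine-word-sized type replaces the unsigned-long-based lnat, which is kept only as a deprecated alias.

    #include <stdint.h>

    /* A machine-word-sized unsigned type: 32 bits on a 32-bit platform, 64 bits
     * on a 64-bit one -- unlike "unsigned long", which stays 32 bits on Win64. */
    typedef uintptr_t StgWord;   /* the real definition differs; this is a sketch */

    /* Deprecated: kept only so that existing client code mentioning lnat (for
     * example in old hook signatures) keeps compiling.  New code uses StgWord. */
    typedef StgWord lnat;

    int main(void)
    {
        /* On mainstream platforms StgWord matches the pointer width. */
        return sizeof(lnat) == sizeof(void *) ? 0 : 1;
    }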
-
Simon Marlow authored
-
- 21 Aug, 2012 2 commits
-
-
Simon Marlow authored
-
Simon Marlow authored
The calculation should be done in one place, of course.
-
- 10 Jul, 2012 1 commit
-
-
Simon Marlow authored
The clearNurseries() operation resets the free pointer in each nursery block to the start of the block, emptying the nursery. In the parallel GC this was done on the main GC thread, but that's bad because it accesses the bdescr of every nursery block, moving all those cache lines onto the CPU of the main GC thread. With large nurseries, this can be especially bad. So instead we want to clear each nursery in its local GC thread. Thanks to Andreas Voellmy <andreas.voellmy@gmail.com> for identifying the issue. After this change and the previous patch to make the last GC a major one, I see these results for nofib/parallel on 8 cores:

    blackscholes     +0.0%   +0.0%   -3.7%   -3.3%   +0.3%
    coins            +0.0%   +0.0%   -5.1%   -5.0%   +0.4%
    gray             +0.0%   +0.0%   -4.5%   -2.1%   +0.8%
    mandel           +0.0%   -0.0%   -7.6%   -5.1%   -2.3%
    matmult          +0.0%   +5.5%   -2.8%   -1.9%   -5.8%
    minimax          +0.0%   +0.0%  -10.6%  -10.5%   +0.0%
    nbody            +0.0%   -4.4%   +0.0%    0.07   +0.0%
    parfib           +0.0%   +1.0%   +0.5%   +0.9%   +0.0%
    partree          +0.0%   +0.0%   -2.4%   -2.5%   +1.7%
    prsa             +0.0%   -0.2%   +1.8%   +4.2%   +0.0%
    queens           +0.0%   -0.0%   -1.8%   -1.4%   -4.8%
    ray              +0.0%   -0.6%  -18.5%  -17.8%   +0.0%
    sumeuler         +0.0%   -0.0%   -3.7%   -3.7%   +0.0%
    transclos        +0.0%   -0.0%  -25.7%  -26.6%   +0.0%
    --------------------------------------------------------------------------------
    Min              +0.0%   -4.4%  -25.7%  -26.6%   -5.8%
    Max              +0.0%   +5.5%   +1.8%   +4.2%   +1.7%
    Geometric Mean   +0.0%   +0.1%   -6.3%   -6.1%   -0.7%
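A toy sketch of the change (hypothetical types and function names): each GC worker resets the free pointers of its own nursery's blocks, so the block descriptors' cache lines stay on that worker's CPU.

    #include <stddef.h>

    typedef struct bdescr { char *start; char *free; struct bdescr *link; } bdescr;
    typedef struct { bdescr *blocks; } Nursery;

    /* Reset every block in one nursery so that the nursery appears empty again. */
    static void clearNursery(Nursery *n)
    {
        for (bdescr *bd = n->blocks; bd != NULL; bd = bd->link) {
            bd->free = bd->start;    /* touches this block's descriptor */
        }
    }

    /* Before: the main GC thread cleared all nurseries, dragging every bdescr's
     * cache line onto one CPU.  After: each GC worker clears only its own
     * nursery at the end of GC, from code running on that worker's CPU. */
    static void gcWorkerClearsOwnNursery(Nursery *my_nursery)
    {
        clearNursery(my_nursery);
    }

    int main(void)
    {
        char words[64];
        bdescr bd = { .start = words, .free = words + 32, .link = NULL };
        Nursery n = { .blocks = &bd };
        gcWorkerClearsOwnNursery(&n);
        return bd.free == bd.start ? 0 : 1;
    }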
-
- 30 Apr, 2012 1 commit
-
-
Ian Lynagh authored
It was causing assertion failures of ASSERT(countBlocks(nursery->blocks) == nursery->n_blocks) at ghc-stage2: internal error: ASSERTION FAILED: file rts/sm/Sanity.c, line 878
-
- 04 Apr, 2012 3 commits
-
-
Duncan Coutts authored
In stat_exit we want to emit a final EVENT_HEAP_ALLOCATED for each cap so that we get the same total allocation count as reported via +RTS -s. To do so we need to update the per-cap total_allocated counts. Previously we had a single calcAllocated(rtsBool) function that counted the large allocations and optionally the nurseries for all caps. The GC would always call it with false, and stat_exit always with true. The reason for these two modes is that the GC counts the nurseries via clearNurseries() (which also updates the per-cap total_allocated counts), so it's only the stat_exit() path that needs to count them. We now split the calcAllocated() function into two: countLargeAllocated and updateNurseriesStats. As the name suggests, the latter now updates the per-cap total_allocated counts, in addition to returning a total.
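A toy sketch of the split described above; the two function names come from the commit message, but the types and bodies here are illustrative only.

    #include <stddef.h>

    typedef struct { size_t total_allocated, n_large_alloc, nursery_filled; } Capability;

    /* Count words allocated as large objects since the last census. */
    static size_t countLargeAllocated(Capability *caps, unsigned n_caps)
    {
        size_t sum = 0;
        for (unsigned i = 0; i < n_caps; i++) sum += caps[i].n_large_alloc;
        return sum;
    }

    /* Count words allocated in the nurseries *and* fold them into each
     * capability's running total_allocated, returning the grand total.
     * Only the stat_exit() path needs this; the GC already accounts for
     * nurseries via clearNurseries(). */
    static size_t updateNurseriesStats(Capability *caps, unsigned n_caps)
    {
        size_t sum = 0;
        for (unsigned i = 0; i < n_caps; i++) {
            caps[i].total_allocated += caps[i].nursery_filled;
            sum += caps[i].nursery_filled;
        }
        return sum;
    }

    int main(void)
    {
        Capability caps[2] = { { .n_large_alloc = 10, .nursery_filled = 100 },
                               { .n_large_alloc = 20, .nursery_filled = 200 } };
        return (countLargeAllocated(caps, 2) + updateNurseriesStats(caps, 2)) == 330 ? 0 : 1;
    }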
-
Duncan Coutts authored
They cover much the same info as is available via the GHC.Stats module or via the '+RTS -s' textual output, but via the eventlog and with a better sampling frequency. We have three new generic heap info events and two very GHC-specific ones. (The hope is the general ones are usable by other implementations that use the same eventlog system, or indeed not so sensitive to changes in GHC itself.) The general ones are:
* total heap mem allocated since prog start, on a per-HEC basis
* current size of the heap (MBlocks reserved from OS for the heap)
* current size of live data in the heap
Currently these are all emitted by GHC at GC time (live data only at major GC). The GHC-specific ones are:
* an event giving various static heap parameters:
  * number of generations (usually 2)
  * max size, if any
  * nursery size
  * MBlock and block sizes
* an event emitted on each GC, containing:
  * GC generation (usually just 0,1)
  * total bytes copied
  * bytes lost to heap slop and fragmentation
  * the number of threads in the parallel GC (1 for serial)
  * the maximum number of bytes copied by any par GC thread
  * the total number of bytes copied by all par GC threads
(these last three can be used to calculate an estimate of the work balance in parallel GCs)
-
Duncan Coutts authored
In addition to the existing global method. For now we just do it both ways and assert they give the same grand total. At some stage we can simplify the global method to just take the sum of the per-cap counters.
-
- 27 Feb, 2012 2 commits
-
-
Gabor Greif authored
-
Gabor Greif authored
-
- 13 Feb, 2012 1 commit
-
-
Simon Marlow authored
allocator. Prompted by a benchmark posted to parallel-haskell@haskell.org by Andreas Voellmy <andreas.voellmy@gmail.com>. This program exhibits contention for the block allocator when run with -N2 and greater without the fix:

    {-# LANGUAGE MagicHash, UnboxedTuples, BangPatterns #-}
    module Main where

    import Control.Monad
    import Control.Concurrent
    import System.Environment
    import GHC.IO
    import GHC.Exts
    import GHC.Conc

    main = do
      [m] <- fmap (fmap read) getArgs
      n <- getNumCapabilities
      ms <- replicateM n newEmptyMVar
      sequence [ forkIO $ busyWorkerB (m `quot` n) >> putMVar mv () | mv <- ms ]
      mapM takeMVar ms

    busyWorkerB :: Int -> IO ()
    busyWorkerB n_loops = go 0
      where
        go !n | n >= n_loops = return ()
              | otherwise    = do
                  p <- (IO $ \s ->
                          case newPinnedByteArray# 1024# s of
                            { (# s', mbarr# #) -> (# s', () #) })
                  go (n+1)
-
- 13 Dec, 2011 2 commits
-
-
Simon Marlow authored
-
Simon Marlow authored
-
- 06 Dec, 2011 1 commit
-
-
Simon Marlow authored
At present the number of capabilities can only be *increased*, not decreased. The latter presents a few more challenges!
-
- 29 Nov, 2011 1 commit
-
-
Simon Marlow authored
This means that both time and heap profiling work for parallel programs. Main internal changes:
- CCCS is no longer a global variable; it is now another pseudo-register in the StgRegTable struct. Thus every Capability has its own CCCS.
- There is a new built-in CCS called "IDLE", which records ticks for Capabilities in the idle state. If you profile a single-threaded program with +RTS -N2, you'll see about 50% of time in "IDLE".
- There is appropriate locking in rts/Profiling.c to protect the shared cost-centre-stack data structures.
This patch does enough to get it working, but I have cut one big corner: the cost-centre-stack data structure is still shared amongst all Capabilities, which means that multiple Capabilities will race when updating the "allocations" and "entries" fields of a CCS. Not only does this give unpredictable results, but it runs very slowly due to cache line bouncing. It is strongly recommended that you use -fno-prof-count-entries to disable the "entries" count when profiling parallel programs. (I shall add a note to this effect to the docs.)
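A structural sketch, with placeholder fields only, of moving CCCS from a global variable into the per-capability register table:

    /* Before (schematically): one global current cost-centre stack, raced on
     * by every capability:
     *
     *     CostCentreStack *CCCS;
     */

    typedef struct CostCentreStack CostCentreStack;

    typedef struct {
        /* ... Hp, HpLim, Sp and the other STG "registers" ... */
        CostCentreStack *rCCCS;   /* now a pseudo-register: one per Capability */
    } StgRegTable;

    typedef struct {
        StgRegTable r;
        /* ... run queue, mutable list, and so on ... */
    } Capability;

    int main(void)
    {
        Capability cap = {0};
        return cap.r.rCCCS == 0 ? 0 : 1;   /* each capability carries its own CCCS */
    }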
-
- 02 Nov, 2011 1 commit
-
-
Simon Marlow authored
User visible changes
====================

Profiling
---------

Flags renamed (the old ones are still accepted for now):

    OLD         NEW
    ---------   ---------------
    -auto-all   -fprof-auto
    -auto       -fprof-exported
    -caf-all    -fprof-cafs

New flags:

    -fprof-auto              Annotates all bindings (not just top-level ones) with SCCs
    -fprof-top               Annotates just top-level bindings with SCCs
    -fprof-exported          Annotates just exported bindings with SCCs
    -fprof-no-count-entries  Do not maintain entry counts when profiling
                             (can make profiled code go faster; useful with
                             heap profiling where entry counts are not used)

Cost-centre stacks have a new semantics, which should in most cases result in more useful and intuitive profiles. If you find this not to be the case, please let me know. This is the area where I have been experimenting most, and the current solution is probably not the final version, however it does address all the outstanding bugs and seems to be better than GHC 7.2.

Stack traces
------------

+RTS -xc now gives more information. If the exception originates from a CAF (as is common, because GHC tends to lift exceptions out to the top-level), then the RTS walks up the stack and reports the stack in the enclosing update frame(s). Result: +RTS -xc is much more useful now - but you still have to compile for profiling to get it. I've played around a little with adding 'head []' to GHC itself, and +RTS -xc does pinpoint the problem quite accurately. I plan to add more facilities for stack tracing (e.g. in GHCi) in the future.

Coverage (HPC)
--------------

* derived instances are now coloured yellow if they weren't used
* likewise record field names
* entry counts are more accurate (hpc --fun-entry-count)
* tab width is now correct (markup was previously off in source with tabs)

Internal changes
================

In Core, the Note constructor has been replaced by

    Tick (Tickish b) (Expr b)

which is used to represent all the kinds of source annotation we support: profiling SCCs, HPC ticks, and GHCi breakpoints. Depending on the properties of the Tickish, different transformations apply to Tick. See CoreUtils.mkTick for details.

Tickets
=======

This commit closes the following tickets, test cases to follow:
- Close #2552: not a bug, but the behaviour is now more intuitive (test is T2552)
- Close #680 (test is T680)
- Close #1531 (test is result001)
- Close #949 (test is T949)
- Close #2466: test case has bitrotted (doesn't compile against current version of vector-space package)
-
- 17 Oct, 2011 1 commit
-
-
Simon Marlow authored
See Note [atomic CAFs] in rts/sm/Storage.c
-
- 14 Apr, 2011 1 commit
-
-
Simon Marlow authored
The pinned_object_block is where we allocate small pinned ByteArray# objects. At a GC the pinned_object_block was being treated like other large objects and promoted to the next step/generation, even if it was only partly full. Under some ByteString-heavy workloads this would accumulate on average 2k of slop per GC, and this memory is never released until the ByteArray# objects in the block are freed. So now, we keep allocating into the pinned_object_block until it is completely full, at which point it is handed over to the GC as before. The pinned_object_block might therefore contain objects with a large range of ages, but I don't think this is any worse than the situation before. We still have the fragmentation issue in general, but the new scheme can improve the memory overhead for some workloads dramatically.
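A toy sketch of the allocation path described above (hypothetical names and sizes, not the RTS's real allocatePinned): the current pinned block is retired to the GC only when it can no longer satisfy a request.

    #include <stddef.h>
    #include <stdlib.h>

    enum { BLOCK_WORDS = 512 };

    typedef struct bdescr {
        size_t words[BLOCK_WORDS];
        size_t free;                 /* index of the next free word in this block */
    } bdescr;

    typedef struct { bdescr *pinned_object_block; } Capability;

    static bdescr *allocBlock(void)
    {
        bdescr *bd = calloc(1, sizeof(bdescr));
        if (bd == NULL) abort();
        return bd;
    }

    static void handOverToGC(bdescr *bd)
    {
        (void)bd;   /* from here on the GC treats the block like a large object */
    }

    /* Allocate n words of pinned memory (requests bigger than a block would take
     * a separate large-object path, omitted here).  The current pinned block is
     * retired to the GC only once it cannot satisfy a request, i.e. when it is
     * essentially full -- not at every GC as before. */
    static void *allocatePinnedSketch(Capability *cap, size_t n)
    {
        bdescr *bd = cap->pinned_object_block;
        if (bd == NULL || bd->free + n > BLOCK_WORDS) {
            if (bd != NULL) handOverToGC(bd);
            bd = cap->pinned_object_block = allocBlock();
        }
        void *p = &bd->words[bd->free];
        bd->free += n;
        return p;
    }

    int main(void)
    {
        Capability cap = { NULL };
        return allocatePinnedSketch(&cap, 16) != NULL ? 0 : 1;
    }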
-
- 02 Feb, 2011 3 commits
-
-
Simon Marlow authored
-
Simon Marlow authored
Now we keep any partially-full blocks in the gc_thread[] structs after each GC, rather than moving them to the generation. This should give us slightly better locality (though I wasn't able to measure any difference). Also in this patch: better sanity checking with THREADED.
-
Simon Marlow authored
Now that we use the per-capability mutable lists exclusively.
-
- 21 Dec, 2010 1 commit
-
-
Simon Marlow authored
The allocation stats (+RTS -s etc.) used to count the slop at the end of each nursery block (except the last) as allocated space, now we count the allocated words accurately. This should make allocation figures more predictable, too. This has the side effect of reducing the apparent allocations by a small amount (~1%), so remember to take this into account when looking at nofib results.
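A toy sketch contrasting the two accounting schemes, using hypothetical block descriptors measured in words:

    #include <stddef.h>

    enum { BLOCK_SIZE_W = 1024 };

    typedef struct bdescr { size_t start, free; struct bdescr *link; } bdescr;  /* word offsets */

    /* Old accounting (schematic): every block except the last counted as full,
     * so the slop at the end of each block inflated "bytes allocated". */
    static size_t countAllocatedOld(bdescr *bd)
    {
        size_t w = 0;
        for (; bd != NULL; bd = bd->link)
            w += (bd->link == NULL) ? (bd->free - bd->start) : BLOCK_SIZE_W;
        return w;
    }

    /* New accounting: count exactly the words used in every block. */
    static size_t countAllocatedNew(bdescr *bd)
    {
        size_t w = 0;
        for (; bd != NULL; bd = bd->link)
            w += bd->free - bd->start;
        return w;
    }

    int main(void)
    {
        bdescr b2 = { .start = 0, .free = 100, .link = NULL };
        bdescr b1 = { .start = 0, .free = 900, .link = &b2 };
        return countAllocatedNew(&b1) <= countAllocatedOld(&b1) ? 0 : 1;
    }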
-
- 15 Dec, 2010 1 commit
-
-
Simon Marlow authored
This patch makes two changes to the way stacks are managed:

1. The stack is now stored in a separate object from the TSO. This means that it is easier to replace the stack object for a thread when the stack overflows or underflows; we don't have to leave behind the old TSO as an indirection any more. Consequently, we can remove ThreadRelocated and deRefTSO(), which were a pain. This is obviously the right thing, but the last time I tried to do it it made performance worse. This time I seem to have cracked it.

2. Stacks are now represented as a chain of chunks, rather than a single monolithic object. The big advantage here is that individual chunks are marked clean or dirty according to whether they contain pointers to the young generation, and the GC can avoid traversing clean stack chunks during a young-generation collection. This means that programs with deep stacks will see a big saving in GC overhead when using the default GC settings. A secondary advantage is that there is much less copying involved as the stack grows. Programs that quickly grow a deep stack will see big improvements. In some ways the implementation is simpler, as nothing special needs to be done to reclaim stack as the stack shrinks (the GC just recovers the dead stack chunks). On the other hand, we have to manage stack underflow between chunks, so there's a new stack frame (UNDERFLOW_FRAME), and we now have separate TSO and STACK objects. The total amount of code is probably about the same as before.

There are new RTS flags:

    -ki<size>  Sets the initial thread stack size (default 1k)  Egs: -ki4k -ki2m
    -kc<size>  Sets the stack chunk size (default 32k)
    -kb<size>  Sets the stack chunk buffer size (default 1k)

-ki was previously called just -k, and the old name is still accepted for backwards compatibility. These new options are documented.
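A structural sketch of the chunked-stack representation, with toy types standing in for the real TSO and STACK objects: the TSO points at its current chunk, chunks carry a dirty bit so a minor GC can skip clean ones, and the bottom of each chunk holds an underflow frame leading to the next older chunk.

    #include <stdbool.h>
    #include <stddef.h>

    typedef size_t StgWord;

    typedef struct StgStack {
        StgWord  stack_size;   /* size of this chunk, in words                      */
        bool     dirty;        /* might this chunk point into the young generation? */
        StgWord *sp;           /* current stack pointer within this chunk           */
        /* StgWord stack[];       the chunk's words follow; the bottom frame is an
                                  UNDERFLOW_FRAME leading to the next older chunk   */
    } StgStack;

    typedef struct {
        StgStack *stackobj;    /* the TSO just points at its topmost chunk; a stack
                                  overflow replaces stackobj, not the TSO, so no
                                  ThreadRelocated indirection is needed             */
        /* ... the rest of the TSO ... */
    } TSO;

    /* A young-generation collection can skip chunks that are known clean. */
    static bool gcMustScanChunk(const StgStack *s) { return s->dirty; }

    int main(void)
    {
        StgStack chunk = { .stack_size = 32, .dirty = false, .sp = NULL };
        TSO tso = { .stackobj = &chunk };
        (void)tso;
        return gcMustScanChunk(&chunk) ? 1 : 0;
    }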
-
- 27 Aug, 2010 1 commit
-
-
Simon Marlow authored
I'm not sure if this could lead to a crash or not, but it upsets +RTS -DS. Might be related to #4265.
-
- 28 Jun, 2010 1 commit
-
-
Simon Marlow authored
-
- 04 Jun, 2010 1 commit
-
-
Simon Marlow authored
-