      Rename _closure to _static_closure, apply naming consistently. · 35672072
      In preparation for indirecting all references to closures,
      we rename _closure to _static_closure to ensure any old code
      will get an undefined symbol error.  In order to reference
      a closure foobar_closure (which is now undefined), you should instead
      use STATIC_CLOSURE(foobar).  For convenience, a number of these
      old identifiers are macro'd.
      Across C-- and C (Windows and otherwise), there were differing
      conventions on whether or not foobar_closure or &foobar_closure
      was the address of the closure.  Now, all foobar_closure references
      are addresses, and no & is necessary.
      CHARLIKE/INTLIKE were not changed, simply alpha-renamed.
      Part of remove HEAP_ALLOCED patch set (#8199)
      Depends on D265
      Signed-off-by: Edward Z. Yang <ezyang@mit.edu>
      Test Plan: validate
      Reviewers: simonmar, austin
      Subscribers: simonmar, ezyang, carter, thomie
      Differential Revision: https://phabricator.haskell.org/D267
      GHC Trac Issues: #8199
      Implement {resize,shrink}MutableByteArray# primops · 246436f1
      The two new primops with the type-signatures
        resizeMutableByteArray# :: MutableByteArray# s -> Int#
                                -> State# s -> (# State# s, MutableByteArray# s #)
        shrinkMutableByteArray# :: MutableByteArray# s -> Int#
                                -> State# s -> State# s
      allow resizing MutableByteArray#s in place (when possible), and are useful
      for algorithms where memory is temporarily over-allocated. The motivating
      use-case is for implementing integer backends, where the final target size of
      the result is either N or N+1, and only known after the operation has been
      A future commit will implement a stateful variant of the
      `sizeofMutableByteArray#` operation (see #9447 for details), since now the
      size of a `MutableByteArray#` may change over its lifetime (i.e. before
      it gets frozen or GCed).
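      The over-allocate-then-shrink pattern described above can be sketched
      in plain C. This is only a model, not the RTS implementation: the
      ByteArray struct and the three function names are hypothetical, and
      the copy-on-grow fallback is an assumption about what happens when an
      in-place resize is not possible.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a mutable byte array: size header + payload. */
typedef struct {
    size_t size;            /* current logical size in bytes */
    unsigned char data[];   /* payload follows the header */
} ByteArray;

static ByteArray *new_byte_array(size_t size) {
    ByteArray *arr = malloc(sizeof(ByteArray) + size);
    assert(arr != NULL);
    arr->size = size;
    memset(arr->data, 0, size);
    return arr;
}

/* Shrinking only lowers the recorded size; the payload never moves. */
static void shrink_byte_array(ByteArray *arr, size_t new_size) {
    assert(new_size <= arr->size);
    arr->size = new_size;
}

/* Resizing shrinks in place when it can; otherwise (an assumption of this
 * sketch) it copies into a fresh array. Only the returned pointer may be
 * used afterwards. */
static ByteArray *resize_byte_array(ByteArray *arr, size_t new_size) {
    if (new_size <= arr->size) {
        shrink_byte_array(arr, new_size);
        return arr;
    }
    ByteArray *bigger = new_byte_array(new_size);
    memcpy(bigger->data, arr->data, arr->size);
    free(arr);
    return bigger;
}
```

      An integer backend would allocate N+1 limbs up front, compute the
      result, and then shrink to N if the top limb turned out to be unused.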
      Test Plan: ./validate --slow
      Reviewers: ezyang, austin, simonmar
      Reviewed By: austin, simonmar
      Differential Revision: https://phabricator.haskell.org/D133
      Add SmallArray# and SmallMutableArray# types · 90329b6c
      These array types are smaller than Array# and MutableArray# and are
      faster when the array size is small, as they don't have the overhead
      of a card table. Having no card table reduces the closure size by 2
      words in the typical small array case and leads to less work when
      updating or GC'ing the array.
      Reduces both the runtime and memory allocation by 8.8% on my insert
      benchmark for the HashMap type in the unordered-containers package,
      which makes use of lots of small arrays. With tuned GC settings
      (i.e. `+RTS -A6M`) the runtime reduction is 15%.
      Fixes #8923.
      Make copy array ops out-of-line by default · e54828bf
      This should reduce code size when there's little to gain from inlining
      these primops, while still retaining the inlining benefit when the
      size of the copy is known statically.
      codeGen: inline allocation optimization for clone array primops · 1eece456
      The inline allocation version is 69% faster than the out-of-line
      version, when cloning an array of 16 unit elements on a 64-bit
      Comparing the new and the old primop implementations isn't
      straightforward. The old version had a missing heap check that I
      discovered during the development of the new version. Comparing the
      old and the new version would require fixing the old version, which
      in turn means reimplementing the equivalent of MAYBE_CG in StgCmmPrim.
      The inline allocation threshold is configurable via
      -fmax-inline-alloc-size which gives the maximum array size, in bytes,
      to allocate inline. The size does not include the closure header size.
      Allowing the same primop to be either inline or out-of-line has some
      implication for how we lay out heap checks. We always place a heap
      check around out-of-line primops, as they may allocate outside of our
      knowledge. However, for the inline primops we only allow allocation
      via the standard means (i.e. virtHp). Since the clone primops might be
      either inline or out-of-line the heap check layout code now consults
      shouldInlinePrimOp to know whether a primop will be inlined.
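      The threshold decision can be illustrated with a small C model. The
      function names and the 128-byte limit used in the example are
      hypothetical; the sketch only assumes what the text states, namely
      that the limit is in bytes, excludes the closure header, and that
      out-of-line primops get their own heap check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Payload-only size test: the closure header is deliberately not counted,
 * matching the description of -fmax-inline-alloc-size. The element count
 * is assumed to be known at compile time. */
static bool should_inline_clone(size_t n_elems, size_t word_bytes,
                                size_t max_inline_alloc_bytes) {
    return n_elems * word_bytes <= max_inline_alloc_bytes;
}

/* An out-of-line primop may allocate behind the code generator's back,
 * so it needs a heap check placed around the call. An inlined one
 * allocates through the normal heap pointer and shares the enclosing
 * check. */
static bool needs_own_heap_check(size_t n_elems, size_t word_bytes,
                                 size_t max_inline_alloc_bytes) {
    return !should_inline_clone(n_elems, word_bytes, max_inline_alloc_bytes);
}
```

      With 8-byte words and a limit of 128 bytes, a 16-element clone is
      inlined while a 17-element clone falls back to the out-of-line path.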
      Remove use of R9, and fix associated bugs · 11b5ce55
      We were passing the function address to stg_gc_prim_p in R9, which was
      wrong because the call was a high-level call and didn't declare R9 as
      a parameter.  Passing R9 as an argument is the right way, but
      unfortunately that exposed another bug: we were using the same macro
      in some low-level Cmm, where it is illegal to call functions with
      arguments (see Note [syntax of cmm files]).  So we now have low-level
      variants of STK_CHK() and STK_CHK_P() for use in low-level Cmm code.
      ticky enhancements · 460abd75
        * the new StgCmmArgRep module breaks a dependency cycle; I also
          untabified it, but made no real changes
        * updated the documentation in the wiki and changed the user guide to
          point there
        * moved the allocation enters for ticky and CCS to after the heap check
          * I left LDV where it was, which was before the heap check at least
            once, since I have no idea what it is
        * standardized all (active?) ticky alloc totals to bytes
        * in order to avoid double counting StgCmmLayout.adjustHpBackwards
          no longer bumps ALLOC_HEAP_ctr
        * I resurrected the SLOW_CALL counters
          * the new module StgCmmArgRep breaks cyclic dependency between
            Layout and Ticky (which the SLOW_CALL counters cause)
          * renamed them SLOW_CALL_fast_<pattern> and VERY_SLOW_CALL
        * added ALLOC_RTS_ctr and _tot ticky counters
          * eg allocation by Storage.c:allocate or a BUILD_PAP in stg_ap_*_info
          * resurrected ticky counters for ALLOC_THK, ALLOC_PAP, and
          * added -ticky and -DTICKY_TICKY in ways.mk for debug ways
        * added a ticky counter for total LNE entries
        * new flags for ticky: -ticky-allocd -ticky-dyn-thunk -ticky-LNE
          * all off by default
          * -ticky-allocd: tracks allocation *of* closure in addition to
             allocation *by* that closure
          * -ticky-dyn-thunk tracks dynamic thunks as if they were functions
          * -ticky-LNE tracks LNEs as if they were functions
        * updated the ticky report format, including making the argument
          categories (more?) accurate again
        * the printed name for things in the report includes the unique of
          their ticky parent as well as if they are not top-level
      STM: Only wake up once · a23661d2
      Previously, threads blocked on an STM retry would be sent a wakeup
      message each time an unpark was requested. This could result in the
      accumulation of a large number of wake-up messages, which would slow
      wake-up once the sleeping thread is finally scheduled.
      Here, we introduce a new closure type, STM_AWOKEN, which marks a TSO
      which has been sent a wake-up message, allowing us to send only one
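      The "mark once, send once" idea can be sketched in C as follows. This
      is a simplified model: the real change marks the TSO with the
      STM_AWOKEN closure type, whereas the Thread struct, the state enum
      and unpark() below are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { BLOCKED_ON_STM, AWOKEN, RUNNING } ThreadState;

typedef struct {
    ThreadState state;
    int messages_sent;   /* wake-up messages queued for this thread */
} Thread;

/* Request a wake-up. Returns true only if a message was actually sent;
 * repeated requests for an already-marked thread are dropped. */
static bool unpark(Thread *t) {
    if (t->state != BLOCKED_ON_STM) {
        return false;    /* already awoken (or not blocked): nothing to do */
    }
    t->state = AWOKEN;   /* the mark that suppresses further messages */
    t->messages_sent++;
    return true;
}
```

      However many times unpark() is called before the thread is scheduled,
      at most one message is queued for it.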
      Produce new-style Cmm from the Cmm parser · a7c0387d
      The main change here is that the Cmm parser now allows high-level cmm
      code with argument-passing and function calls.  For example:
      foo ( gcptr a, bits32 b )
        if (b > 0) {
           // we can make tail calls passing arguments:
           jump stg_ap_0_fast(a);
        return (x,y);
      More details on the new cmm syntax are in Note [Syntax of .cmm files]
      in CmmParse.y.
      The old syntax is still more-or-less supported for those occasional
      code fragments that really need to explicitly manipulate the stack.
      However there are a couple of differences: it is now obligatory to
      give a list of live GlobalRegs on every jump, e.g.
        jump %ENTRY_CODE(Sp(0)) [R1];
      Again, more details in Note [Syntax of .cmm files].
      I have rewritten most of the .cmm files in the RTS into the new
      syntax, except for AutoApply.cmm which is generated by the genapply
      program: this file could be generated in the new syntax instead and
      would probably be better off for it, but I ran out of enthusiasm.
      Some other changes in this batch:
       - The PrimOp calling convention is gone, primops now use the ordinary
         NativeNodeCall convention.  This means that primops and "foreign
         import prim" code must be written in high-level cmm, but they can
         now take more than 10 arguments.
       - CmmSink now does constant-folding (should fix #7219)
       - .cmm files now go through the cmmPipeline, and as a result we
         generate better code in many cases.  All the object files generated
         for the RTS .cmm files are now smaller.  Performance should be
         better too, but I haven't measured it yet.
       - RET_DYN frames are removed from the RTS, lots of code goes away
       - we now have some more canned GC points to cover unboxed-tuples with
         2-4 pointers, which will reduce code size a little.
      Some Win64 fixes · b0b76b2e
      Convert some sizes, as CLong is a different size to pointers
      Make profiling work with multiple capabilities (+RTS -N) · 50de6034
      This means that both time and heap profiling work for parallel
      programs.  Main internal changes:
        - CCCS is no longer a global variable; it is now another
          pseudo-register in the StgRegTable struct.  Thus every
          Capability has its own CCCS.
        - There is a new built-in CCS called "IDLE", which records ticks for
          Capabilities in the idle state.  If you profile a single-threaded
          program with +RTS -N2, you'll see about 50% of time in "IDLE".
        - There is appropriate locking in rts/Profiling.c to protect the
          shared cost-centre-stack data structures.
      This patch does enough to get it working, but I have cut one big corner:
      the cost-centre-stack data structure is still shared amongst all
      Capabilities, which means that multiple Capabilities will race when
      updating the "allocations" and "entries" fields of a CCS.  Not only
      does this give unpredictable results, but it runs very slowly due to
      cache line bouncing.
      It is strongly recommended that you use -fno-prof-count-entries to
      disable the "entries" count when profiling parallel programs. (I shall
      add a note to this effect to the docs).
      Count allocations more accurately · db0c13a4
      The allocation stats (+RTS -s etc.) used to count the slop at the end
      of each nursery block (except the last) as allocated space; now we
      count the allocated words accurately.  This should make allocation
      figures more predictable, too.
      This has the side effect of reducing the apparent allocations by a
      small amount (~1%), so remember to take this into account when looking
      at nofib results.
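      The difference between the two accounting schemes can be shown with a
      small C model. The block size of 512 words and all names are
      hypothetical; the sketch only assumes that each nursery block records
      how many words were actually used.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_WORDS 512   /* hypothetical block size, in words */

typedef struct {
    size_t used_words;    /* words actually allocated in this block */
} NurseryBlock;

/* Old scheme: every block except the last is treated as completely full,
 * so the unused tail ("slop") is counted as allocation. */
static size_t count_with_slop(const NurseryBlock *blocks, size_t n) {
    if (n == 0) return 0;
    return (n - 1) * BLOCK_WORDS + blocks[n - 1].used_words;
}

/* New scheme: sum the words that were really allocated. */
static size_t count_exact(const NurseryBlock *blocks, size_t n) {
    size_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += blocks[i].used_words;
    }
    return total;
}
```

      For blocks using 508, 510 and 100 words, the old scheme reports 1124
      words and the exact count is 1118, a difference of 6 words of slop.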
      Implement stack chunks and separate TSO/STACK objects · f30d5273
      This patch makes two changes to the way stacks are managed:
      1. The stack is now stored in a separate object from the TSO.
      This means that it is easier to replace the stack object for a thread
      when the stack overflows or underflows; we don't have to leave behind
      the old TSO as an indirection any more.  Consequently, we can remove
      ThreadRelocated and deRefTSO(), which were a pain.
      This is obviously the right thing, but the last time I tried to do it
      it made performance worse.  This time I seem to have cracked it.
      2. Stacks are now represented as a chain of chunks, rather than
         a single monolithic object.
      The big advantage here is that individual chunks are marked clean or
      dirty according to whether they contain pointers to the young
      generation, and the GC can avoid traversing clean stack chunks during
      a young-generation collection.  This means that programs with deep
      stacks will see a big saving in GC overhead when using the default GC
      A secondary advantage is that there is much less copying involved as
      the stack grows.  Programs that quickly grow a deep stack will see big
      In some ways the implementation is simpler, as nothing special needs
      to be done to reclaim stack as the stack shrinks (the GC just recovers
      the dead stack chunks).  On the other hand, we have to manage stack
      underflow between chunks, so there's a new stack frame
      (UNDERFLOW_FRAME), and we now have separate TSO and STACK objects.
      The total amount of code is probably about the same as before.
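      The saving from clean/dirty chunks can be sketched in C. This is a
      toy model with hypothetical names (StackChunk, frames_to_scan); it
      assumes only what is described above: a thread's stack is a chain of
      chunks, and a young-generation collection can skip clean ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct StackChunk {
    bool dirty;                /* may contain pointers to the young generation */
    size_t n_frames;           /* frames stored in this chunk */
    struct StackChunk *next;   /* older chunk, reached on stack underflow */
} StackChunk;

/* Number of frames a young-generation collection has to traverse:
 * clean chunks are skipped entirely. */
static size_t frames_to_scan(const StackChunk *top) {
    size_t frames = 0;
    for (const StackChunk *c = top; c != NULL; c = c->next) {
        if (c->dirty) {
            frames += c->n_frames;
        }
    }
    return frames;
}
```

      With a deep stack where only the topmost chunk is dirty, the scan
      cost is proportional to that one chunk rather than to the whole
      stack.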
      There are new RTS flags:
         -ki<size> Sets the initial thread stack size (default 1k)  Egs: -ki4k -ki2m
         -kc<size> Sets the stack chunk size (default 32k)
         -kb<size> Sets the stack chunk buffer size (default 1k)
      -ki was previously called just -k, and the old name is still accepted
      for backwards compatibility.  These new options are documented.
      Remove the IND_OLDGEN and IND_OLDGEN_PERM closure types · 70a2431f
      These are no longer used: once upon a time they used to have different
      layout from IND and IND_PERM respectively, but that is no longer the
      case since we changed the remembered set to be an array of addresses
      instead of a linked list of closures.
      Fix #650: use a card table to mark dirty sections of mutable arrays · 0417404f
      The card table is an array of bytes, placed directly following the
      actual array data.  This means that array reading is unaffected, but
      array writing needs to read the array size from the header in order to
      find the card table.
      We use a bytemap rather than a bitmap, because updating the card table
      must be multi-thread safe.  Each byte refers to 128 entries of the
      array, but this is tunable by changing the constant
      MUT_ARR_PTRS_CARD_BITS in includes/Constants.h.
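      A minimal C sketch of this layout follows. The struct and function
      names are hypothetical, and CARD_BITS is set to 7 here simply because
      1 << 7 = 128 matches the figure quoted above. The write path reads
      the size from the header to locate the card table and then stores a
      whole byte, which needs no read-modify-write of neighbouring cards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CARD_BITS 7   /* 1 << 7 == 128 array entries per card byte */

typedef struct {
    size_t n_elems;    /* array size, stored in the header */
    void *payload[];   /* n_elems pointers, followed by the card bytes */
} MutArr;

static size_t card_count(size_t n_elems) {
    return (n_elems + ((size_t)1 << CARD_BITS) - 1) >> CARD_BITS;
}

/* The card table sits directly after the array data, so finding it
 * requires reading the size from the header. */
static uint8_t *card_table(MutArr *arr) {
    return (uint8_t *)&arr->payload[arr->n_elems];
}

static MutArr *new_mut_arr(size_t n_elems) {
    size_t bytes = sizeof(MutArr) + n_elems * sizeof(void *)
                 + card_count(n_elems);
    MutArr *arr = calloc(1, bytes);   /* all cards start clean (0) */
    assert(arr != NULL);
    arr->n_elems = n_elems;
    return arr;
}

/* Reads are unaffected; writes also mark the covering card dirty. */
static void write_elem(MutArr *arr, size_t i, void *value) {
    assert(i < arr->n_elems);
    arr->payload[i] = value;
    card_table(arr)[i >> CARD_BITS] = 1;
}
```

      A 300-element array has three cards; writing index 130 dirties only
      the second one, so the GC needs to rescan at most 128 entries rather
      than all 300.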
      GC refactoring, remove "steps" · 214b3663
      The GC had a two-level structure, G generations each of T steps.
      Steps are for aging within a generation, mostly to avoid premature
      Measurements show that more than 2 steps is almost never worthwhile,
      and 1 step is usually worse than 2.  In theory fractional steps are
      possible, so the ideal number of steps is somewhere between 1 and 3.
      GHC's default has always been 2.
      We can implement 2 steps quite straightforwardly by having each block
      point to the generation to which objects in that block should be
      promoted, so blocks in the nursery point to generation 0, and blocks
      in gen 0 point to gen 1, and so on.
      This commit removes the explicit step structures, merging generations
      with steps, thus simplifying a lot of code.  Performance is
      unaffected.  The tunable number of steps is now gone, although it may
      be replaced in the future by a way to tune the aging in generation 0.
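      The per-block destination idea can be sketched in C. The names are
      hypothetical, three generations are chosen arbitrarily, and the rule
      that the oldest generation promotes into itself is an assumption of
      this sketch rather than something stated above.

```c
#include <assert.h>

enum { NURSERY = -1, N_GENS = 3 };   /* hypothetical configuration */

typedef struct {
    int owner;      /* NURSERY or a generation number */
    int dest_gen;   /* generation that survivors in this block move to */
} Block;

/* Nursery blocks point at generation 0, gen g blocks at gen g + 1.
 * Assumption: the oldest generation points at itself. */
static Block make_block(int owner) {
    Block b;
    b.owner = owner;
    if (owner == NURSERY) {
        b.dest_gen = 0;
    } else if (owner + 1 < N_GENS) {
        b.dest_gen = owner + 1;
    } else {
        b.dest_gen = owner;
    }
    return b;
}

/* Where an object living in a block owned by `owner` ends up after
 * surviving one collection of that block. */
static int survive(int owner) {
    return make_block(owner).dest_gen;
}
```

      An object therefore survives two collections (nursery to gen 0, then
      gen 0 to gen 1) before reaching generation 1, which gives the
      two-step aging behaviour without any explicit step structures.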
      Make allocatePinned use local storage, and other refactorings · 5270423a
      This is a batch of refactoring to remove some of the GC's global
      state, as we move towards CPU-local GC.  
        - allocateLocal() now allocates large objects into the local
          nursery, rather than taking a global lock and allocating
          them in gen 0 step 0.
        - allocatePinned() was still allocating from global storage and
          taking a lock each time; now it uses local storage.
          (mallocForeignPtrBytes should be faster with -threaded).
        - We had a gen 0 step 0, distinct from the nurseries, which are
          stored in a separate nurseries[] array.  This is slightly strange.
          I removed the g0s0 global that pointed to gen 0 step 0, and
          removed all uses of it.  I think now we don't use gen 0 step 0 at
          all, except possibly when there is only one generation.  Possibly
          more tidying up is needed here.
        - I removed the global allocate() function, and renamed
          allocateLocal() to allocate().
        - the alloc_blocks global is gone.  MAYBE_GC() and
          doYouWantToGC() now check the local nursery only.
      RTS tidyup sweep, first phase · a2a67cd5
      The first phase of this tidyup is focussed on the header files, and in
      particular making sure we are exposing publicly exactly what we need
      to, and no more.
       - Rts.h now includes everything that the RTS exposes publicly,
         rather than a random subset of it.
       - Most of the public header files have moved into subdirectories, and
         many of them have been renamed.  But clients should not need to
         include any of the other headers directly, just #include the main
         public headers: Rts.h, HsFFI.h, RtsAPI.h.
       - All the headers needed for via-C compilation have moved into the
         stg subdirectory, which is self-contained.  Most of the headers for
         the rest of the RTS APIs have moved into the rts subdirectory.
       - I left MachDeps.h where it is, because it is so widely used in
         Haskell code.
       - I left a deprecated stub for RtsFlags.h in place.  The flag
         structures are now exposed by Rts.h.
       - Various internal APIs are no longer exposed by public header files.
       - Various bits of dead code and declarations have been removed
       - More gcc warnings are turned on, and the RTS code is more
       - More source files #include "PosixSource.h", and hence only use
         standard POSIX (1003.1c-1995) interfaces.
      There is a lot more tidying up still to do; this is just the first
      pass.  I also intend to standardise the names for external RTS APIs
      (e.g. use the rts_ prefix consistently), and declare the internal APIs
      as hidden for shared libraries.
