Get rid of HEAP_ALLOCED on Windows (and other non-Linux platforms)

Update for readers: we fixed this for Linux with #9706 (closed). But many systems (including Windows) can't handle reserving this much memory. Some other technique is necessary in these cases. This is one possibility.
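For context, here is a minimal sketch of why a large up-front reservation makes the check cheap: if the whole heap lives inside one contiguous reserved range, "is this pointer heap allocated?" becomes a single bounds comparison. The `Reservation` struct and `heap_alloced` function are hypothetical names for illustration, not the RTS's actual definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one contiguous reserved address range. */
typedef struct {
    uintptr_t base;  /* first address of the reservation */
    size_t    size;  /* length of the reservation in bytes */
} Reservation;

/* Returns nonzero iff p lies in [base, base + size).
 * The unsigned subtraction wraps for p < base, so one compare suffices. */
int heap_alloced(const Reservation *r, const void *p) {
    assert(r->size > 0);  /* an empty reservation would be a setup bug */
    return (uintptr_t)p - r->base < r->size;
}
```

This only works when the platform lets the process reserve the whole range in advance, which is the limitation described above.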

This bug is to track progress of removing HEAP_ALLOCED from GHC, promising faster GC (especially for large/scattered heaps), as long as we can keep the cost of indirections down.

The relevant wiki article: http://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/HeapAlloced ; we are implementing method 2. Version 2 of the patchset is probably correct.
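As a rough illustration of the indirection idea, assuming method 2 means that static closures are copied into runtime-managed memory at startup and that compiled code reaches them through an indirection cell: the sketch below uses a hypothetical two-word `Closure` and hypothetical names (`StaticIndirection`, `init_static_indirection`, `load_closure`), with `malloc` standing in for the block allocator. It is not the patchset's actual layout.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical closure layout; real closures carry an info pointer
 * and a variable-sized payload. */
typedef struct {
    const void *info;
    long        payload;
} Closure;

/* One indirection per static closure: a cell that is NULL until the
 * runtime initializes it, plus the read-only image to copy from. */
typedef struct {
    Closure       *cell;
    const Closure *image;
} StaticIndirection;

/* Stand-in for RTS startup: copy the image into runtime-owned memory. */
void init_static_indirection(StaticIndirection *ind) {
    assert(ind->cell == NULL);  /* initialize exactly once */
    Closure *copy = malloc(sizeof *copy);
    assert(copy != NULL);
    memcpy(copy, ind->image, sizeof *copy);
    ind->cell = copy;
}

/* What compiled code would do instead of naming the closure directly. */
Closure *load_closure(const StaticIndirection *ind) {
    assert(ind->cell != NULL);  /* closures are not valid before init */
    return ind->cell;
}
```

The second assertion mirrors the hs_main concern listed below: before initialization there is no valid closure to hand out, only the indirection.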

Blocking problems:

Properly handle the Windows DLL case (e.g. SRTs). We will probably have to reorganize how the indirections are laid out.
~~Make it work for GHCi linking of static objects.~~ Blocked on #2841 (closed). I have it working for ELF, and can make it work for other platforms as soon as I get access to the relevant machines.
Bikeshed hs_main API changes (because closures are not valid prior to RTS initialization, so you have to pass in an indirection instead)
Does not work with unloadObj (see comments below)

Performance improvements possible:

~~This patch introduces a lot of new symbols; ensure we are not unduly polluting the symbol tables. (In particular, I think _static_closure symbols can be made hidden).~~ ~~I've eliminated all of these except for the init symbols, which cross the stub object and assembly file boundary, and so would need to be made invisible by the linker.~~ I needed to make local info tables public.
Don't pay for a double indirection when -fPIC is turned on. Probably the easiest way to do this is to *not* bake the indirections into compiled code when it is -fPIC'd, and instead scribble over the GOT. However, I don't know how to go backwards from a symbol to a GOT entry, so we might need some heinous assembly hacks to get this working.
The old HEAP_ALLOCED is supposed to be pessimal on very large heaps. Do some performance tests under those workloads.
Make sure the extra indirection is not causing any C-- optimizations to stop firing (it might be, because I put it in as a literal CmmLoad)
Once a static thunk is updated, we can tag the indirection to let other code segments know the good news. One way to do this is for the update frame of a static indirection to hold a reference to the *indirection*, not the closure itself. However, this scheme will not affect other static closures which have references to the thunk.
Closure tables should have their indirections short-circuited by the initialization code. But maybe it is not worth the cost of telling the RTS about the closure tables (also, they would need to be made writeable).
We are paying an indirection when a GC occurs when the closure is not in R1. According to the wiki page, technically this is not needed, but I don't know how to eliminate references to the closure itself from stg_gc_fun.
~~Save tags inside the indirection tables, so that we don't spend instructions retagging after following the indirection.~~ Done.
~~Improve static indirection and stable pointer registration, avoiding binary bloat from `__attribute__((constructor))` stubs.~~ After discussing this with some folks, it seems that there isn't really a portable way to do this when we are using the system dynamic linker to load libraries at startup. The problem is that we need to somehow get a list of all the GHC-compiled libraries which got loaded, and really the easiest way to get that is to just build it ourselves.
~~Need to implement a new megablock tracking structure so we can free/check for lost blocks~~. Now that efficient lookup is not necessary, perhaps we can write-optimize the megablock tracking structures.
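The item about saving tags inside the indirection tables can be sketched as follows. This assumes 8-byte-aligned closures, so the low three bits of a pointer are free to carry the tag; `make_cell`, `cell_tag`, and `cell_closure` are hypothetical helpers, not RTS functions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: closures are 8-byte aligned, leaving 3 low bits for a tag. */
#define TAG_BITS 3
#define TAG_MASK (((uintptr_t)1 << TAG_BITS) - 1)

/* Store the closure address and its tag together in one cell value. */
uintptr_t make_cell(const void *closure, unsigned tag) {
    uintptr_t p = (uintptr_t)closure;
    assert((p & TAG_MASK) == 0);  /* address must be aligned */
    assert(tag <= TAG_MASK);      /* tag must fit in the low bits */
    return p | tag;
}

/* Recover the tag without touching the closure itself. */
unsigned cell_tag(uintptr_t cell) {
    return (unsigned)(cell & TAG_MASK);
}

/* Recover the untagged closure address. */
void *cell_closure(uintptr_t cell) {
    return (void *)(cell & ~TAG_MASK);
}
```

Because the cell already holds the tagged pointer, a single load through the indirection yields a value that can be used directly, with no separate retagging step.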

Speculative improvements:

Now that static closures live in blocks, can we GC them like we GC normal data? This would be beneficial for static thunks, which now can have their indirections completely removed; reverting CAFs may be somewhat tricky, however.
Now that HEAP_ALLOCED is greatly simplified, can we further simplify some aspects of the GC? At the very least, we ought to be able to make megablock allocation cheaper, by figuring out how to remove the spinlocks, etc.
Another possibility is to adopt a hybrid approach, where we manually lay out closures when compiling statically, and indirect when compiling dynamically. In some sense, this gets the best of both worlds, since we expect to not pay any extra cost for indirection due to PIC.

Edited Mar 10, 2019 by Edward Z. Yang
