- Apr 12, 2024
-
-
Closes #24609.
-
As proposed in #21461. Closes #21540.
-
See #24634.
-
Fixes #24487
-
The implementation of `checkHomeUnitsClosed` was traversing every single path in the unit dependency graph - this grows exponentially and quickly grows to be infeasible on larger unit dependency graphs. Instead we replace this with a faster implementation which follows from the specificiation of the closure property - there is a closure error if there are units which are both are both (transitively) depended upon by home units and (transitively) depend on home units, but are not themselves home units. To compute the set of units required for closure, we first compute the closure of the unit dependency graph, then the transpose of this closure, and find all units that are reachable from the home units in the transpose of the closure.
-
Over the years the users guide's language extension documentation has gone through quite a few refactorings. In the process some of the descriptions have been rendered non-sensical. For instance, the description of `NoImplicitPrelude` actually describes the semantics of `ImplicitPrelude`. To fix this we: * ensure that all extensions are named in their "positive" sense (e.g. `ImplicitPrelude` rather than `NoImplicitPrelude`). * rework the documentation to avoid flag-oriented wording like "enable" and "disable" * ensure that the polarity of the documentation is consistent with reality. Fixes #23895.
-
- Apr 10, 2024
-
-
The XRec for [LIdP pass] is not needed for exact printing, remove it.
-
-
This commit makes the symbol cache that was used by the external interpreter available for the internal interpreter too. This follows from the analysis in #23415 that suggests the internal interpreter could benefit from this cache too, and that there is no good reason not to have the cache for it too. It also makes it a bit more uniform to have the symbol cache range over both the internal and external interpreter. This commit also refactors the cache into a function which is used by both `lookupSymbol` and also by `lookupSymbolInDLL`, extending the caching logic to `lookupSymbolInDLL` too.
-
Rewrite the implementation of `addDLL` as a wrapper around the more principled `loadNativeObj` rts linker function. The latter should be preferred while the former is preserved for backwards compatibility. `loadNativeObj` was previously only available on ELF platforms, so this commit further refactors the rts linker to transform loadNativeObj_ELF into loadNativeObj_POSIX, which is available in ELF and MachO platforms. The refactor made it possible to remove the `dl_mutex` mutex in favour of always using `linker_mutex` (rather than a combination of both). Lastly, we implement `loadNativeObj` for Windows too.
-
See the primary Note [Looking up symbols in the relevant objects] for a more in-depth explanation. When dynamically loading a Haskell symbol (typical when running a splice or GHCi expression), before this commit we would search for the symbol in all dynamic libraries that were loaded. However, this could be very inefficient when too many packages are loaded (which can happen if there are many package dependencies) because the time to lookup the would be linear in the number of packages loaded. This commit drastically improves symbol loading performance by introducing a mapping from units to the handles of corresponding loaded dlls. These handles are returned by dlopen when we load a dll, and can then be used to look up in a specific dynamic library. Looking up a given Name is now much more precise because we can get lookup its unit in the mapping and lookup the symbol solely in the handles of the dynamic libraries loaded for that unit. In one measurement, the wait time before the expression was executed went from +-38 seconds down to +-2s. This commit also includes Note [Symbols may not be found in pkgs_loaded], explaining the fallback to the old behaviour in case no dll can be found in the unit mapping for a given Name. Fixes #23415 Co-authored-by:
Rodrigo Mesquita <(@alt-romes)>
-
Fixes a memory leak in rts/linker/PEi386.c
-
- Apr 09, 2024
-
-
Hopefully the final fix needed for #24533
-
As there are many 'CompiledByteCode' objects alive during a GHCi session, representing its element in a more packed manner improves space behaviour at a minimal cost. When running GHCi on the agda codebase, we find around 380 live 'CompiledByteCode' objects. Packing their respective 'UnlinkedByteCode' can save quite some pointers.
-
- Apr 08, 2024
-
-
This allows us to use a NoCommentsLocation for the possibly trailing comma location in a StringLiteral. This in turn allows us to correctly roundtrip via makeDeltaAst.
-
This allows us to use an EpDelta if needed when using makeDeltaAst.
-
Thunk analysis showed that we have about 100_000 thunks (in agda and `-fwrite-simplified-core`) pointing to the name of the name decl. Forcing this thunk fixes this issue. The thunk created here is retained by the thunk created by forkM, it is better to eagerly force this because the result (a `Name`) is already retained indirectly via the `IfaceDecl`.
-
-
In `FastStringTable`, we count the number of z-encoded FastStrings that exist in a GHC session. We used to UNPACK the counters to not waste memory, but live retainer analysis showed that we allocate a lot of `FastMutInt`s, retained by `mkFastZString`. We lazily compute the `FastZString`, only incrementing the counter when the `FastZString` is forced. The function `mkFastStringWith` calls `mkZFastString` and boxes the `FastMutInt`, leading to the following core: mkFastStringWith = \ mk_fs _ -> = case stringTable of { FastStringTable _ n_zencs segments# _ -> ... case ((mk_fs (I# ...) (FastMutInt n_zencs)) `cast` <Co:2> :: ...) ... Marking this field as `NOUNPACK` avoids this reboxing, eliminating the allocation of a fresh `FastMutInt` on every `FastString` allocation.
-
`UnlinkedBCO`'s can occur many times in the heap. Each `UnlinkedBCO` references two `UArray`'s but never indexes them. They are only needed to encode the elements into a `ByteArray#`. The three words for the lower bound, upper bound and number of elements are essentially unused, thus we replace `UArray` with a wrapper around `ByteArray#`. This saves us up to three words for each `UnlinkedBCO`. Further, to avoid re-allocating these words for `ResolvedBCO`, we repeat the procedure for `ResolvedBCO` and add custom `Binary` and `Show` instances. For example, agda's repl session has around 360_000 UnlinkedBCO's, so avoiding these three words is already saving us around 8MB residency.
-
- Apr 05, 2024
-
-
-
So we can update it to a delta position in makeDeltaAst if needed.
-
This allows us to update them for makeDeltaAst in ghc-exactprint
-
For many, many years `GHCForeignImportPrim` has suffered from the rather restrictive limitation of not allowing any non-trivial types in arguments or results. This limitation was justified by the code generator allegely barfing in the presence of such types. However, this restriction appears to originate well before the NCG rewrite and the new NCG does not appear to have any trouble with such types (see the added `T24598` test). Lift this restriction. Fixes #24598.
-
- Apr 04, 2024
-
-
-
Before this patch: data ArgPat p = InvisPat (LHsType p) | VisPat (LPat p) With this patch: data Pat p = ... | InvisPat (LHsType p) ... And the same transformation in the TH land. The rest of the changes is just updating code to handle new AST and writing tests to check if it is possible to create invalid states using TH. Metric Increase: MultiLayerModulesTH_OneShot
-
`Array` contains three additional `Word`'s we do not need in `FlatBag`. Move `FlatBag` to `SmallArray`. Expand the API of SmallArray by `sizeofSmallArray` and add common traversal functions, such as `mapSmallArray` and `foldMapSmallArray`. Additionally, allow users to force the elements of a `SmallArray` via `rnfSmallArray`.
-
LinkedLists are notoriously memory ineffiecient when all we do is traversing a structure. As 'UnlinkedBCO' has been identified as a data structure that impacts the overall memory usage of GHCi sessions, we avoid linked lists and prefer flattened structure for storing. We introduce a new memory efficient representation of sequential elements that has special support for the cases: * Empty * Singleton * Tuple Elements This improves sharing in the 'Empty' case and avoids the overhead of 'Array' until its constant overhead is justified.
-
-
-
- Apr 03, 2024
-
-
This fixes #24582, a small but long-standing bug
-
-
This MR started as: allow the simplifer to do more in one pass, arising from places I could see the simplifier taking two iterations where one would do. But it turned into a larger project, because these changes unexpectedly made inlining blow up, especially join points in deeply-nested cases. The main changes are below. There are also many new or rewritten Notes. Avoiding simplifying repeatedly ~~~~~~~~~~~~~~~ See Note [Avoiding simplifying repeatedly] * The SimplEnv now has a seInlineDepth field, which says how deep in unfoldings we are. See Note [Inline depth] in Simplify.Env. Currently used only for the next point: avoiding repeatedly simplifying coercions. * Avoid repeatedly simplifying coercions. see Note [Avoid re-simplifying coercions] in Simplify.Iteration As you'll see from the Note, this makes use of the seInlineDepth. * Allow Simplify.Iteration.simplAuxBind to inline used-once things. This is another part of Note [Post-inline for single-use things], and is really good for reducing simplifier iterations in situations like case K e of { K x -> blah } wher x is used once in blah. * Make GHC.Core.SimpleOpt.exprIsConApp_maybe do some simple case elimination. Note [Case elim in exprIsConApp_maybe] * Improve the case-merge transformation: - Move the main code to `GHC.Core.Utils.mergeCaseAlts`, to join `filterAlts` and friends. See Note [Merge Nested Cases] in GHC.Core.Utils. - Add a new case for `tagToEnum#`; see wrinkle (MC3). - Add a new case to look through join points: see wrinkle (MC4) postInlineUnconditionally ~~~~~~~~~~~~~~~~~~~~~~~~~ * Allow Simplify.Utils.postInlineUnconditionally to inline variables that are used exactly once. See Note [Post-inline for single-use things]. * Do not postInlineUnconditionally join point, ever. Doing so does not reduce allocation, which is the main point, and with join points that are used a lot it can bloat code. See point (1) of Note [Duplicating join points] in GHC.Core.Opt.Simplify.Iteration. * Do not postInlineUnconditionally a strict (demanded) binding. It will not allocate a thunk (it'll turn into a case instead) so again the main point of inlining it doesn't hold. Better to check per-call-site. * Improve occurrence analyis for bottoming function calls, to help postInlineUnconditionally. See Note [Bottoming function calls] in GHC.Core.Opt.OccurAnal Inlining generally ~~~~~~~~~~~~~~~~~~ * In GHC.Core.Opt.Simplify.Utils.interestingCallContext, use RhsCtxt NonRecursive (not BoringCtxt) for a plain-seq case. See Note [Seq is boring] Also, wrinkle (SB1), inline in that `seq` context only for INLINE functions (UnfWhen guidance). * In GHC.Core.Opt.Simplify.Utils.interestingArg, - return ValueArg for OtherCon [c1,c2, ...], but - return NonTrivArg for OtherCon [] This makes a function a little less likely to inline if all we know is that the argument is evaluated, but nothing else. * isConLikeUnfolding is no longer true for OtherCon {}. This propagates to exprIsConLike. Con-like-ness has /positive/ information. Join points ~~~~~~~~~~~ * Be very careful about inlining join points. See these two long Notes Note [Duplicating join points] in GHC.Core.Opt.Simplify.Iteration Note [Inlining join points] in GHC.Core.Opt.Simplify.Inline * When making join points, don't do so if the join point is so small it will immediately be inlined; check uncondInlineJoin. * In GHC.Core.Opt.Simplify.Inline.tryUnfolding, improve the inlining heuristics for join points. In general we /do not/ want to inline join points /even if they are small/. See Note [Duplicating join points] GHC.Core.Opt.Simplify.Iteration. But sometimes we do: see Note [Inlining join points] in GHC.Core.Opt.Simplify.Inline; and the new `isBetterUnfoldingThan` function. * Do not add an unfolding to a join point at birth. This is a tricky one and has a long Note [Do not add unfoldings to join points at birth] It shows up in two places - In `mkDupableAlt` do not add an inlining - (trickier) In `simplLetUnfolding` don't add an unfolding for a fresh join point I am not fully satisifed with this, but it works and is well documented. * In GHC.Core.Unfold.sizeExpr, make jumps small, so that we don't penalise having a non-inlined join point. Performance changes ~~~~~~~~~~~~~~~~~~~ * Binary sizes fall by around 2.6%, according to nofib. * Compile times improve slightly. Here are the figures over 1%. I investiate the biggest differnce in T18304. It's a very small module, just a few hundred nodes. The large percentage difffence is due to a single function that didn't quite inline before, and does now, making code size a bit bigger. I decided gains outweighed the losses. Metrics: compile_time/bytes allocated (changes over +/- 1%) ------------------------------------------------ CoOpt_Singletons(normal) -9.2% GOOD LargeRecord(normal) -23.5% GOOD MultiComponentModulesRecomp(normal) +1.2% MultiLayerModulesTH_OneShot(normal) +4.1% BAD PmSeriesS(normal) -3.8% PmSeriesV(normal) -1.5% T11195(normal) -1.3% T12227(normal) -20.4% GOOD T12545(normal) -3.2% T12707(normal) -2.1% GOOD T13253(normal) -1.2% T13253-spj(normal) +8.1% BAD T13386(normal) -3.1% GOOD T14766(normal) -2.6% GOOD T15164(normal) -1.4% T15304(normal) +1.2% T15630(normal) -8.2% T15630a(normal) NEW T15703(normal) -14.7% GOOD T16577(normal) -2.3% GOOD T17516(normal) -39.7% GOOD T18140(normal) +1.2% T18223(normal) -17.1% GOOD T18282(normal) -5.0% GOOD T18304(normal) +10.8% BAD T18923(normal) -2.9% GOOD T1969(normal) +1.0% T19695(normal) -1.5% T20049(normal) -12.7% GOOD T21839c(normal) -4.1% GOOD T3064(normal) -1.5% T3294(normal) +1.2% BAD T4801(normal) +1.2% T5030(normal) -15.2% GOOD T5321Fun(normal) -2.2% GOOD T6048(optasm) -16.8% GOOD T783(normal) -1.2% T8095(normal) -6.0% GOOD T9630(normal) -4.7% GOOD T9961(normal) +1.9% BAD WWRec(normal) -1.4% info_table_map_perf(normal) -1.3% parsing001(normal) +1.5% geo. mean -2.0% minimum -39.7% maximum +10.8% * Runtimes generally improve. In the testsuite perf/should_run gives: Metrics: runtime/bytes allocated ------------------------------------------ Conversions(normal) -0.3% T13536a(optasm) -41.7% GOOD T4830(normal) -0.1% haddock.Cabal(normal) -0.1% haddock.base(normal) -0.1% haddock.compiler(normal) -0.1% geo. mean -0.8% minimum -41.7% maximum +0.0% * For runtime, nofib is a better test. The news is mostly good. Here are the number more than +/- 0.1%: # bytes allocated ==========================++========== imaginary/digits-of-e1 || -14.40% imaginary/digits-of-e2 || -4.41% imaginary/paraffins || -0.17% imaginary/rfib || -0.15% imaginary/wheel-sieve2 || -0.10% real/compress || -0.47% real/fluid || -0.10% real/fulsom || +0.14% real/gamteb || -1.47% real/gg || -0.20% real/infer || +0.24% real/pic || -0.23% real/prolog || -0.36% real/scs || -0.46% real/smallpt || +4.03% shootout/k-nucleotide || -20.23% shootout/n-body || -0.42% shootout/spectral-norm || -0.13% spectral/boyer2 || -3.80% spectral/constraints || -0.27% spectral/hartel/ida || -0.82% spectral/mate || -20.34% spectral/para || +0.46% spectral/rewrite || +1.30% spectral/sphere || -0.14% ==========================++========== geom mean || -0.59% real/smallpt has a huge nest of local definitions, and I could not pin down a reason for a regression. But there are three big wins! Metric Decrease: CoOpt_Singletons LargeRecord T12227 T12707 T13386 T13536a T14766 T15703 T16577 T17516 T18223 T18282 T18923 T21839c T20049 T5321Fun T5030 T6048 T8095 T9630 T783 Metric Increase: MultiLayerModulesTH_OneShot T13253-spj T18304 T18698a T9961 T3294
-
Pure refactoring
-
This is a pure refactor
-
Ensure that WorkWrap preserves lambda binders, in case of join points. Sadly I have forgotten why I made this change (it was while I was doing a lot of meddling in the Simplifier, but * it does no harm, * it is slightly more efficient, and * presumably it made something better! Anyway I have kept it in a separate commit.
-
When exploring compile-time regressions after meddling with the Simplifier, I discovered that GHC.HsToCore.Pmc.Solver.Types.trvVarInfo was very delicately balanced. It's a small, heavily used, overloaded function and it's important that it inlines. By a fluke it was before, but at various times in my journey it stopped doing so. So I just added an INLINE pragma to it; no sense in depending on a delicately-balanced fluke.
-
Eliminate a redundant case at birth. This sometimes reduces Simplifier iterations. See Note [Case elim in exprIsConApp_maybe].
-
-