- Apr 05, 2024
-
-
For many, many years `GHCForeignImportPrim` has suffered from the rather restrictive limitation of not allowing any non-trivial types in arguments or results. This limitation was justified by the code generator allegedly barfing in the presence of such types. However, this restriction appears to originate well before the NCG rewrite, and the new NCG does not appear to have any trouble with such types (see the added `T24598` test). Lift this restriction. Fixes #24598.
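As a rough illustration of what this enables, here is a sketch of a `foreign import prim` declaration using a richer unlifted type than the old check was comfortable with. It is illustrative only: the Cmm symbol `stg_swapPairzh` and the `UPair` type are made up, the exact set of newly accepted types is not spelled out here, and the declaration would need a matching Cmm definition to link.

    {-# LANGUAGE GHCForeignImportPrim, UnliftedFFITypes, UnliftedDatatypes,
                 StandaloneKindSignatures, MagicHash #-}
    module PrimImportSketch where

    import GHC.Exts

    -- A user-defined boxed but unlifted type: the kind of "non-trivial"
    -- argument/result type the old restriction ruled out of caution.
    type UPair :: UnliftedType
    data UPair = UPair Int# Int#

    -- Hypothetical Cmm entry point; a declaration along these lines is the
    -- sort of thing admitted now that the restriction is lifted.
    foreign import prim "stg_swapPairzh"
      swapPair# :: UPair -> UPair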
-
- Apr 04, 2024
-
-
-
Before this patch:

    data ArgPat p
      = InvisPat (LHsType p)
      | VisPat (LPat p)

With this patch:

    data Pat p
      = ...
      | InvisPat (LHsType p)
      ...

And the same transformation in TH land. The rest of the changes are just updating code to handle the new AST and writing tests to check whether it is possible to create invalid states using TH.

Metric Increase: MultiLayerModulesTH_OneShot
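For context, here is a sketch of the kind of source syntax that gives rise to an `InvisPat` (illustrative only; it assumes a GHC with `TypeAbstractions` support for invisible patterns in function equations, and the function is made up):

    {-# LANGUAGE TypeAbstractions #-}
    module InvisPatSketch where

    -- The `@a` on the left-hand side is an invisible pattern binding a type
    -- variable; after this patch it is represented as an ordinary constructor
    -- of Pat rather than via a separate ArgPat wrapper.
    konst :: forall a b. a -> b -> a
    konst @a x _ = (x :: a)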
-
`Array` contains three additional `Word`s that we do not need in `FlatBag`. Move `FlatBag` to `SmallArray`. Extend the API of `SmallArray` with `sizeofSmallArray` and add common traversal functions, such as `mapSmallArray` and `foldMapSmallArray`. Additionally, allow users to force the elements of a `SmallArray` via `rnfSmallArray`.
-
Linked lists are notoriously memory-inefficient when all we do is traverse a structure. As 'UnlinkedBCO' has been identified as a data structure that impacts the overall memory usage of GHCi sessions, we avoid linked lists and prefer a flattened structure for storage. We introduce a new memory-efficient representation of sequential elements that has special support for the cases:

* Empty
* Singleton
* Tuple Elements

This improves sharing in the 'Empty' case and avoids the overhead of 'Array' until its constant overhead is justified.
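A minimal sketch of the representation idea (not GHC's actual definition: here `SmallArray` and `sizeofSmallArray` from the primitive package stand in for GHC's internal wrapper around `SmallArray#`, and the constructor names are made up):

    module FlatBagSketch where

    import Data.Primitive.SmallArray (SmallArray, sizeofSmallArray)

    -- Small shapes get dedicated constructors, so they pay no array overhead;
    -- the shared Empty constructor costs no allocation per use.
    data FlatBag a
      = EmptyFlatBag
      | UnitFlatBag !a
      | TupleFlatBag !a !a
      | FlatBag !(SmallArray a)

    sizeFlatBag :: FlatBag a -> Int
    sizeFlatBag EmptyFlatBag       = 0
    sizeFlatBag (UnitFlatBag _)    = 1
    sizeFlatBag (TupleFlatBag _ _) = 2
    sizeFlatBag (FlatBag arr)      = sizeofSmallArray arr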
-
-
-
- Apr 03, 2024
-
-
This fixes #24582, a small but long-standing bug
-
-
This MR started as: allow the simplifier to do more in one pass, arising from places I could see the simplifier taking two iterations where one would do. But it turned into a larger project, because these changes unexpectedly made inlining blow up, especially for join points in deeply-nested cases. The main changes are below. There are also many new or rewritten Notes.

Avoiding simplifying repeatedly
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
See Note [Avoiding simplifying repeatedly].

* The SimplEnv now has a seInlineDepth field, which says how deep in unfoldings we are. See Note [Inline depth] in Simplify.Env. Currently used only for the next point: avoiding repeatedly simplifying coercions.

* Avoid repeatedly simplifying coercions. See Note [Avoid re-simplifying coercions] in Simplify.Iteration. As you'll see from the Note, this makes use of the seInlineDepth.

* Allow Simplify.Iteration.simplAuxBind to inline used-once things. This is another part of Note [Post-inline for single-use things], and is really good for reducing simplifier iterations in situations like

      case K e of { K x -> blah }

  where x is used once in blah. (A small source-level sketch of this shape follows this entry.)

* Make GHC.Core.SimpleOpt.exprIsConApp_maybe do some simple case elimination. See Note [Case elim in exprIsConApp_maybe].

* Improve the case-merge transformation:
  - Move the main code to GHC.Core.Utils.mergeCaseAlts, to join filterAlts and friends. See Note [Merge Nested Cases] in GHC.Core.Utils.
  - Add a new case for tagToEnum#; see wrinkle (MC3).
  - Add a new case to look through join points: see wrinkle (MC4).

postInlineUnconditionally
~~~~~~~~~~~~~~~~~~~~~~~~~
* Allow Simplify.Utils.postInlineUnconditionally to inline variables that are used exactly once. See Note [Post-inline for single-use things].

* Do not postInlineUnconditionally join points, ever. Doing so does not reduce allocation, which is the main point, and with join points that are used a lot it can bloat code. See point (1) of Note [Duplicating join points] in GHC.Core.Opt.Simplify.Iteration.

* Do not postInlineUnconditionally a strict (demanded) binding. It will not allocate a thunk (it'll turn into a case instead), so again the main point of inlining it doesn't hold. Better to check per call site.

* Improve occurrence analysis for bottoming function calls, to help postInlineUnconditionally. See Note [Bottoming function calls] in GHC.Core.Opt.OccurAnal.

Inlining generally
~~~~~~~~~~~~~~~~~~
* In GHC.Core.Opt.Simplify.Utils.interestingCallContext, use RhsCtxt NonRecursive (not BoringCtxt) for a plain-seq case. See Note [Seq is boring]. Also, per wrinkle (SB1), inline in that `seq` context only for INLINE functions (UnfWhen guidance).

* In GHC.Core.Opt.Simplify.Utils.interestingArg,
  - return ValueArg for OtherCon [c1,c2,...], but
  - return NonTrivArg for OtherCon [].
  This makes a function a little less likely to inline if all we know is that the argument is evaluated, but nothing else.

* isConLikeUnfolding is no longer true for OtherCon {}. This propagates to exprIsConLike. Con-like-ness has /positive/ information.

Join points
~~~~~~~~~~~
* Be very careful about inlining join points. See these two long Notes:
      Note [Duplicating join points] in GHC.Core.Opt.Simplify.Iteration
      Note [Inlining join points] in GHC.Core.Opt.Simplify.Inline

* When making join points, don't do so if the join point is so small it will immediately be inlined; check uncondInlineJoin.

* In GHC.Core.Opt.Simplify.Inline.tryUnfolding, improve the inlining heuristics for join points. In general we /do not/ want to inline join points /even if they are small/. See Note [Duplicating join points] in GHC.Core.Opt.Simplify.Iteration. But sometimes we do: see Note [Inlining join points] in GHC.Core.Opt.Simplify.Inline; and the new isBetterUnfoldingThan function.

* Do not add an unfolding to a join point at birth. This is a tricky one and has a long Note [Do not add unfoldings to join points at birth]. It shows up in two places:
  - In mkDupableAlt, do not add an inlining.
  - (Trickier) In simplLetUnfolding, don't add an unfolding for a fresh join point.
  I am not fully satisfied with this, but it works and is well documented.

* In GHC.Core.Unfold.sizeExpr, make jumps small, so that we don't penalise having a non-inlined join point.

Performance changes
~~~~~~~~~~~~~~~~~~~
* Binary sizes fall by around 2.6%, according to nofib.

* Compile times improve slightly. Here are the figures over 1%. I investigated the biggest difference, in T18304. It's a very small module, just a few hundred nodes. The large percentage difference is due to a single function that didn't quite inline before, and does now, making code size a bit bigger. I decided the gains outweighed the losses.

  Metrics: compile_time/bytes allocated (changes over +/- 1%)
  ------------------------------------------------
      CoOpt_Singletons(normal)               -9.2% GOOD
      LargeRecord(normal)                   -23.5% GOOD
      MultiComponentModulesRecomp(normal)    +1.2%
      MultiLayerModulesTH_OneShot(normal)    +4.1% BAD
      PmSeriesS(normal)                      -3.8%
      PmSeriesV(normal)                      -1.5%
      T11195(normal)                         -1.3%
      T12227(normal)                        -20.4% GOOD
      T12545(normal)                         -3.2%
      T12707(normal)                         -2.1% GOOD
      T13253(normal)                         -1.2%
      T13253-spj(normal)                     +8.1% BAD
      T13386(normal)                         -3.1% GOOD
      T14766(normal)                         -2.6% GOOD
      T15164(normal)                         -1.4%
      T15304(normal)                         +1.2%
      T15630(normal)                         -8.2%
      T15630a(normal)                          NEW
      T15703(normal)                        -14.7% GOOD
      T16577(normal)                         -2.3% GOOD
      T17516(normal)                        -39.7% GOOD
      T18140(normal)                         +1.2%
      T18223(normal)                        -17.1% GOOD
      T18282(normal)                         -5.0% GOOD
      T18304(normal)                        +10.8% BAD
      T18923(normal)                         -2.9% GOOD
      T1969(normal)                          +1.0%
      T19695(normal)                         -1.5%
      T20049(normal)                        -12.7% GOOD
      T21839c(normal)                        -4.1% GOOD
      T3064(normal)                          -1.5%
      T3294(normal)                          +1.2% BAD
      T4801(normal)                          +1.2%
      T5030(normal)                         -15.2% GOOD
      T5321Fun(normal)                       -2.2% GOOD
      T6048(optasm)                         -16.8% GOOD
      T783(normal)                           -1.2%
      T8095(normal)                          -6.0% GOOD
      T9630(normal)                          -4.7% GOOD
      T9961(normal)                          +1.9% BAD
      WWRec(normal)                          -1.4%
      info_table_map_perf(normal)            -1.3%
      parsing001(normal)                     +1.5%

      geo. mean                              -2.0%
      minimum                               -39.7%
      maximum                               +10.8%

* Runtimes generally improve. In the testsuite, perf/should_run gives:

  Metrics: runtime/bytes allocated
  ------------------------------------------
      Conversions(normal)                    -0.3%
      T13536a(optasm)                       -41.7% GOOD
      T4830(normal)                          -0.1%
      haddock.Cabal(normal)                  -0.1%
      haddock.base(normal)                   -0.1%
      haddock.compiler(normal)               -0.1%

      geo. mean                              -0.8%
      minimum                               -41.7%
      maximum                                +0.0%

* For runtime, nofib is a better test. The news is mostly good. Here are the numbers more than +/- 0.1%:

      # bytes allocated
      ==========================++==========
      imaginary/digits-of-e1    ||  -14.40%
      imaginary/digits-of-e2    ||   -4.41%
      imaginary/paraffins       ||   -0.17%
      imaginary/rfib            ||   -0.15%
      imaginary/wheel-sieve2    ||   -0.10%
      real/compress             ||   -0.47%
      real/fluid                ||   -0.10%
      real/fulsom               ||   +0.14%
      real/gamteb               ||   -1.47%
      real/gg                   ||   -0.20%
      real/infer                ||   +0.24%
      real/pic                  ||   -0.23%
      real/prolog               ||   -0.36%
      real/scs                  ||   -0.46%
      real/smallpt              ||   +4.03%
      shootout/k-nucleotide     ||  -20.23%
      shootout/n-body           ||   -0.42%
      shootout/spectral-norm    ||   -0.13%
      spectral/boyer2           ||   -3.80%
      spectral/constraints      ||   -0.27%
      spectral/hartel/ida       ||   -0.82%
      spectral/mate             ||  -20.34%
      spectral/para             ||   +0.46%
      spectral/rewrite          ||   +1.30%
      spectral/sphere           ||   -0.14%
      ==========================++==========
      geom mean                 ||   -0.59%

  real/smallpt has a huge nest of local definitions, and I could not pin down a reason for a regression. But there are three big wins!

Metric Decrease:
    CoOpt_Singletons
    LargeRecord
    T12227
    T12707
    T13386
    T13536a
    T14766
    T15703
    T16577
    T17516
    T18223
    T18282
    T18923
    T21839c
    T20049
    T5321Fun
    T5030
    T6048
    T8095
    T9630
    T783
Metric Increase:
    MultiLayerModulesTH_OneShot
    T13253-spj
    T18304
    T18698a
    T9961
    T3294
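To make the simplAuxBind point above concrete, here is a minimal source-level sketch of the "used once" shape (illustrative only: the interesting transformation happens on Core, and the datatype and function here are made up for the example):

    module SingleUseSketch where

    data K = K Int

    -- Case-of-known-constructor binds x to e; because x occurs exactly once,
    -- the simplifier can now substitute it away in the same pass instead of
    -- needing a further iteration.
    f :: Int -> Int
    f e = case K e of
            K x -> x + 1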
-
Pure refactoring
-
This is a pure refactor
-
Ensure that WorkWrap preserves lambda binders, in the case of join points. Sadly I have forgotten why I made this change (it was while I was doing a lot of meddling in the Simplifier), but:

* it does no harm,
* it is slightly more efficient, and
* presumably it made something better!

Anyway, I have kept it in a separate commit.
-
When exploring compile-time regressions after meddling with the Simplifier, I discovered that GHC.HsToCore.Pmc.Solver.Types.trvVarInfo was very delicately balanced. It's a small, heavily used, overloaded function and it's important that it inlines. By a fluke it was inlining before, but at various times in my journey it stopped doing so. So I just added an INLINE pragma to it; no sense in depending on a delicately balanced fluke.
-
Eliminate a redundant case at birth. This sometimes reduces Simplifier iterations. See Note [Case elim in exprIsConApp_maybe].
-
-
See Note [Eta expanding through CallStacks] in GHC.Core.Opt.Arity. This is a one-line change that fixes an inconsistency:

    -      || isCallStackPredTy ty
    +      || isCallStackPredTy ty || isCallStackTy ty
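For context, here is a sketch of the two shapes involved, as I read the two predicate names (illustrative code, not from the patch): isCallStackPredTy matches the constraint form, an implicit-parameter predicate carrying a CallStack, while the added isCallStackTy check matches the bare CallStack type.

    module CallStackShapes where

    import GHC.Stack (CallStack, HasCallStack)

    -- Constraint form: HasCallStack is an implicit-parameter predicate
    -- whose payload is a CallStack.
    f :: HasCallStack => Int -> Int
    f = (+ 1)

    -- Bare-type form: a CallStack passed as an ordinary value argument.
    g :: CallStack -> Int -> Int
    g _ = (+ 1)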
-
See the new Note [Floating join point bindings].

* Completely get rid of the complicated join_ceiling nonsense, which I have never understood.
* Do not float join points at all, except perhaps to top level.
* Some refactoring around wantToFloat, to treat Rec and NonRec more uniformly.
-
* Make `mkSymCo` and `mkInstCo` smarter. Fixes #23642.
* Fix the return role of `SelCo` in the coercion optimiser. Fixes #23617.
* Make the coercion optimiser's `opt_trans_rule` work better for newtypes. Fixes #23619.
-
I can't think of any good reason that anything in this MR should have changed the number of allocations, up or down. (Yes, this is an empty commit.)

Metric Decrease: T12227
-
All the changes are in fact not changes at all. Previously, the IoSubSystem data type was defined in GHC.RTS.Flags and exported from both GHC.RTS.Flags and GHC.IO.SubSystem. Now, the data type is defined in GHC.IO.SubSystem and still exported from both modules. Therefore, the same exports and same instances are still available from both modules. But the base-exports records only the defining module, and so it looks like a change when it is fully compatible. Related: we do add a deprecation to the export of the type via GHC.RTS.Flags, telling people to use the export from GHC.IO.SubSystem. Also the sort order for some unrelated Show instances changed. No idea why. The same changes apply in the other versions, with a few more changes due to sort order weirdness.
-
Some GCC versions don't know about some warnings, and they complain that we're ignoring unknown warnings. So we try to ignore the warning based on the GCC version.
-
Previously it was unclear that they did not work, because the code path was shared with other I/O managers (in particular select()). Following the code carefully shows that what actually happens is that the calling thread would block forever: the thread is put into the blocked queue, but no other action is scheduled that will ever result in it getting unblocked. It's better to just fail loudly in case anyone accidentally calls it; it is also less confusing code.
-
Document the extra +RTS --info output in the user guide
-
Using the new tracer class. Note: The unconditional definition of showIOManager should be compatible with the debugTrace change in 7c7d1f66.

Co-authored-by: Pi Delport <pi@well-typed.com>
-
It's just to make sure an exception CAF is a GC root.
-
Provide an opaque (forward) definition in Capability.h (since the cap contains a *CapIOManager) and then only provide a full definition in a new file IOManagerInternals.h. This new file is only supposed to be included by the IOManager implementation, not by its users. So that means IOManager.c and individual I/O manager implementations. posix/Signals.c still needs direct access, but that should be eliminated. Anything that needs direct access either needs to be clearly part of an I/O manager (e.g. the select() one) or go via a proper API.
-
We need to select the I/O manager to use during startup before the per-cap I/O manager initialisation.
-
Used in setNumCapabilities. It only does anything for MIO on Posix. Previously it always invoked Haskell code, but that code only did anything on non-Windows (and non-JS), and only in the threaded RTS. That currently effectively means the MIO I/O manager on Posix. So now it only invokes the Haskell code for the MIO Posix case.
-
When the GC scavenges a TSO it needs to scavenge the tso->blocked_info, but the blocked_info is a big union and what lives there depends on the tso->why_blocked, which for I/O-related reasons is in principle the responsibility of the I/O manager and not the GC. So the right thing to do is for the GC to ask the I/O manager to scavenge the blocked_info if it encounters any I/O-related why_blocked reasons. So we add scavengeTSOIOManager in IOManager.{h,c} in the usual style. Now as it happens, right now, there is no special scavenging to do, so the implementation of scavengeTSOIOManager is a fancy no-op. That's because the select I/O manager uses only the fd and target members, which are not GC pointers, and the win32-legacy I/O manager _ought_ to be using GC-managed heap objects for the StgAsyncIOResult but is actually using the C heap, so again no GC pointers. If the win32-legacy manager were doing this more sensibly, then scavengeTSOIOManager would be the right place to do the GC magic. Future I/O managers will need GC heap objects in the tso->blocked_info and will make use of this functionality.
-
Use the standard #include {Begin,End}Private.h style rather than RTS_PRIVATE on individual decls. And conditionally build the code for the select I/O manager based on the new CPP IOMGR_ENABLED_SELECT rather than on THREADED_RTS.
-
These are now just called from IOManager.c and are the per-I/O manager backend impls (whereas previously awaitEvent was the entry point). Follow the new naming convention in the IOManager.{h,c} of awaitCompletedTimeoutsOrIO, with the I/O manager's name as a suffix: so awaitCompletedTimeoutsOrIO{Select,Win32}.
-
and have the scheduler use it. Previously the scheduler called awaitEvent directly, and awaitEvent was implemented directly in the RTS I/O managers (select, win32). This relies on the old scheme where there's a single active I/O manager for each platform and RTS way. We want to move that to go via an API in IOManager.{h,c}, which can then call out to the active I/O manager. Also take the opportunity to split awaitEvent in two. The existing awaitEvent has a bool wait parameter, to say if the call should be blocking or non-blocking. We split this into two separate functions: pollCompletedTimeoutsOrIO and awaitCompletedTimeoutsOrIO. We split them for a few reasons: they have different post-conditions (specifically, the await version is supposed to guarantee that there are runnable threads when it completes), and it is also anticipated that in future I/O managers the implementations of the two cases will be simpler if they are separated.
-
rather than directly operating on the I/O manager's data structures. Specifically, when throwing an async exception to a thread that is blocked waiting for I/O or waiting for a timer, we want to cancel that I/O wait or cancel the timer. Currently this is done directly in removeFromQueues() in RaiseAsync.c. We want it to go via proper APIs, both for modularity and to let us support multiple I/O managers. So add sync{IO,Delay}Cancel, which is the cancellation for the corresponding sync{IO,Delay}. The implementations of these use the usual "switch (iomgr_type)" style.
-
It makes sense now for it to be separate from the scheduler class of tracers. Enabled with +RTS -Do. Document the -Do debug flag in the user guide.
-
We have lots of functions with conditional implementations for different I/O managers. Some functions, for some I/O managers, naturally have implementations that do nothing or barf. When only one such I/O manager is enabled, the whole function will have an implementation that does nothing or barfs. This then results in warnings from gcc that parameters are unused, or that the function should be marked with attribute noreturn (since barf does not return). The USED_IF_THREADS trick for fine-grained warning suppression is fine for just two cases, but an equivalent here would need USED_IF_THE_ONLY_ENABLED_IOMGR_IS_X_OR_Y, which would have combinatorial blowup. So we take a coarse-grained approach and simply disable these two warnings for the whole file, using a GCC pragma with its handy push/pop support:

    #pragma GCC diagnostic push
    #pragma GCC diagnostic ignored "-Wsuggest-attribute=noreturn"
    #pragma GCC diagnostic ignored "-Wunused-parameter"
    ...
    #pragma GCC diagnostic pop
-
The implementation is eventually going to need to use more private things, which would drag unwanted includes into IOManager.h, so it's better to move the impl out of the header file and into the .c file, at the slight cost of it no longer being inline. At the same time, change to the "switch (iomgr_type)" style.
-
No longer defined in IOManager.h, just a private function in IOManager.c, since it is no longer called from cmm code, just from syncDelay. It ought to be moved further into the select() I/O manager impl, rather than living in IOManager.c. On the other hand, appendToIOBlockedQueue is still called from cmm code in the win32-legacy I/O manager primops async{Read,Write}#, and it is also used by the select() I/O manager. Update the CPP and comments to reflect this.
-
Moves it into IOManager.c, where we can follow the new pattern of switching on the selected I/O manager. Uses a new IOManager API: syncDelay, following the naming convention of sync* for thread-synchronous I/O and timer/delay operations. As part of porting from cmm to C, we maintain the rule that the why_blocked gets accessed using load-acquire and store-release atomic memory operations. There was one exception to this rule: in the delay# primop cmm code on posix (not win32), the why_blocked was being updated using a relaxed store, not a store release. I've no idea why. In this conversion I'm playing it safe and using store release consistently.
-
Moves it into IOManager.c, where we can follow the new pattern of switching on the selected I/O manager.
-