Builds using GHC's LLVM backend are currently substantially slower than those using GHC's own native code generator. There are a few possible reasons for this:

1. the cost of forking processes and serializing/parsing LLVM's intermediate representation
2. the cost of the more powerful optimizations responsible for LLVM's (hopefully) better code generation
3. the cost of redundant optimizations that overlap effort already expended by GHC
Given that some architectures (e.g. ARM) are only supported by LLVM, and therefore suffer considerably at the hands of our slow builds, we should try to reduce (3).
Kavon, I noticed you mentioned evaluating which LLVM optimization passes are worthwhile in ticket:10074#comment:136021. This is the ticket I created a while back to track this. Feel free to leave notes here when you end up picking this up.
Ben, I think the autotuner is now ready to be applied to Haskell programs: it can successfully tune C/C++ programs, with or without running the output program, to match -O3 in a reasonable time.
If it's not too late, I can try to get something in for 8.4. My plan is to pick a few "representative" programs[*], tune on each one individually, and see which pass ordering does the best across nofib. If one of them does better than the -Ox passes, we could drop it in GHC pretty easily.
What do you think?
[*] Suggestions are welcome here. Infrastructure to tune against the "average" of several programs (using a special objective function) would be ideal, but that infrastructure is not ready yet.
I'll definitely get something in for 8.6, whether auto-generated or hand-crafted. I've had some good hand-crafted pass sequences sitting around for a while now.
Progress on the autotuner has stalled because (1) it sometimes finds weird bugs in LLVM, which I should just fix instead of trying to avoid them, and (2) I'll be working with some people this summer to create a proper autotuner for LLVM, which we can hopefully use instead of what I hacked together.
Hi Kavon, Ben,
Is there a way I could help out here? If this isn't stalled on other LLVM issues, I'd be happy to do the necessary admin around Kavon's hand-crafted stuff, or run random search on optimization flags to find a reasonable baseline.
Hi @baskaransri. Yes, I would be happy to help out with this. For a bit of history:
First, I started with infrastructure to automatically test hand-crafted optimization pass sequences, evaluating them using GHC itself. I believe the patch to have GHC read the passes to use landed in GHC proper, so you'll see that the script writes the llvm-passes file in GHC's root for GHC to read and use when the script invokes the nofib benchmark suite, saving the results. Each directory named roundN (for some number N) is a round of me running the script overnight to check the quality of the passes in that directory. I then manually inspected the results using nofib-compare etc., tried to guess what was influencing them, and created a new batch of pass sequences to test.
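For anyone reconstructing that loop, the core of it can be sketched roughly as follows. This is a hypothetical sketch, not the actual ghc-tuner script; the function name and flag strings are illustrative assumptions, but the llvm-passes file format (a Haskell-readable association list from -O level to opt flags) is the one GHC reads.

```python
# Hypothetical sketch (not the real ghc-tuner script): write a candidate
# pass sequence into the llvm-passes file in GHC's source root, which GHC
# consults to decide which opt flags to use at each optimization level.
from pathlib import Path


def write_llvm_passes(ghc_root, passes_by_level):
    # llvm-passes is a Haskell-readable association list mapping the
    # -O level to the opt flag string used at that level, e.g.
    # [(0, "-mem2reg"), (2, "-O2")]. Flag strings here are illustrative.
    entries = ", ".join(
        '({}, "{}")'.format(level, flags)
        for level, flags in sorted(passes_by_level.items())
    )
    contents = "[" + entries + "]\n"
    Path(ghc_root, "llvm-passes").write_text(contents)
    return contents
```

After writing the file, the script would invoke nofib with -fllvm and save the results for later comparison with nofib-compare.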
Then, I started working on some fancier infrastructure to automate the selection of passes using OpenTuner. That infrastructure was designed so that it could be used to tune GHC, but it is currently set up to tune C/C++. It also uses some fancy machinery, such as optimization statistics, to guide the tuner, so perhaps it's more of a research prototype repository at this stage.
My recommendation would be to take a look at the work on the hand-crafted optimization passes and perhaps select one or two of those to integrate into GHC (or develop your own). I recall at least one or two passes that are not in the default pipeline but are nonetheless useful for GHC: particularly, running -mergefunc prior to -inline. I can help you out with some of the pass sequence selection / clean-up if needed.
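As an illustration only (the exact flag strings are my guess at the shape, not a vetted sequence), an llvm-passes entry putting the function-merging pass ahead of the inliner might look like:

```
[(2, "-mergefunc -inline -simplifycfg -gvn")]
```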
I heard from @baskaransri a few days ago via email and they said:
I've gotten the ghc-tuner repo running on an AWS instance, and it's benchmarking your round6 configs (handopt/aggressive/regularinline) against llvm's O2 and O3.
They weren't sure whether to use gitlab or email with their question. For future updates / questions gitlab is fine. I definitely get the mails if I'm tagged in a post. :)
Indeed, @baskaransri, do try to keep communications on the issue tracker if possible. It makes it much easier to follow progress if things are all in one, publicly-accessible place.
I've run Kavon's round6 hand-written opt configurations against nofib with mode=slow on an AWS c5.large instance. The nofib analyses are here for ghc -O1 and here for ghc -O2. The gains/losses are highly variable, with (for example) gen_regexps regressing in runtime by 10%. On average, there doesn't seem to be a major difference between O2 and Kavon's opt sequences.
In these runs I've commented out latesimplifycfg, as it was replaced in LLVM 10; I will look into a replacement.
Next steps:
The FFT and simalg numbers are suspicious; I'll try rerunning them.
I'll see about replacing latesimplifycfg and whether that makes a difference. Kavon has suggested that not all the analysis passes need to be specified - am I correct in understanding that you think some of these might be redundant/duplicated from specifying them explicitly, @kavon?
Where's the best place to read about the opts/passes? The online docs don't seem to include tbaa, for example. I've been looking at the LLVM code comments.
I think the best place to read about the opts/passes and the reasoning behind the default Ox pipelines is the source code and comments.
Specifically, the (current) default old pass manager pipeline is defined starting in PassManagerBuilder::populateModulePassManager [1]. Technically clang also uses PassManagerBuilder::populateFunctionPassManager and runs those passes on each function prior to running the module passes, probably to clean up the totally unoptimized code. That shouldn't be necessary for GHC since it emits code that has already been optimized, though SROA is perhaps worth running at some point.
To find passes that are in LLVM but not part of the default pipeline, I think you have to dig through the source code under llvm/lib/Transforms and llvm/include/llvm/Transforms to learn more about the passes listed under --help-hidden, since the online documentation about what passes are available and what they do is definitely out-of-date.
As a side note, the "experimental" new pass manager infrastructure's pipeline is a port of the old pipeline and can be found under PassBuilder::buildPerModuleDefaultPipeline. There are at most one or two passes missing because they have not been upgraded yet, but it's basically complete.
It's important to note that it's not really "experimental" anymore. There have been recent calls to make it the default in clang, for example [4], and just in the past week there have been commits to opt that make it even easier to use the new PM.
> am I correct in understanding that you think some of these might be redundant/duplicated from specifying them explicitly, @kavon?
Yes. I believe it is not necessary to specify non-alias analysis passes such as:
-basiccg, -domtree, -demanded-bits, ... etc.
because any passes that require that information will force those analyses to be run.
Alias analysis is a bit different, since there seems to always be an aliasing-query infrastructure available during optimization. Each enabled alias-analysis type is asked, in a specified order (either fixed or user-specified), until one provides a solid yes/no answer; it is possible that none can answer the query definitively.
Thus, for the old pass manager, I believe it is only necessary to specify the alias-analysis types once in the pipeline. So analyses like -tbaa and -globals-aa could all be moved to the front of the pass list and specified just once. I don't know whether their order relative to each other matters.
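Concretely, under that reading, a hypothetical flag list like

```
-tbaa -inline -tbaa -gvn -globals-aa -simplifycfg
```

could be rewritten with the alias-analysis types hoisted to the front and listed once:

```
-tbaa -globals-aa -inline -gvn -simplifycfg
```

(The transformation passes here are illustrative, not a tested sequence.)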
opt by default uses the old pass manager, but the much more convenient and clear --aa-pipeline=<string> option is only effective if you tell opt to use the new pass manager. You use the new pass manager through opt by specifying all of the passes via the -passes=<string> option [3].
For example, you can specify a default pipeline like -O2 with the new pass manager via -passes="default<O2>" and tack on other passes around it by separating them with commas, like -passes="loop-vectorize,simplifycfg" etc. There's also the option of injecting passes into the default pipeline via extension points using the -passes-ep-*=<string> flags [2], which you can't do from opt with the old pass manager.
Frankly, I think using extension points plus a default pipeline is the right approach to creating a well-balanced optimization pipeline for GHC, since I think there are mostly just a few missing or disabled passes that GHC could make use of. In my experience doing each of those "rounds" in ghc-tuner and tracking down regressions like the one you've already spotted in gen_regexps, the main culprit has been that I omitted a pass that the program heavily relies on, even though including that pass doesn't hurt any of the other programs. The different extension points are documented here: https://github.com/llvm/llvm-project/blob/master/llvm/include/llvm/Passes/PassBuilder.h#L530
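For reference, an invocation along those lines might look like the following. The extension-point flag name is taken from opt's hidden options on recent LLVM; treat the exact spelling, and the choice of mergefunc as the injected pass, as assumptions to verify against opt --help-hidden:

```
opt --passes='default<O2>' --passes-ep-pipeline-start='mergefunc' \
    input.bc -o output.bc
```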
Also, I think the new pass manager has the benefit of yielding better compile times since it is much more efficient about whether there is a need to re-run individual analyses. This was one of the main motivations of creating a new pass manager. Reduced compile-times in itself could be a big win for the users of GHC. So it's worth looking into using the new pass manager while doing this tuning process.
The new pass manager seems fruitful: a quick manual test suggests --passes="default" is around 30% faster than the old-style -O2 on some representative files (though it does not produce identical output). I'm currently getting the GHC benchmarks working with the new pass manager to make sure there aren't any performance degradations.
My current issue is that trying to combine pass managers makes opt complain and give up. It looks like GHC passes old-style opt flags by default, which we override in llvm-passes (and via -optlo?); if we want to switch to the new pass manager, this will need to change, and it will also be incompatible with previous ways in which we've driven opt via -optlo. Is this fine?
I'll admit that I'm not familiar with the new pass manager. Nevertheless, 30% is quite promising.
> My current issue is that trying to combine pass managers makes opt complain and give up. It looks like GHC passes old-style opt flags by default, which we override in llvm-passes (and via -optlo?); if we want to switch to the new pass manager, this will need to change, and it will also be incompatible with previous ways in which we've driven opt via -optlo. Is this fine?
No objection from my end as long as the result is sound and faster than the status quo. Will this mean that we need to completely relinquish control over the passes run to LLVM? On the bright side, I suppose this would render this ticket moot. On the other hand, it would mean that we may be stuck with optimisation passes that don't pull their weight.
Apologies, things have been busy with my PhD starting, and are likely to continue to be slow.
We should be able to use a custom set of passes with the new pass manager.
I've dumped a bunch of the LLVM bitcode files from compiling GHC as a test set. With the old pass manager and -O2, they compile in 14 minutes on my test machine; with the new --passes="default<O2>", they compile in 10 minutes. I'm struggling to replicate this speedup in GHC proper.
What would be the best way to log exactly how opt is being called by GHC?
-v3 and above should make GHC print the commands it executes. You might want -keep-llvm-files and -keep-tmp-files to be able to re-execute those commands.
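For example (assuming a module Main.hs; the flags are as documented in the GHC User's Guide):

```
ghc -fllvm -v3 -keep-llvm-files -keep-tmp-files Main.hs
```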
I'm trying to manually override the opt flags in the code, and am a little confused.
In compiler/GHC/Driver/Pipeline.hs, I'm trying to directly override optFlag in the LlvmOpt phase with garbage to make sure it works like I think it does (optFlag = [GHC.SysTools.Option "--BLAHBLAHBLAH"]).
Despite this, I can build up to stage3 with the perf-llvm flavour, but the compiler produced will break if I try to use it to compile with -fllvm. I am surprised I can build stage3. Is this expected?