Sorting out interface file hash semantics
While looking at #10424 I stumbled into the fact that I really don't understand the semantics of the hashes included in interface files (known as the "interface hash" and "ABI hash", respectively). Not only is there no comprehensive Note summarising the story but the definitions as they stand are quite counter-intuitive.
Currently, these two hashes are defined as follows (excerpting from `GHC.Iface.Recomp`):
the ABI hash depends on:
- decls
- export list
- orphans
- deprecations
- flag abi hash
The interface hash depends on:
- the ABI hash, plus
- the source file hash,
- the module level annotations,
- usages
- deps (home and external packages, dependent files)
- hpc
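To make the layering concrete, here is a minimal sketch of how the two hashes relate today. It is not the code in `GHC.Iface.Recomp`: the record types, field names and example values are all hypothetical, and each component is reduced to a pre-computed `Fingerprint` (from `GHC.Fingerprint` in `base`).

```haskell
import GHC.Fingerprint (Fingerprint, fingerprintFingerprints, fingerprintString)

-- Hypothetical stand-ins for the components listed above. In GHC these are
-- rich structures; here each one is assumed to be fingerprinted already.
data AbiInputs = AbiInputs
  { abiDecls        :: Fingerprint
  , abiExports      :: Fingerprint
  , abiOrphans      :: Fingerprint
  , abiDeprecations :: Fingerprint
  , abiFlags        :: Fingerprint
  }

data IfaceInputs = IfaceInputs
  { ifSourceHash  :: Fingerprint
  , ifAnnotations :: Fingerprint
  , ifUsages      :: Fingerprint
  , ifDeps        :: Fingerprint
  , ifHpc         :: Fingerprint
  }

-- The ABI hash combines only the ABI components.
abiHash :: AbiInputs -> Fingerprint
abiHash a = fingerprintFingerprints
  [abiDecls a, abiExports a, abiOrphans a, abiDeprecations a, abiFlags a]

-- The interface hash combines the ABI hash with the remaining components.
ifaceHash :: AbiInputs -> IfaceInputs -> Fingerprint
ifaceHash a i = fingerprintFingerprints
  [ abiHash a
  , ifSourceHash i, ifAnnotations i, ifUsages i, ifDeps i, ifHpc i
  ]

-- Made-up example values, for experimenting in GHCi.
exampleAbi :: AbiInputs
exampleAbi = AbiInputs
  { abiDecls        = fingerprintString "decls"
  , abiExports      = fingerprintString "exports"
  , abiOrphans      = fingerprintString "orphans"
  , abiDeprecations = fingerprintString "deprecations"
  , abiFlags        = fingerprintString "flags"
  }

exampleIface :: IfaceInputs
exampleIface = IfaceInputs
  { ifSourceHash  = fingerprintString "source"
  , ifAnnotations = fingerprintString "annotations"
  , ifUsages      = fingerprintString "usages"
  , ifDeps        = fingerprintString "deps"
  , ifHpc         = fingerprintString "hpc"
  }
```

In this model, editing only `ifSourceHash` changes `ifaceHash` but cannot change `abiHash`, whereas editing `abiDecls` changes both.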
To me these definitions seem very ad-hoc and counterintuitive. Specifically:
- the "ABI hash" captures things (e.g. fixities) that don't fit in any sensible definition of "application binary interface"
- the "interface hash" captures things that aren't really in the module's interface (e.g. the source file hash)
I have reflected on this a bit and I believe that the intended semantics would be clearer with the following renaming:
- what we call the "interface hash" should rather be called the input hash (since it captures information about the inputs to compilation)
- what we call the "ABI hash" would be more accurately called the interface hash (since it captures information about the final interface exposed by the compiled module)
Here is my attempt at formulating a Note describing this proposed story:
Note [Interface hashes and recompilation]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Interface files include two distinct hashes:
* the input hash (formerly known as the "interface hash") captures all of the
inputs involved in compilation. This includes:
- the hash of the source file
- the names of any files added with `addDependentFile`
- the interface hashes of all imported modules
- the implementation hashes of any plugin modules' object code
- if TemplateHaskell is enabled: the implementation hashes of all imported modules' object code
- the compiler flags
* the interface hash (formerly known as the "ABI hash") captures the interface
exposed by the resulting compiled module. This includes:
- the full names, fixities, types, and unfoldings of all top-level bindings
- the names and kinds of any types exposed by the module
- the names and representations of all data constructors defined by the module
- the rewrite rules defined by the module
- the names, method signatures, and default method definitions of any class methods
- the heads of any orphan instances defined by the module
- the interface hashes of any reexported declarations from other modules
- the module-level annotations
- any compiler flags that affect ABI (e.g. `-prof`)
During recompilation checking, the input hash computed from the current inputs
is compared against the input hash recorded in the interface file to determine
whether any of the inputs have changed in a way that necessitates recompilation.
In the discussion above, the "implementation hash" of a module refers to some hash
which captures its semantics. This could be implemented as any of the following
(in order from most conservative to most precise):
* the module's input hash
* a hash of its object code
* a hash of its full Core AST (including the right-hand sides of all bindings)
We currently use the object code hash for this role.
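As a rough illustration of the check the Note describes, the proposed input hash and the comparison could be modelled as below. This is only a sketch under my proposed naming: `CompilationInputs`, `inputHash`, `checkRecomp` and the example values are hypothetical and do not correspond to existing GHC definitions.

```haskell
import GHC.Fingerprint (Fingerprint, fingerprintFingerprints, fingerprintString)

-- Hypothetical record of everything the Note lists under the input hash.
data CompilationInputs = CompilationInputs
  { ciSourceHash        :: Fingerprint
  , ciDependentFiles    :: [FilePath]          -- names added via addDependentFile
  , ciImportIfaceHashes :: [Fingerprint]       -- interface hashes of imports
  , ciPluginImplHashes  :: [Fingerprint]       -- implementation hashes of plugins
  , ciThImplHashes      :: Maybe [Fingerprint] -- Nothing if TemplateHaskell is off
  , ciFlags             :: Fingerprint
  }

-- Each group is hashed separately before combining, so that moving an entry
-- from one group to another yields a different input hash.
inputHash :: CompilationInputs -> Fingerprint
inputHash ci = fingerprintFingerprints
  [ ciSourceHash ci
  , fingerprintFingerprints (map fingerprintString (ciDependentFiles ci))
  , fingerprintFingerprints (ciImportIfaceHashes ci)
  , fingerprintFingerprints (ciPluginImplHashes ci)
  , maybe (fingerprintString "th-disabled") fingerprintFingerprints (ciThImplHashes ci)
  , ciFlags ci
  ]

data RecompDecision = UpToDate | MustRecompile
  deriving (Eq, Show)

-- Compare the input hash recorded in the interface file against the hash of
-- the current inputs.
checkRecomp :: Fingerprint -> CompilationInputs -> RecompDecision
checkRecomp recorded current
  | recorded == inputHash current = UpToDate
  | otherwise                     = MustRecompile

-- Made-up example values, for experimenting in GHCi.
exampleInputs :: CompilationInputs
exampleInputs = CompilationInputs
  { ciSourceHash        = fingerprintString "module B source"
  , ciDependentFiles    = ["data/schema.json"]
  , ciImportIfaceHashes = [fingerprintString "interface of A"]
  , ciPluginImplHashes  = [fingerprintString "plugin object code"]
  , ciThImplHashes      = Nothing
  , ciFlags             = fingerprintString "-O1"
  }
```

The point of the split shows up in `ciImportIfaceHashes`: if an imported module's source changes but its interface hash does not, the importing module's input hash is unchanged and `checkRecomp` returns `UpToDate`.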
I believe this would be a self-consistent story and a strict improvement in comprehensibility over what we have today.