Skip to content

Refactor and document unit management code

Sylvain Henry requested to merge hsyl20/ghc:hsyl20-unitid into master

Over the years the unit management code has been modified a lot to keep up with changes in Cabal (e.g. support for several library components in the same package), to integrate BackPack, etc. I found it very hard to understand as the terminology wasn't consistent, was referring to past concepts, etc.

The terminology is now explained as clearly as I could in the Note "About Units" in GHC.Unit and the code is refactored to reflect it. See the commits messages for more details.

-- Note [About Units]
-- ~~~~~~~~~~~~~~~~~~
--
-- Haskell users are used to manipulate Cabal packages. These packages are
-- identified by:
--    - a package name :: String
--    - a package version :: Version
--    - (a revision number, when they are registered on Hackage)
--
-- Cabal packages may contain several components (libraries, programs,
-- testsuites). In GHC we are mostly interested in libraries because those are
-- the components that can be depended upon by other components. Components in a
-- package are identified by their component name. Historically only one library
-- component was allowed per package, hence it didn't need a name. For this
-- reason, component name may be empty for one library component in each
-- package:
--    - a component name :: Maybe String
--
-- UnitId
-- ------
--
-- Cabal libraries can be compiled in various ways (different compiler options
-- or Cabal flags, different dependencies, etc.), hence using package name,
-- package version and component name isn't enough to identify a built library.
-- We use another identifier called UnitId:
--
--   package name             \
--   package version          |                       ________
--   component name           | hash of all this ==> | UnitId |
--   Cabal flags              |                       --------
--   compiler options         |
--   dependencies' UnitId     /
--
-- Fortunately GHC doesn't have to generate these UnitId: they are provided by
-- external build tools (e.g. Cabal) with `-this-unit-id` command-line parameter.
--
-- UnitIds are important because they are used to generate internal names
-- (symbols, etc.).
--
-- Wired-in units
-- --------------
--
-- Certain libraries are known to the compiler, in that we know about certain
-- entities that reside in these libraries. The compiler needs to declare static
-- Modules and Names that refer to units built from these libraries.
--
-- Hence UnitIds of wired-in libraries are fixed. Instead of letting Cabal chose
-- the UnitId for these libraries, their .cabal file use the following stanza to
-- force it to a specific value:
--
--    ghc-options: -this-unit-id ghc-prim    -- taken from ghc-prim.cabal
--
-- The RTS also uses entities of wired-in units by directly referring to symbols
-- such as "base_GHCziIOziException_heapOverflow_closure" where the prefix is
-- the UnitId of "base" unit.
--
-- Unit databases
-- --------------
--
-- Units are stored in databases in order to be reused by other codes:
--
--    UnitKey ---> UnitInfo { exposed modules, package name, package version
--                            component name, various file paths,
--                            dependencies :: [UnitKey], etc. }
--
-- Because of the wired-in units described above, we can't exactly use UnitIds
-- as UnitKeys in the database: if we did this, we could only have a single unit
-- (compiled library) in the database for each wired-in library. As we want to
-- support databases containing several different units for the same wired-in
-- library, we do this:
--
--    * for non wired-in units:
--       * UnitId = UnitKey = Identifier (hash) computed by Cabal
--
--    * for wired-in units:
--       * UnitKey = Identifier computed by Cabal (just like for non wired-in units)
--       * UnitId  = unit-id specified with -this-unit-id command-line flag
--
-- We can expose several units to GHC via the `package-id <UnitKey>`
-- command-line parameter. We must use the UnitKeys of the units so that GHC can
-- find them in the database.
--
-- GHC then replaces the UnitKeys with UnitIds by taking into account wired-in
-- units: these units are detected thanks to their UnitInfo (especially their
-- package name).
--
-- For example, knowing that "base", "ghc-prim" and "rts" are wired-in packages,
-- the following dependency graph expressed with UnitKeys (as found in the
-- database) will be transformed into a similar graph expressed with UnitIds
-- (that are what matters for compilation):
--
--    UnitKeys
--    ~~~~~~~~                             ---> rts-1.0-hashABC <--
--                                         |                      |
--                                         |                      |
--    foo-2.0-hash123 --> base-4.1-hashXYZ ---> ghc-prim-0.5.3-hashABC
--
--    UnitIds
--    ~~~~~~~                              ---> rts <--
--                                         |          |
--                                         |          |
--    foo-2.0-hash123 --> base ---------------> ghc-prim
--
--
-- Module signatures / indefinite units / instantiated units
-- ---------------------------------------------------------
--
-- GHC distinguishes two kinds of units:
--
--    * definite: units for which every module has an associated code object
--    (i.e. real compiled code in a .o/.a/.so/.dll/...)
--
--    * indefinite: units for which some modules are replaced by module
--    signatures.
--
-- Module signatures are a kind of interface (similar to .hs-boot files). They
-- are used in place of some real code. GHC allows real modules from other
-- units to be used to fill these module holes. The process is called
-- "unit/module instantiation".
--
-- You can think of this as polymorphism at the module level: module signatures
-- give constraints on the "type" of module that can be used to fill the hole
-- (where "type" means types of the exported module entitites, etc.).
--
-- Module signatures contain enough information (datatypes, abstract types, type
-- synonyms, classes, etc.) to typecheck modules depending on them but not
-- enough to compile them. As such, indefinite units found in databases only
-- provide module interfaces (the .hi ones this time), not object code.
--
-- To distinguish between indefinite and finite unit ids at the type level, we
-- respectively use 'IndefUnitId' and 'DefUnitId' datatypes that are basically
-- wrappers over 'UnitId'.
--
-- Unit instantiation
-- ------------------
--
-- Indefinite units can be instantiated with modules from other units. The
-- instantiating units can also be instantiated themselves (if there are
-- indefinite) and so on. The 'Unit' datatype represents a unit which may have
-- been instantiated:
--
--    data Unit = RealUnit DefUnitId
--              | VirtUnit InstantiatedUnit
--
-- 'InstantiatedUnit' has two interesting fields:
--
--    * instUnitInstanceOf :: IndefUnitId
--       -- ^ the indefinite unit that is instantiated
--
--    * instUnitInsts :: [(ModuleName,(Unit,ModuleName)]
--       -- ^ a list of instantiations, where an instantiation is:
--            (module hole name, (instantiating unit, instantiating module name))
--
-- A 'Unit' may be indefinite or definite, it depends on whether some holes
-- remain in the instantiated unit OR in the instantiating units (recursively).
--
-- Pretty-printing UnitId
-- ----------------------
--
-- GHC mostly deals with UnitIds which are some opaque strings. We could display
-- them when we pretty-print a module origin, a name, etc. But it wouldn't be
-- very friendly to the user because of the hash they usually contain. E.g.
--
--    foo-4.18.1:thelib-XYZsomeUglyHashABC
--
-- Instead when we want to pretty-print a 'UnitId' we query the database to
-- get the 'UnitInfo' and print something nicer to the user:
--
--    foo-4.18.1:thelib
--
-- We do the same for wired-in units.
--
-- Currently (2020-04-06), we don't thread the database into every function that
-- pretty-prints a Name/Module/Unit. Instead querying the database is delayed
-- until the `SDoc` is transformed into a `Doc` using the database that is
-- active at this point in time. This is an issue because we want to be able to
-- unload units from the database and we also want to support several
-- independent databases loaded at the same time (see #14335). The alternatives
-- we have are:
--
--    * threading the database into every function that pretty-prints a UnitId
--    for the user (directly or indirectly).
--
--    * storing enough info to correctly display a UnitId into the UnitId
--    datatype itself. This is done in the IndefUnitId wrapper (see
--    'UnitPprInfo' datatype) but not for every 'UnitId'. Statically defined
--    'UnitId' for wired-in units would have empty UnitPprInfo so we need to
--    find some places to update them if we want to display wired-in UnitId
--    correctly. This leads to a solution similar to the first one above.

If you want to get an idea of how it was before this patch, read Note "The identifier lexicon" in HEAD.

I plan to do more work to fix the issues mentioned at the end of the "About Units" note and to avoid mixing up UnitKey/UnitId as we do now, but this patch is already big so it's left as future work for now.

Edited by Sylvain Henry

Merge request reports