Use source file hashes as well as mtimes when determining whether a file needs to be recompiled
Opening a new ticket based on discussion in #16495 (closed) on @bgamari's suggestion, since #16495 (closed) is slightly different.
Motivation
Currently, if a source file has a modification time which is newer than that of the corresponding object file, it is considered to need recompilation:
stableObject m =
all stableObject (imports m)
&& old linkable does not exist, or is == on-disk .o
&& date(on-disk .o) > date(.hs)
There are two minor problems with this approach:
- In practice, it does not appear to always be safe to rely on source files' timestamps. For example, moving source files around using tools like rsync or tar with flags to preserve mtimes can cause modified times to go backwards.
- If a source file has a newer mtime but its contents are untouched, GHC will unnecessarily rebuild the file. This can slow CI jobs down quite significantly; in the context of CI it is difficult to make effective use of caching.
Proposal
Record the timestamp and hash of the source file in the module interface after a successful compile of that module, and change stableObject as follows:
stableObject m =
all stableObject (imports m)
&& old linkable does not exist, or is == on-disk .o
&& ( date(on-disk .hs) == date recorded in .hi
|| hash(on-disk .hs) == hash recorded in .hi
)
That is, if the mtime recorded in the .hi matches the source file's mtime exactly, assume it is unchanged. Otherwise, hash the source file and compare with the hash recorded in the .hi file. Note that if the timestamps don't match but the hashes do, no rebuild is required but we should still update the recorded source timestamp in the .hi file, in order to permit skipping hashing on subsequent rebuilds.
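To make the three possible outcomes concrete, here is a minimal standalone sketch of the source-file part of the check. It is not GHC's actual code: RecordedSource, SourceStatus, classify, recordSource and checkSource are hypothetical names invented for illustration, and the sketch assumes that getFileHash from GHC.Fingerprint (in base) is an acceptable hash and that the mtime is stored as POSIX seconds.

```haskell
import Data.Time.Clock.POSIX (POSIXTime, utcTimeToPOSIXSeconds)
import GHC.Fingerprint (Fingerprint, getFileHash)
import System.Directory (getModificationTime)

-- | Hypothetical: the two facts about the source file that the proposal
-- would record in the .hi file after a successful compile.
data RecordedSource = RecordedSource
  { recMTime :: POSIXTime
  , recHash  :: Fingerprint
  } deriving (Eq, Show)

-- | Outcome of comparing the on-disk source file with the record.
data SourceStatus
  = Unchanged              -- ^ mtime matches; no hashing was needed
  | TouchedOnly POSIXTime  -- ^ hash matches but mtime differs; no rebuild,
                           --   but the recorded mtime should be refreshed
  | Modified               -- ^ both differ; the module must be recompiled
  deriving (Eq, Show)

-- | Pure decision, given the current mtime and hash of the source file.
classify :: RecordedSource -> POSIXTime -> Fingerprint -> SourceStatus
classify recorded mtime hash
  | mtime == recMTime recorded = Unchanged
  | hash == recHash recorded   = TouchedOnly mtime
  | otherwise                  = Modified

-- | Current modification time of a file, as POSIX seconds.
sourceMTime :: FilePath -> IO POSIXTime
sourceMTime path = utcTimeToPOSIXSeconds <$> getModificationTime path

-- | What would be written to the interface after a successful compile.
recordSource :: FilePath -> IO RecordedSource
recordSource path = RecordedSource <$> sourceMTime path <*> getFileHash path

-- | Check a source file against its record, hashing only if the mtime differs.
checkSource :: RecordedSource -> FilePath -> IO SourceStatus
checkSource recorded path = do
  mtime <- sourceMTime path
  if mtime == recMTime recorded
    then pure Unchanged
    else classify recorded mtime <$> getFileHash path
```

In this sketch a TouchedOnly result carries the new mtime, so the caller can write it back to the record and skip hashing on the next build.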
This does unfortunately require opening the .hi file during the stability check. It may be possible to restructure the information in the .hi file so that we only need to read a small portion of it at this stage, or to read it lazily, but I haven't completely understood the tradeoffs here yet.
It is possible that hashing is cheap enough that checking the source file modification time is not worth the bother. It will be easy to verify whether or not this is the case after implementing this proposal. If it turns out that the extra complexity of checking modification times as well as hashes isn't paying for itself, we might consider getting rid of modification time checks entirely.