Deserialising `mi_extra_decls` introduces a lot of duplication
Summary
When we deserialise `mi_extra_decls` from disk (for example, after compiling with `-fbyte-code-and-object-code`), we introduce many duplicates of the same `IfaceTyCon`. A two-level census with ghc-debug on the agda code base shows 8_654_045 live `IfaceTyCon` instances, roughly 8.65 million.
However, when we look at the actual `IfaceTyCon`s, we can see that a lot of them are duplicates:
#ConstrName,#NamePtr,#Element,#UniqueNumberOfElements,#`IfaceTyConInfo`-witnesses
IfaceTyCon,0x77a81438bcb0,351211,1,[0x77a8170abd98]
IfaceTyCon,0x77a81462b308,406823,2,[0x77a8170abd98,0x77a8170abed8]
IfaceTyCon,0x77a814388458,553151,1,[0x77a8170abed8]
IfaceTyCon,0x77a814629180,1685764,1,[0x77a8170abed8]
Essentially, this means that there are 1685764 live instances of the form `IfaceTyCon 0x77a814629180 0x77a8170abed8` on the heap.
In fact, there seem to be at most 6000 unique `IfaceTyCon`s (see the attached data).
We know that we have roughly 200MB of `IfaceTyCon`s alive:
ghc-9.9-inplace:GHC.Iface.Type:IfaceTyCon[ghc-9.9-inplace:GHC.Types.Name:Name,THUNK_1_0]:207697080:8654045:24:24.0
Full data: uniqueIfaceTyCon_sorted.txt
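To put the numbers side by side, here is a small back-of-the-envelope calculation. It assumes the trailing fields of the census line are total bytes, closure count, maximum size and average size, which is an interpretation consistent with 207697080 / 8654045 = 24 but is not stated in the report. It only counts the `IfaceTyCon` constructor closures themselves, not anything they point to. All names are made up for illustration.

```haskell
-- Figures taken from the ghc-debug census line in this report.
liveClosures :: Integer
liveClosures = 8654045

-- Assumed size of one IfaceTyCon closure in bytes (third field of the census line).
bytesPerClosure :: Integer
bytesPerClosure = 24

-- Upper bound on distinct IfaceTyCons mentioned in the summary.
uniqueUpperBound :: Integer
uniqueUpperBound = 6000

-- Bytes currently held by IfaceTyCon closures.
totalBytes :: Integer
totalBytes = liveClosures * bytesPerClosure

-- Bytes needed if every distinct IfaceTyCon were allocated exactly once.
idealBytes :: Integer
idealBytes = uniqueUpperBound * bytesPerClosure
```

Under these assumptions `totalBytes` matches the 207697080 bytes reported by the census, while `idealBytes` is 144000, so with perfect sharing the constructor closures would occupy a tiny fraction of the current footprint.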
Solutions
Reducing the duplication can be achieved in a number of ways.
One idea is to build deduplication into the serialisation of `ModIface` itself. Similarly to how we deduplicate `Name` and `FastString`, we could introduce an `IfaceTyCon` deduplication table. As we don't yet know what other data we might want to deduplicate in the future, it would make sense to refactor the serialisation logic of `ModIface`, replacing the special-cased `Name` and `FastString` deduplication with a generic deduplication mechanism. This would allow us to change what we deduplicate without rewriting everything each time we find something new to deduplicate. As a positive side effect, this should also noticeably reduce the size of the `ModIface` on disk.
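As a rough illustration of what a generic deduplication table could look like, here is a minimal, self-contained sketch. It is not GHC's `Binary` machinery: `TyConKey`, `DedupTable`, `intern`, `encode` and `decode` are hypothetical names, `TyConKey` is a simplified stand-in for `IfaceTyCon`, and lists of indices stand in for the on-disk byte stream. It uses `Data.Map.Strict` from the `containers` package.

```haskell
import qualified Data.Map.Strict as Map
import Data.List (mapAccumL, sortOn)

-- Hypothetical stand-in for IfaceTyCon: a name plus some info field.
data TyConKey = TyConKey { tcName :: String, tcInfo :: Int }
  deriving (Eq, Ord, Show)

-- Generic table assigning a small index to each distinct value seen so far.
data DedupTable a = DedupTable
  { nextIx  :: !Int
  , indexOf :: !(Map.Map a Int)
  }

emptyTable :: DedupTable a
emptyTable = DedupTable 0 Map.empty

-- Writer side: return the index of a value, allocating a new one on first sight.
intern :: Ord a => DedupTable a -> a -> (DedupTable a, Int)
intern t@(DedupTable n m) x =
  case Map.lookup x m of
    Just i  -> (t, i)
    Nothing -> (DedupTable (n + 1) (Map.insert x n m), n)

-- Serialise a sequence as (table entries in index order, one index per occurrence).
encode :: Ord a => [a] -> ([a], [Int])
encode xs =
  let (t, ixs) = mapAccumL intern emptyTable xs
      entries  = map fst (sortOn snd (Map.toList (indexOf t)))
  in (entries, ixs)

-- Reader side: every occurrence of an index refers to the same table entry,
-- so repeated values share one heap object instead of being rebuilt each time.
decode :: [a] -> [Int] -> [a]
decode entries ixs =
  let table = Map.fromList (zip [0 ..] entries)
  in map (table Map.!) ixs
```

Because the table is parameterised over any type with an `Ord` instance, the same code could in principle serve `Name`, `FastString` and `IfaceTyCon` alike; a real implementation would have to fit GHC's existing binary format and would more likely key on hashes or uniques than on `Ord`.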