Introduce tag inference just before codegen.
Motivation
Codegen wise GHC treats ALL lifted references the same. Independent if they represent:
- a case binder
- a value, having been been evaluated already.
- a reference which can be proven to be tagged.
The result is that for each of these cases we need to first check for the presence of a tag. If no tag is present (only possible in case two!) we further need to enter the closure to get a tagged reference directly to the value.
The end effect is wasted work checking for tags we can prove to be present, and wasted code size by generating branches which will never be taken.
Proposal
The idea for an solution I am working one is rather simple. Right before we translate Stg to Cmm we run a pass on the STG AST which tries to determine whtther or not a given reference will be tagged at runtime.
For cases where we determine the reference to always be tagged we can skip checking for the presence of a tag and don't have to emit enter code.
For cases where we determine the reference to be always untagged we can directly enter a closure as well.
It seems rather simple, at least case binders and contents of strict fields are guaranteed to have been evaluated, so it seems straight forward to assume that they are also tagged.
Strict fields
An important source of information about "has this been evaluated yet" are strict fields.
However while their contents are guaranteed to have been evaluated when casing on the constructor they are NOT guaranteed to be tagged.
See #15155 (closed), #14677 (closed), #14626 (closed). This fact also seems to block #14373 (closed).
They can instead contain indirections (to evaluated values) or absentError expressions generated by WW. This happens very rarely, but once is enough for a program to crash so this makes them useless for the purpose of inferring taggedness of references.
The strict field invariant
Instead of accepting this fact we add a new invariant to GHC generated code: References to values stored in strict fields must be tagged.
The principle is again rather simple. If we would allocate a constructor with potentially untagged references in strict fields, we enter the referenced closure and store instead the result we get back. This is safe since again, the references ultimately already lead to an evaluated result. It's only required to get rid of the indirection and/or to tag the reference.
Further if we assume a strict field is allocated once and read once we do no additional work. While we will enter it before allocation we won't need to enter it when reading the strict field.
If we already have a tagged reference when allocating the constructor, or when it's allocated once but accessed often we remove redundant work.
Potential for regressions.
In order for this to be worse than the status quo it's required that:
- We allocate a constructor with strict fields.
- The constructors strict fields are never accessed.
- We can't prove that the references stored in the strict fields are tagged.
A special case of the above are top level constructors currently containing untagged references in strict fields. These need to be turned into thunks in order to allow tagging of the references before we return the constructor.
In practice this quite rare and does not seem to meaningfully impact performance.
The tricky parts
The tricky part here is that the analysis has to be good enough to ensure the potential regression case is rare. However I have a working prototype which successfully does so, so this does not seem to be an issue.
Impact
Performance
The biggest impact this has is when traversing data structures with strict spines.
Currently these tend to check for a lot of (always present) tags.
These checks can be completely for these traversals with this approach.
Things like membership tests for Data.Set are speed up by >10% in their benchmark suite in preliminary benchmarks.
There are also benefits in certain cases where a function will return tagged results "hidden" behind fields. eg given the following function:
func :: Int -> Either (Int,Double) a
func x =
let !foo = f x
!bar = g x
in Left (foo, bar)
A consumer of the result accessing the Double value will not need to check for a tag on the Tuple
or Double
value!
Other tickets
This will unblock #14373 (closed) which seems to depend on strict fields containing tagged references.
This will eliminate the possibility of untagged pointers being stored in strict fields: #15155 (closed).
This will eliminate many cases where we currently generate enter code for scrutinised values: #14626 (closed)
This will NOT fix #14677 (closed) , it will only work around it by entering untagged closures to make sure we only use the tagged result. But fixing #14677 (closed) will benefit this a work a lot so I will likely fix this as well in the course of implementing this.
Numbers
Even though this is mostly targeted at code making heavy use of strict fields, even nofib seems to benefit decently:
Click to expand
--------------------------------------------------------------------------------
Program Size Allocs Instrs Reads Writes
--------------------------------------------------------------------------------
CS +0.2% 0.0% -0.0% -0.0% 0.0%
CSD +0.2% 0.0% -0.0% -0.0% -0.0%
FS +0.2% 0.0% -0.0% -0.0% -0.0%
S +0.2% 0.0% +0.0% +0.0% +0.0%
VS +0.2% 0.0% -0.0% -0.0% -0.0%
VSD +0.2% 0.0% -0.0% -0.0% -0.0%
VSM +0.2% 0.0% -0.0% -0.0% -0.0%
anna +0.2% 0.0% -0.0% -0.0% -0.0%
ansi +0.2% 0.0% -0.0% -0.0% -0.0%
atom +0.2% 0.0% -0.0% -0.0% -0.1%
awards +0.2% 0.0% -0.0% -0.0% -0.0%
banner +0.2% 0.0% -0.0% -0.0% -0.0%
bernouilli +0.2% 0.0% -2.0% +0.0% -3.2%
binary-trees +0.2% 0.0% -0.0% -0.0% -0.0%
boyer +0.2% 0.0% -0.4% -0.1% -0.7%
boyer2 +0.2% 0.0% +0.1% +0.3% -0.3%
bspt +0.2% 0.0% -0.0% -0.0% -0.0%
cacheprof +0.2% 0.0% -0.0% +0.0% -0.1%
calendar +0.2% 0.0% -0.0% -0.0% -0.0%
cichelli +0.2% 0.0% -0.0% -0.0% -0.0%
circsim +0.2% 0.0% -0.0% -0.0% -0.0%
clausify +0.2% 0.0% -0.0% -0.0% -0.0%
comp_lab_zift +0.2% 0.0% -0.4% -0.3% -0.8%
compress +0.2% 0.0% -0.0% -0.0% -0.0%
compress2 +0.2% 0.0% -0.0% -0.0% -0.0%
constraints +0.2% 0.0% -0.0% -0.0% -0.0%
cryptarithm1 +0.2% 0.0% -0.0% -0.0% -0.0%
cryptarithm2 +0.2% 0.0% -0.0% -0.0% -0.0%
cse +0.2% 0.0% -0.1% -0.1% -0.1%
digits-of-e1 +0.2% 0.0% -3.6% +0.4% -5.3%
digits-of-e2 +0.2% 0.0% -2.7% +0.1% -3.4%
dom-lt -0.1% 0.0% +0.0% +0.3% -0.6%
eliza +0.2% 0.0% -0.0% -0.0% -0.0%
event +0.2% 0.0% -0.0% -0.0% -0.0%
exact-reals +0.2% 0.0% -3.0% +0.1% -4.7%
exp3_8 +0.2% 0.0% -0.0% -0.0% -0.0%
expert +0.2% 0.0% -0.0% -0.0% -0.0%
fannkuch-redux +0.2% 0.0% -0.0% -0.0% -0.0%
fasta +0.2% 0.0% -0.3% +0.3% +0.4%
fem +0.0% 0.0% -0.3% -0.2% -0.5%
fft +0.3% 0.0% -2.9% -2.0% -6.0%
fft2 +0.3% 0.0% -2.0% -1.5% -5.1%
fibheaps +0.2% 0.0% -0.0% -0.0% -0.0%
fish +0.2% 0.0% -0.0% -0.0% -0.0%
fluid +0.2% 0.0% -2.1% -1.4% -4.3%
fulsom +0.2% 0.0% -1.4% -0.6% -2.4%
gamteb +0.2% 0.0% -2.7% -0.3% -3.2%
gcd +0.2% 0.0% -4.0% +0.6% -5.1%
gen_regexps +0.2% 0.0% -0.0% -0.0% 0.0%
genfft +0.2% 0.0% -0.0% -0.0% -0.0%
gg +0.2% 0.0% -0.5% -0.1% -0.6%
grep +0.2% 0.0% -0.0% -0.0% -0.0%
hidden +0.2% 0.0% -0.9% -0.5% -1.5%
hpg +0.2% 0.0% -0.3% -0.1% -0.4%
ida +0.2% 0.0% -0.1% -0.1% -0.1%
infer +0.2% 0.0% -0.0% -0.0% +0.0%
integer +0.2% 0.0% -0.8% -0.0% -0.8%
integrate +0.2% 0.0% -0.0% -0.0% -0.0%
k-nucleotide +0.2% 0.0% -0.0% -0.0% -0.0%
kahan +0.2% 0.0% -0.0% -0.0% -0.0%
knights +0.2% 0.0% -0.0% -0.0% -0.0%
lambda +0.2% 0.0% -0.0% -0.0% -0.0%
last-piece +0.1% 0.0% -4.0% -0.2% -4.7%
lcss +0.2% 0.0% -0.1% -0.1% -0.2%
life +0.2% 0.0% -0.0% -0.0% -0.0%
lift +0.2% 0.0% -0.0% -0.0% -0.0%
linear +0.2% 0.0% -0.2% -0.0% -0.2%
listcompr +0.2% 0.0% -0.0% -0.0% -0.0%
listcopy +0.2% 0.0% -0.0% -0.0% -0.0%
maillist +0.2% 0.0% -0.0% -0.0% -0.0%
mandel +0.3% 0.0% -2.4% -2.4% -5.4%
mandel2 +0.2% 0.0% -0.0% -0.0% -0.0%
mate +0.2% 0.0% -0.0% -0.0% -0.0%
minimax +0.2% 0.0% -0.0% -0.0% -0.0%
mkhprog +0.2% 0.0% -0.0% -0.0% -0.0%
multiplier +0.2% 0.0% -0.0% -0.0% -0.0%
n-body +0.2% 0.0% -0.0% -0.0% -0.0%
nucleic2 +0.3% 0.0% +0.0% +0.0% +0.0%
para +0.2% 0.0% -0.0% -0.0% -0.0%
paraffins +0.2% 0.0% -0.9% -0.8% -1.2%
parser +0.2% 0.0% -0.0% +0.0% -0.0%
parstof +0.2% 0.0% -0.0% -0.0% -0.0%
pic +0.2% 0.0% -2.2% -1.2% -3.7%
pidigits +0.2% 0.0% -0.2% -0.0% -0.1%
power +0.2% +3.5% +0.7% +3.7% +1.9%
pretty +0.2% 0.0% -0.0% -0.0% -0.0%
primes +0.2% 0.0% -0.0% -0.0% -0.0%
primetest +0.2% 0.0% -1.0% -0.1% -1.7%
prolog +0.2% 0.0% -0.0% -0.0% -0.0%
puzzle +0.2% 0.0% +0.1% +1.2% -0.4%
queens +0.2% 0.0% -0.0% -0.0% -0.0%
reptile +0.2% 0.0% -0.1% -0.0% -0.1%
reverse-complem +0.2% 0.0% -0.0% -0.0% -0.0%
rewrite +0.2% 0.0% -0.0% -0.0% -0.0%
rfib +0.2% 0.0% -0.0% -0.0% -0.0%
rsa +0.2% 0.0% -1.1% -0.2% -1.9%
scc +0.2% 0.0% -0.0% -0.0% -0.0%
sched +0.2% 0.0% -0.0% -0.0% -0.0%
scs +0.1% +0.0% -0.7% +0.1% -0.9%
simple -0.1% 0.0% -2.3% -1.0% -4.2%
solid +0.2% 0.0% -1.8% -1.0% -3.2%
sorting +0.2% 0.0% -0.0% -0.0% -0.0%
spectral-norm +0.2% 0.0% -0.0% -0.0% -0.0%
sphere +0.2% 0.0% -0.2% -0.0% -0.4%
symalg +0.2% 0.0% -0.2% -0.1% -0.4%
tak +0.2% 0.0% -0.0% -0.0% -0.0%
transform +0.2% 0.0% +0.0% +0.0% -0.0%
treejoin +0.2% 0.0% -0.1% -0.1% -0.2%
typecheck +0.2% 0.0% -0.0% -0.0% -0.0%
veritas +0.2% 0.0% -0.0% -0.0% -0.1%
wang +0.2% 0.0% -0.0% -0.0% -0.0%
wave4main +0.2% 0.0% -12.2% -10.4% -22.9%
wheel-sieve1 +0.2% 0.0% -0.0% -0.0% -0.0%
wheel-sieve2 +0.2% 0.0% -0.0% -0.0% -0.0%
x2n1 +0.2% 0.0% -0.3% -0.0% -3.9%
--------------------------------------------------------------------------------
Min -0.1% 0.0% -12.2% -10.4% -22.9%
Max +0.3% +3.5% +0.7% +3.7% +1.9%
Geometric Mean +0.2% +0.0% -0.5% -0.2% -0.9%
For containers last I checked the overall runtime reduction for the benchmarks was in the order of 4%, but I haven't run them on the current iteration yet.
See also: https://docs.google.com/document/d/19r5Jzk-qfLfUXtQKd0_ZuIs5LoOk0Jg0TUeOa5OvW2c/edit#