Commit 6a2264d2 authored by Michal Terepeta's avatar Michal Terepeta Committed by Ben Gamari
Browse files

cmm/CmmLayoutStack: avoid generating unnecessary reloads



This tries to be more precise when generating reloads of local
registers in proc points. Previously we'd reload all local registers
that were live. But we used liveness information that assumed local
registers survive native calls. For the purpose of reloading registers
this is an overapproximation and might lead to generating huge amounts
of unnecessary reloads (in case there's another proc point before the
register is used).

This change takes the approach of moving the generation of reloads to
a second pass over the Cmm, which allows to recompute the liveness and
can use the knowledge that local registers do *not* survive calls.
This leads to generating only useful reloads. For an extreme example
where this helps a lot please see T3294. This should also fix #7198

Finally, this re-introduces the code to do Cmm rewriting using in
`Dataflow` module (with the difference that we know operate on a whole
block at a time).
Signed-off-by: Michal Terepeta's avatarMichal Terepeta <michal.terepeta@gmail.com>

Reviewers: austin, bgamari, simonmar

Reviewed By: simonmar

Subscribers: kavon, rwbarton, thomie

GHC Trac Issues: #7198

Differential Revision: https://phabricator.haskell.org/D3586
parent a9b62a3e
{-# LANGUAGE CPP, RecordWildCards, GADTs #-}
{-# LANGUAGE BangPatterns, CPP, RecordWildCards, GADTs #-}
module CmmLayoutStack (
cmmLayoutStack, setInfoTableStackMap
) where
......@@ -60,6 +60,11 @@ We want to do stack allocation so that as far as possible
The algorithm we use is a variant of linear-scan register allocation,
where the stack is our register file.
We proceed in two passes, see Note [Two pass approach] for why they are not easy
to merge into one.
Pass 1:
- First, we do a liveness analysis, which annotates every block with
the variables live on entry to the block.
......@@ -80,9 +85,10 @@ where the stack is our register file.
- Look up the StackMap for this block.
- If this block is a proc point (or a call continuation, if we
aren't splitting proc points), emit instructions to reload all
the live variables from the stack, according to the StackMap.
- If this block is a proc point (or a call continuation, if we aren't
splitting proc points), we need to reload all the live variables from the
stack - but this is done in Pass 2, which calculates more precise liveness
information (see description of Pass 2).
- Walk forwards through the instructions:
- At an assignment x = Sp[loc]
......@@ -119,10 +125,52 @@ where the stack is our register file.
an input. I hate cyclic programming, but it's just too convenient
sometimes.)
There are plenty of tricky details: update frames, proc points, return
addresses, foreign calls, and some ad-hoc optimisations that are
convenient to do here and effective in common cases. Comments in the
code below explain these.
There are plenty of tricky details: update frames, proc points, return
addresses, foreign calls, and some ad-hoc optimisations that are
convenient to do here and effective in common cases. Comments in the
code below explain these.
Pass 2:
- Calculate live registers, but taking into account that nothing is live at the
entry to a proc point.
- At each proc point and call continuation insert reloads of live registers from
the stack (they were saved by Pass 1).
Note [Two pass approach]
The main reason for Pass 2 is being able to insert only the reloads that are
needed and the fact that the two passes need different liveness information.
Let's consider an example:
.....
\ /
D <- proc point
/ \
E F
\ /
G <- proc point
|
X
Pass 1 needs liveness assuming that local variables are preserved across calls.
This is important because it needs to save any local registers to the stack
(e.g., if register a is used in block X, it must be saved before any native
call).
However, for Pass 2, where we want to reload registers from stack (in a proc
point), this is overly conservative and would lead us to generate reloads in D
for things used in X, even though we're going to generate reloads in G anyway
(since it's also a proc point).
So Pass 2 calculates liveness knowing that nothing is live at the entry to a
proc point. This means that in D we only need to reload things used in E or F.
This can be quite important, for an extreme example see testcase for #3294.
Merging the two passes is not trivial - Pass 2 is a backward rewrite and Pass 1
is a forward one. Furthermore, Pass 1 is creating code that uses local registers
(saving them before a call), which the liveness analysis for Pass 2 must see to
be correct.
-}
......@@ -201,9 +249,14 @@ cmmLayoutStack dflags procpoints entry_args
layout dflags procpoints liveness entry entry_args
rec_stackmaps rec_high_sp blocks
new_blocks' <- mapM (lowerSafeForeignCall dflags) new_blocks
blocks_with_reloads <-
insertReloadsAsNeeded dflags procpoints final_stackmaps entry new_blocks
new_blocks' <- mapM (lowerSafeForeignCall dflags) blocks_with_reloads
return (ofBlockList entry new_blocks', final_stackmaps)
-- -----------------------------------------------------------------------------
-- Pass 1
-- -----------------------------------------------------------------------------
layout :: DynFlags
-> LabelSet -- proc points
......@@ -249,31 +302,25 @@ layout dflags procpoints liveness entry entry_args final_stackmaps final_sp_high
-- assignments in this block
let stack1 = foldBlockNodesF (procMiddle acc_stackmaps) middle0 stack0
-- (b) Insert assignments to reload all the live variables if this
-- block is a proc point
let middle1 = if entry_lbl `setMember` procpoints
then foldr blockCons middle0 (insertReloads stack0)
else middle0
-- (c) Look at the last node and if we are making a call or
-- (b) Look at the last node and if we are making a call or
-- jumping to a proc point, we must save the live
-- variables, adjust Sp, and construct the StackMaps for
-- each of the successor blocks. See handleLastNode for
-- details.
(middle2, sp_off, last1, fixup_blocks, out)
(middle1, sp_off, last1, fixup_blocks, out)
<- handleLastNode dflags procpoints liveness cont_info
acc_stackmaps stack1 tscope middle0 last0
-- (d) Manifest Sp: run over the nodes in the block and replace
-- (c) Manifest Sp: run over the nodes in the block and replace
-- CmmStackSlot with CmmLoad from Sp with a concrete offset.
--
-- our block:
-- middle1 -- the original middle nodes
-- middle2 -- live variable saves from handleLastNode
-- middle0 -- the original middle nodes
-- middle1 -- live variable saves from handleLastNode
-- Sp = Sp + sp_off -- Sp adjustment goes here
-- last1 -- the last node
--
let middle_pre = blockToList $ foldl blockSnoc middle1 middle2
let middle_pre = blockToList $ foldl blockSnoc middle0 middle1
let final_blocks =
manifestSp dflags final_stackmaps stack0 sp0 final_sp_high
......@@ -834,7 +881,6 @@ manifestSp dflags stackmaps stack0 sp0 sp_high
fixup_blocks' = map (mapBlock3' (id, adj_post_sp, id)) fixup_blocks
getAreaOff :: LabelMap StackMap -> (Area -> StackLoc)
getAreaOff _ Old = 0
getAreaOff stackmaps (Young l) =
......@@ -849,7 +895,8 @@ maybeAddSpAdj dflags sp_off block = block `blockSnoc` adj
where
adj = CmmAssign spReg (cmmOffset dflags (CmmReg spReg) sp_off)
{-
{- Note [SP old/young offsets]
Sp(L) is the Sp offset on entry to block L relative to the base of the
OLD area.
......@@ -990,6 +1037,62 @@ stackMapToLiveness dflags StackMap{..} =
, isGcPtrType (localRegType r) ]
-- See Note [Unique Determinism and code generation]
-- -----------------------------------------------------------------------------
-- Pass 2
-- -----------------------------------------------------------------------------
insertReloadsAsNeeded
:: DynFlags
-> ProcPointSet
-> LabelMap StackMap
-> BlockId
-> [CmmBlock]
-> UniqSM [CmmBlock]
insertReloadsAsNeeded dflags procpoints final_stackmaps entry blocks = do
toBlockList . fst <$>
rewriteCmmBwd liveLattice rewriteCC (ofBlockList entry blocks) mapEmpty
where
rewriteCC :: RewriteFun CmmLocalLive
rewriteCC (BlockCC e_node middle0 x_node) fact_base0 = do
let entry_label = entryLabel e_node
stackmap = case mapLookup entry_label final_stackmaps of
Just sm -> sm
Nothing -> panic "insertReloadsAsNeeded: rewriteCC: stackmap"
-- Merge the liveness from successor blocks and analyse the last
-- node.
joined = gen_kill dflags x_node $!
joinOutFacts liveLattice x_node fact_base0
-- What is live at the start of middle0.
live_at_middle0 = foldNodesBwdOO (gen_kill dflags) middle0 joined
-- If this is a procpoint we need to add the reloads, but only if
-- they're actually live. Furthermore, nothing is live at the entry
-- to a proc point.
(middle1, live_with_reloads)
| entry_label `setMember` procpoints
= let reloads = insertReloads dflags stackmap live_at_middle0
in (foldr blockCons middle0 reloads, emptyRegSet)
| otherwise
= (middle0, live_at_middle0)
-- Final liveness for this block.
!fact_base2 = mapSingleton entry_label live_with_reloads
return (BlockCC e_node middle1 x_node, fact_base2)
insertReloads :: DynFlags -> StackMap -> CmmLocalLive -> [CmmNode O O]
insertReloads dflags stackmap live =
[ CmmAssign (CmmLocal reg)
-- This cmmOffset basically corresponds to manifesting
-- @CmmStackSlot Old sp_off@, see Note [SP old/young offsets]
(CmmLoad (cmmOffset dflags (CmmReg spReg) (sp_off - reg_off))
(localRegType reg))
| (reg, reg_off) <- stackSlotRegs stackmap
, reg `elemRegSet` live
]
where
sp_off = sm_sp stackmap
-- -----------------------------------------------------------------------------
-- Lowering safe foreign calls
......@@ -1133,14 +1236,6 @@ toWords :: DynFlags -> ByteOff -> WordOff
toWords dflags x = x `quot` wORD_SIZE dflags
insertReloads :: StackMap -> [CmmNode O O]
insertReloads stackmap =
[ CmmAssign (CmmLocal r) (CmmLoad (CmmStackSlot Old sp)
(localRegType r))
| (r,sp) <- stackSlotRegs stackmap
]
stackSlotRegs :: StackMap -> [(LocalReg, StackLoc)]
stackSlotRegs sm = nonDetEltsUFM (sm_regs sm)
-- See Note [Unique Determinism and code generation]
......@@ -21,16 +21,20 @@ module Hoopl.Dataflow
( C, O, Block
, lastNode, entryLabel
, foldNodesBwdOO
, DataflowLattice(..), OldFact(..), NewFact(..), JoinedFact(..), TransferFun
, foldRewriteNodesBwdOO
, DataflowLattice(..), OldFact(..), NewFact(..), JoinedFact(..)
, TransferFun, RewriteFun
, Fact, FactBase
, getFact, mkFactBase
, analyzeCmmFwd, analyzeCmmBwd
, rewriteCmmBwd
, changedIf
, joinOutFacts
)
where
import Cmm
import UniqSupply
import Data.Array
import Data.List
......@@ -71,6 +75,14 @@ data Direction = Fwd | Bwd
type TransferFun f = CmmBlock -> FactBase f -> FactBase f
-- | Function for rewrtiting and analysis combined. To be used with
-- @rewriteCmm@.
--
-- Currently set to work with @UniqSM@ monad, but we could probably abstract
-- that away (if we do that, we might want to specialize the fixpoint algorithms
-- to the particular monads through SPECIALIZE).
type RewriteFun f = CmmBlock -> FactBase f -> UniqSM (CmmBlock, FactBase f)
analyzeCmmBwd, analyzeCmmFwd
:: DataflowLattice f
-> TransferFun f
......@@ -134,6 +146,74 @@ fixpointAnalysis direction lattice do_block entries blockmap = loop start
(updateFact join dep_blocks) (todo1, fbase1) out_facts
in loop todo2 fbase2
rewriteCmmBwd
:: DataflowLattice f
-> RewriteFun f
-> CmmGraph
-> FactBase f
-> UniqSM (CmmGraph, FactBase f)
rewriteCmmBwd = rewriteCmm Bwd
rewriteCmm
:: Direction
-> DataflowLattice f
-> RewriteFun f
-> CmmGraph
-> FactBase f
-> UniqSM (CmmGraph, FactBase f)
rewriteCmm dir lattice rwFun cmmGraph initFact = do
let entry = g_entry cmmGraph
hooplGraph = g_graph cmmGraph
blockMap1 =
case hooplGraph of
GMany NothingO bm NothingO -> bm
entries = if mapNull initFact then [entry] else mapKeys initFact
(blockMap2, facts) <-
fixpointRewrite dir lattice rwFun entries blockMap1 initFact
return (cmmGraph {g_graph = GMany NothingO blockMap2 NothingO}, facts)
fixpointRewrite
:: forall f.
Direction
-> DataflowLattice f
-> RewriteFun f
-> [Label]
-> LabelMap CmmBlock
-> FactBase f
-> UniqSM (LabelMap CmmBlock, FactBase f)
fixpointRewrite dir lattice do_block entries blockmap = loop start blockmap
where
-- Sorting the blocks helps to minimize the number of times we need to
-- process blocks. For instance, for forward analysis we want to look at
-- blocks in reverse postorder. Also, see comments for sortBlocks.
blocks = sortBlocks dir entries blockmap
num_blocks = length blocks
block_arr = {-# SCC "block_arr_rewrite" #-}
listArray (0, num_blocks - 1) blocks
start = {-# SCC "start_rewrite" #-} [0 .. num_blocks - 1]
dep_blocks = {-# SCC "dep_blocks_rewrite" #-} mkDepBlocks dir blocks
join = fact_join lattice
loop
:: IntHeap -- ^ Worklist, i.e., blocks to process
-> LabelMap CmmBlock -- ^ Rewritten blocks.
-> FactBase f -- ^ Current facts.
-> UniqSM (LabelMap CmmBlock, FactBase f)
loop [] !blocks1 !fbase1 = return (blocks1, fbase1)
loop (index : todo1) !blocks1 !fbase1 = do
-- Note that we use the *original* block here. This is important.
-- We're optimistically rewriting blocks even before reaching the fixed
-- point, which means that the rewrite might be incorrect. So if the
-- facts change, we need to rewrite the original block again (taking
-- into account the new facts).
let block = block_arr ! index
(new_block, out_facts) <- {-# SCC "do_block_rewrite" #-}
do_block block fbase1
let blocks2 = mapInsert (entryLabel new_block) new_block blocks1
(todo2, fbase2) = {-# SCC "mapFoldWithKey_rewrite" #-}
mapFoldWithKey
(updateFact join dep_blocks) (todo1, fbase1) out_facts
loop todo2 blocks2 fbase2
{-
......@@ -317,6 +397,39 @@ foldNodesBwdOO funOO = go
go BNil f = f
{-# INLINABLE foldNodesBwdOO #-}
-- | Folds backward over all the nodes of an open-open block and allows
-- rewriting them. The accumulator is both the block of nodes and @f@ (usually
-- dataflow facts).
-- Strict in both accumulated parts.
foldRewriteNodesBwdOO
:: forall f.
(CmmNode O O -> f -> UniqSM (Block CmmNode O O, f))
-> Block CmmNode O O
-> f
-> UniqSM (Block CmmNode O O, f)
foldRewriteNodesBwdOO rewriteOO initBlock initFacts = go initBlock initFacts
where
go (BCons node1 block1) !fact1 = (rewriteOO node1 `comp` go block1) fact1
go (BSnoc block1 node1) !fact1 = (go block1 `comp` rewriteOO node1) fact1
go (BCat blockA1 blockB1) !fact1 = (go blockA1 `comp` go blockB1) fact1
go (BMiddle node) !fact1 = rewriteOO node fact1
go BNil !fact = return (BNil, fact)
comp rew1 rew2 = \f1 -> do
(b, f2) <- rew2 f1
(a, !f3) <- rew1 f2
let !c = joinBlocksOO a b
return (c, f3)
{-# INLINE comp #-}
{-# INLINABLE foldRewriteNodesBwdOO #-}
joinBlocksOO :: Block n O O -> Block n O O -> Block n O O
joinBlocksOO BNil b = b
joinBlocksOO b BNil = b
joinBlocksOO (BMiddle n) b = blockCons n b
joinBlocksOO b (BMiddle n) = blockSnoc b n
joinBlocksOO b1 b2 = BCat b1 b2
-- -----------------------------------------------------------------------------
-- a Heap of Int
......
......@@ -167,7 +167,7 @@ test('T3294',
# 2015-07-11 43196344 (x86/Linux, 64-bit machine) use +RTS -G1
# 2016-04-06 28686588 (x86/Linux, 64-bit machine)
(wordsize(64), 63131248, 20)]),
(wordsize(64), 34050960, 20)]),
# prev: 25753192 (amd64/Linux)
# 29/08/2012: 37724352 (amd64/Linux)
# (increase due to new codegen, see #7198)
......@@ -186,6 +186,7 @@ test('T3294',
# 2016-07-11: 54609256 (Windows) before fix for #12227
# 2016-07-11: 52992688 (Windows) after fix for #12227
# 2017-02-17: 63131248 (amd64/Linux) Type indexed Typeable
# 2017-05-14: 34050960 (amd64/Linux) Two-pass CmmLayoutStack
compiler_stats_num_field('bytes allocated',
[(wordsize(32), 1377050640, 5),
......@@ -195,7 +196,7 @@ test('T3294',
# 2013-11-13: 1478325844 (x86/Windows, 64bit machine)
# 2014-01-12: 1565185140 (x86/Linux)
# 2013-04-04: 1377050640 (x86/Windows, 64bit machine)
(wordsize(64), 2758641264, 5)]),
(wordsize(64), 2253557280, 5)]),
# old: 1357587088 (amd64/Linux)
# 29/08/2012: 2961778696 (amd64/Linux)
# (^ increase due to new codegen, see #7198)
......@@ -209,6 +210,7 @@ test('T3294',
# 2016-07-11: 2664479936 (Windows) before fix for #12227
# 2016-07-11: 2739731144 (Windows) after fix for #12227 (ignoring)
# 2017-02-17: 2758641264 (amd64/Linux) (Type indexed Typeable)
# 2017-05-14: 2253557280 (amd64/Linux) Two-pass CmmLayoutStack
conf_3294,
# Use `+RTS -G1` for more stable residency measurements. Note [residency].
......@@ -538,7 +540,7 @@ test('T5321Fun',
# 2016-04-06: 279922360 x86/Linux
# 2017-03-24: 244387620 x86/Linux (64-bit machine)
(wordsize(64), 488295304, 5)])
(wordsize(64), 449577856, 5)])
# prev: 585521080
# 2012-08-29: 713385808 # (increase due to new codegen)
# 2013-05-15: 628341952 # (reason for decrease unknown)
......@@ -564,6 +566,7 @@ test('T5321Fun',
# 2017-01-31: 498135752 # Join points (#12988)
# 2017-02-23: 524706256 # Type-indexed Typeable? (on Darwin)
# 2017-02-25: 488295304 # Early inlining patch
# 2017-05-14: 449577856 # (amd64/Linxu) Two-pass CmmLayoutStack
],
compile,[''])
......@@ -578,7 +581,7 @@ test('T5321FD',
# (due to better optCoercion, 5e7406d9, #9233)
# 2016-04-06: 250757460 (x86/Linux)
(wordsize(64), 477840432, 10)])
(wordsize(64), 415136648, 10)])
# prev: 418306336
# 29/08/2012: 492905640
# (increase due to new codegen)
......@@ -599,6 +602,7 @@ test('T5321FD',
# D757: emit Typeable instances at site of type definition
# 2016-07-16: 477840432
# Optimize handling of built-in OccNames
# 2017-05-14: 415136648 (amd64/Linux) Two-pass CmmLayoutStack
],
compile,[''])
......@@ -1086,13 +1090,14 @@ test('T13056',
test('T12707',
[ compiler_stats_num_field('bytes allocated',
[(wordsize(64), 1231809592, 5),
[(wordsize(64), 1163821528, 5),
# initial: 1271577192
# 2017-01-22: 1348865648 Allow top-level strings in Core
# 2017-01-31: 1280336112 Join points (#12988)
# 2017-02-11: 1310037632 Check local family instances vs imports
# 2017-02-23: 1386110512 Type-indexed Typeable? (on Darwin)
# 2017-03-02: 1231809592 Drift from recent simplifier improvements
# 2017-05-14: 1163821528 (amd64/Linux) Two-pass CmmLayoutStack
]),
],
compile,
......@@ -1100,7 +1105,9 @@ test('T12707',
test('T13379',
[ compiler_stats_num_field('bytes allocated',
[(wordsize(64), 411597856, 10),
[(platform('x86_64-apple-darwin'), 453166912, 10),
# 453166912: add osx-specific after two-pass CmmLayoutStack
(wordsize(64), 411597856, 10),
# initial: 411597856
# widen window to 10%, Darwin had 449080520, a 9.1% difference
]),
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment