.. _profiling:
Profiling
=========
.. index::
single: profiling
single: cost-centre profiling
single: -p; RTS option
GHC comes with a time and space profiling system, so that you can answer
questions like "why is my program so slow?", or "why is my program using
so much memory?". We'll start by describing how to do time profiling.
Time profiling a program is a three-step process:
1. Re-compile your program for profiling with the :ghc-flag:`-prof` option, and
probably one of the options for adding automatic annotations:
:ghc-flag:`-fprof-late` is the recommended option.
2. Having compiled the program for profiling, you now need to run it to
generate the profile. For example, a simple time profile can be
generated by running the program with ``+RTS -p`` (see :rts-flag:`-p`), which
generates a file named :file:`{prog}.prof` where ⟨prog⟩ is the name of your
program (without the ``.exe`` extension, if you are on Windows).
There are many different kinds of profile that can be generated,
selected by different RTS options. We will be describing the various
kinds of profile throughout the rest of this chapter. Some profiles
require further processing using additional tools after running the
program.
3. Examine the generated profiling information, use the information to
optimise your program, and repeat as necessary.
The time profiler measures the CPU time taken by the Haskell code in your application.
In particular, time taken by safe foreign calls is not tracked by the profiler (see :ref:`prof-foreign-calls`).
.. _cost-centres:
Cost centres and cost-centre stacks
-----------------------------------
GHC's profiling system assigns costs to cost centres. A cost is simply
the time or space (memory) required to evaluate an expression. Cost
centres are program annotations around expressions; all costs incurred
by the annotated expression are assigned to the enclosing cost centre.
Furthermore, GHC will remember the stack of enclosing cost centres for
any given expression at run-time and generate a call-tree of cost
attributions.
Let's take a look at an example: ::
main = print (fib 30)
fib n = if n < 2 then 1 else fib (n-1) + fib (n-2)
Compile and run this program as follows:
.. code-block:: none
$ ghc -prof -fprof-auto -rtsopts Main.hs
$ ./Main +RTS -p
1346269
$
When a GHC-compiled program is run with the :rts-flag:`-p` RTS option, it
generates a file called :file:`{prog}.prof` (here, :file:`Main.prof`). In this
case, the file will contain something like this:
.. code-block:: none
Wed Oct 12 16:14 2011 Time and Allocation Profiling Report (Final)
Main +RTS -p -RTS
total time = 0.68 secs (34 ticks @ 20 ms)
total alloc = 204,677,844 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
fib Main 100.0 100.0
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 102 0 0.0 0.0 100.0 100.0
CAF GHC.IO.Handle.FD 128 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 120 0 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal 110 0 0.0 0.0 0.0 0.0
CAF Main 108 0 0.0 0.0 100.0 100.0
main Main 204 1 0.0 0.0 100.0 100.0
fib Main 205 2692537 100.0 100.0 100.0 100.0
The first part of the file gives the program name and options, and the
total time and total memory allocation measured during the run of the
program (note that the total memory allocation figure isn't the same as
the amount of *live* memory needed by the program at any one time; the
latter can be determined using heap profiling, which we will describe
later in :ref:`prof-heap`).
The second part of the file is a break-down by cost centre of the most
costly functions in the program. In this case, there was only one
significant function in the program, namely ``fib``, and it was
responsible for 100% of both the time and allocation costs of the
program.
The third and final section of the file gives a profile break-down by
cost-centre stack. This is roughly a call-tree profile of the program.
In the example above, it is clear that the costly call to ``fib`` came
from ``main``.
The time and allocation incurred by a given part of the program are
displayed in two ways: “individual”, which is the cost incurred by the
code covered by this cost-centre stack alone, and “inherited”, which
also includes the costs incurred by all the children of this node.
The usefulness of cost-centre stacks is better demonstrated by modifying
the example slightly: ::
main = print (f 30 + g 30)
where
f n = fib n
g n = fib (n `div` 2)
fib n = if n < 2 then 1 else fib (n-1) + fib (n-2)
Compile and run this program as before, and take a look at the new
profiling results:
.. code-block:: none
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 102 0 0.0 0.0 100.0 100.0
CAF GHC.IO.Handle.FD 128 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 120 0 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal 110 0 0.0 0.0 0.0 0.0
CAF Main 108 0 0.0 0.0 100.0 100.0
main Main 204 1 0.0 0.0 100.0 100.0
main.g Main 207 1 0.0 0.0 0.0 0.1
fib Main 208 1973 0.0 0.1 0.0 0.1
main.f Main 205 1 0.0 0.0 100.0 99.9
fib Main 206 2692537 100.0 99.9 100.0 99.9
Now although we had two calls to ``fib`` in the program, it is
immediately clear that it was the call from ``f`` which took all the
time. The functions ``f`` and ``g`` which are defined in the ``where``
clause in ``main`` are given their own cost centres, ``main.f`` and
``main.g`` respectively.
The meaning of the various columns in the output is as follows:

``entries``
    The number of times this particular point in the call tree was
    entered.

``individual %time``
    The percentage of the total run time of the program spent at this
    point in the call tree.

``individual %alloc``
    The percentage of the total memory allocations (excluding profiling
    overheads) of the program made by this call.

``inherited %time``
    The percentage of the total run time of the program spent below this
    point in the call tree.

``inherited %alloc``
    The percentage of the total memory allocations (excluding profiling
    overheads) of the program made by this call and all of its
    sub-calls.
In addition you can use the :rts-flag:`-P` RTS option to get the
following additional information:
``ticks``
The raw number of time “ticks” which were attributed to this
cost-centre; from this, we get the ``%time`` figure mentioned above.
``bytes``
Number of bytes allocated in the heap while in this cost-centre;
again, this is the raw number from which we get the ``%alloc``
figure mentioned above.
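For example, the earlier program can be re-run with ``+RTS -P`` in place of
``+RTS -p`` to add these raw figures to the report (invocation sketch only;
the numbers themselves will vary from run to run):

.. code-block:: none

    $ ./Main +RTS -P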
What about recursive functions, and mutually recursive groups of
functions? Where are the costs attributed? Well, although GHC does keep
information about which groups of functions called each other
recursively, this information isn't displayed in the basic time and
allocation profile. Instead, the call-graph is flattened into a tree as
follows: a call to a function that occurs elsewhere on the current stack
does not push another entry on the stack, instead the costs for this
call are aggregated into the caller [2]_.
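For example, consider a mutually recursive pair (a minimal sketch; compile
with ``-fprof-auto`` as in the earlier examples): ::

    -- Each binding gets its own cost centre, but because the two functions
    -- call each other recursively, a call back to a function that is already
    -- on the stack does not push a new entry; its costs are aggregated into
    -- the caller, so the profile shows a small flattened tree rather than a
    -- stack a million entries deep.
    isEven, isOdd :: Int -> Bool
    isEven 0 = True
    isEven n = isOdd (n - 1)

    isOdd 0 = False
    isOdd n = isEven (n - 1)

    main :: IO ()
    main = print (isEven 1000000)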
.. _scc-pragma:
Inserting cost centres by hand
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Cost centres are just program annotations. When you say ``-fprof-auto``
to the compiler, it automatically inserts a cost centre annotation
around every binding not marked INLINE in your program, but you are
entirely free to add cost-centre annotations yourself. Be careful about adding
too many cost-centre annotations, however: the optimiser is careful not to move
them around or remove them, which can severely affect how your program is
optimised and hence its runtime performance!
The syntax of a cost centre annotation for expressions is ::
{-# SCC "name" #-}
where ``"name"`` is an arbitrary string, that will become the name of
your cost centre as it appears in the profiling output, and
```` is any Haskell expression. An ``SCC`` annotation extends as
far to the right as possible when parsing, having the same precedence as lambda
abstractions, let expressions, and conditionals. Additionally, an annotation
may not appear in a position where it would change the grouping of
subexpressions::
a = 1 / 2 / 2 -- accepted (a=0.25)
b = 1 / {-# SCC "name" #-} 2 / 2 -- rejected (instead of b=1.0)
This restriction is required to maintain the property that inserting a pragma,
just like inserting a comment, does not have unintended effects on the
semantics of the program, in accordance with `GHC Proposal #176
<https://github.com/ghc-proposals/ghc-proposals/blob/master/proposals/0176-scc-parsing.rst>`__.
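An annotation placed at the start of an expression therefore covers everything
to its right; for example (a minimal sketch): ::

    -- The annotation has the same precedence as a lambda, so it extends over
    -- the whole right-hand side f x + g y, not just over f x.
    h :: Int -> Int -> Int
    h x y = {-# SCC "whole_rhs" #-} f x + g y
      where
        f = (* 2)
        g = (+ 1)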
SCC stands for "Set Cost Centre". The double quotes can be omitted if ``name``
is a Haskell identifier starting with a lowercase letter, for example: ::
{-# SCC id #-}
Cost-centre annotations can also appear at the top level or in a
declaration context. In that case the annotation names a binding
defined in the same module or scope. Example: ::
f x y = ...
where
g z = ...
{-# SCC g #-}
{-# SCC f #-}
If you want to give a cost centre a different name than the function name,
you can pass a string to the annotation ::
f x y = ...
{-# SCC f "cost_centre_name" #-}
Here is an example of a program with a couple of SCCs: ::
main :: IO ()
main = do let xs = [1..1000000]
let ys = [1..2000000]
print $ {-# SCC last_xs #-} last xs
print $ {-# SCC last_init_xs #-} last (init xs)
print $ {-# SCC last_ys #-} last ys
print $ {-# SCC last_init_ys #-} last (init ys)
which gives this profile when run:
.. code-block:: none
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 102 0 0.0 0.0 100.0 100.0
CAF GHC.IO.Handle.FD 130 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 122 0 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal 111 0 0.0 0.0 0.0 0.0
CAF Main 108 0 0.0 0.0 100.0 100.0
main Main 204 1 0.0 0.0 100.0 100.0
last_init_ys Main 210 1 25.0 27.4 25.0 27.4
main.ys Main 209 1 25.0 39.2 25.0 39.2
last_ys Main 208 1 12.5 0.0 12.5 0.0
last_init_xs Main 207 1 12.5 13.7 12.5 13.7
main.xs Main 206 1 18.8 19.6 18.8 19.6
last_xs Main 205 1 6.2 0.0 6.2 0.0
.. _prof-rules:
Rules for attributing costs
~~~~~~~~~~~~~~~~~~~~~~~~~~~
While running a program with profiling turned on, GHC maintains a
cost-centre stack behind the scenes, and attributes any costs (memory
allocation and time) to whatever the current cost-centre stack is at the
time the cost is incurred.
The mechanism is simple: whenever the program evaluates an expression
with an SCC annotation, ``{-# SCC c #-} E``, the cost centre ``c`` is
pushed on the current stack, and the entry count for this stack is
incremented by one. The stack also sometimes has to be saved and
restored; in particular when the program creates a thunk (a lazy
suspension), the current cost-centre stack is stored in the thunk, and
restored when the thunk is evaluated. In this way, the cost-centre stack
is independent of the actual evaluation order used by GHC at runtime.
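For instance (a minimal sketch, with hypothetical cost-centre names): ::

    -- The thunk bound to xs saves the cost-centre stack that is current when
    -- it is created. When sum later forces it, that stack is restored and the
    -- "build" centre is pushed, so the cost of producing the list is
    -- attributed under build even though the demand comes from inside the
    -- consume annotation.
    main :: IO ()
    main = do
      let xs = {-# SCC build #-} map (* 2) [1 .. 1000000 :: Int]
      print ({-# SCC consume #-} sum xs)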
At a function call, GHC takes the stack stored in the function being
called (which for a top-level function will be empty), and *appends* it
to the current stack, ignoring any prefix that is identical to a prefix
of the current stack.
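To see how stacks are appended (a minimal sketch, with hypothetical names): ::

    -- work has its own cost centre, and its costs are attributed separately
    -- for each caller: under a stack ending in .../first/work for the first
    -- call and .../second/work for the second, with the common prefix of the
    -- two stacks shared rather than duplicated.
    work :: Int -> Int
    work n = {-# SCC work #-} sum [1 .. n]

    main :: IO ()
    main = do
      print ({-# SCC first #-} work 1000000)
      print ({-# SCC second #-} work 2000000)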
We mentioned earlier that lazy computations, i.e. thunks, capture the
current stack when they are created, and restore this stack when they
are evaluated. What about top-level thunks? They are "created" when the
program is compiled, so what stack should we give them? The technical
name for a top-level thunk is a CAF ("Constant Applicative Form"). GHC
assigns every CAF in a module a stack consisting of the single cost
centre ``M.CAF``, where ``M`` is the name of the module. It is also
possible to give each CAF a different stack, using the option
:ghc-flag:`-fprof-cafs`. This is especially useful when
compiling with :ghc-flag:`-ffull-laziness` (as is default with :ghc-flag:`-O`
and higher), as constants in function bodies will be lifted to the top-level
and become CAFs. You will probably need to consult the Core
(:ghc-flag:`-ddump-simpl`) in order to determine what these CAFs correspond to.
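As a small illustration (a sketch; the names are made up): ::

    -- table is a top-level thunk, i.e. a CAF. By default its evaluation costs
    -- are attributed to the module's single CAF cost centre; compiling with
    -- -fprof-cafs instead gives it a CAF cost centre of its own, making it
    -- easier to see which CAF the costs belong to.
    table :: [Int]
    table = map (^ (2 :: Int)) [1 .. 100000]

    main :: IO ()
    main = print (sum table)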
.. index::
single: -fprof-cafs
.. _prof-foreign-calls:
Profiling and foreign calls
---------------------------
Simply put, the profiler includes time spent in unsafe foreign
calls but ignores time taken in safe foreign calls. For example, time spent blocked on IO
operations (e.g. ``getLine``) is not accounted for in the profile as ``getLine`` is implemented
using a safe foreign call.
The profiler estimates CPU time for Haskell threads within the program only.
In particular, time "taken" by the program in blocking safe foreign calls
is not accounted for in time profiles. The runtime has the notion of a virtual
processor which is known as a "capability". Haskell threads are run on capabilities,
and the profiler samples the capabilities in order to determine what is being
executed at a certain time. When a safe foreign call is executed, it's run outside
the context of a capability; hence the sampling does not account for the time
taken. Whilst the safe call is being executed, other
Haskell threads are free to run on the capability, and their costs are
attributed to them by the profiler. When the safe call finishes, the
descheduled thread can be resumed and rescheduled.
However, the time taken by blocking on unsafe foreign calls is accounted for in the profile.
This happens because unsafe foreign calls are executed by the same capability
their calling Haskell thread is running on. Therefore, an unsafe foreign call will
block the entire capability whilst it is running, and any time the capability is
sampled the "cost" of the foreign call will be attributed to the calling cost-centre stack.
However, do note that unsafe foreign calls must not be used for operations
that block! Do not be tempted to replace your safe foreign calls with unsafe
calls just so that they appear in the profile: an unsafe call prevents garbage
collection from running until the foreign call returns, which can be
catastrophic for performance.
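To make the distinction concrete, here is a small, purely illustrative sketch
(POSIX-only; it imports ``sin`` and ``sleep`` from the C library): ::

    {-# LANGUAGE ForeignFunctionInterface #-}

    import Foreign.C.Types (CDouble, CUInt)

    -- Time spent in the unsafe call is attributed to the calling cost-centre
    -- stack, because it runs on the calling capability. Time spent blocked in
    -- the safe call does not show up in the time profile at all.
    foreign import ccall unsafe "math.h sin"
      c_sin :: CDouble -> CDouble

    foreign import ccall safe "unistd.h sleep"
      c_sleep :: CUInt -> IO CUInt

    main :: IO ()
    main = do
      print (sum (map c_sin [0 .. 1000000]))
      _ <- c_sleep 1   -- blocks for one second, invisible to the profiler
      pure ()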
.. _prof-compiler-options:
Compiler options for profiling
------------------------------
.. index::
single: profiling; options
single: options; for profiling
.. ghc-flag:: -prof
:shortdesc: Turn on profiling
:type: dynamic
:category:
To make use of the profiling system *all* modules must be compiled
and linked with the :ghc-flag:`-prof` option. Any ``SCC`` annotations you've
put in your source will spring to life.
Without a :ghc-flag:`-prof` option, your ``SCC``\ s are ignored; so you can
compile ``SCC``-laden code without changing it.
.. ghc-flag:: -fno-prof-count-entries
:shortdesc: Do not collect entry counts
:type: dynamic
:reverse: -fprof-count-entries
:category:
Tells GHC not to collect information about how often functions are
entered at runtime (the "entries" column of the time profile), for
this module. This tends to make the profiled code run faster, and
hence closer to the speed of the unprofiled code, because GHC is
able to optimise more aggressively if it doesn't have to maintain
correct entry counts. This option can be useful if you aren't
interested in the entry counts (for example, if you only intend to
do heap profiling).
There are a few other profiling-related compilation options. Use them
*in addition to* :ghc-flag:`-prof`. These do not have to be used consistently
for all modules in a program.
Automatically placing cost-centres
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
GHC has a number of flags for automatically inserting cost-centres into the
compiled program. Use these options carefully because inserting too many cost-centres
in the wrong places will mean the optimiser will be less effective and the runtime behaviour
of your profiled program will be different to that of the unprofiled one.
.. ghc-flag:: -fprof-callers=⟨name⟩
:shortdesc: Auto-add ``SCC``\\ s to all call-sites of the named function.
:type: dynamic
:category:
Automatically enclose all occurrences of the named function in an ``SCC``.
Note that these cost-centres are added late in compilation (after
simplification) and consequently the names may be slightly different than
they appear in the source program (e.g. a call to ``f`` may be inlined with
its wrapper, resulting in an occurrence of its worker, ``$wf``).
In addition to plain module-qualified names (e.g. ``GHC.Base.map``),
⟨name⟩ also accepts a small globbing language using ``*`` as a wildcard
symbol:
.. code-block:: none
pattern := <module> '.' <identifier>
module := '*'
        | <module name>
identifier :=
        ident
For instance, the following are all valid patterns: