.. _profiling:
Profiling
=========
.. index::
single: profiling
single: cost-centre profiling
single: -p; RTS option
GHC comes with a time and space profiling system, so that you can answer
questions like "why is my program so slow?", or "why is my program using
so much memory?". We'll start by describing how to do time profiling.
Time profiling a program is a three-step process:
1. Re-compile your program for profiling with the :ghc-flag:`-prof` option, and
probably one of the options for adding automatic annotations:
:ghc-flag:`-fprof-late` is the recommended option.
2. Having compiled the program for profiling, you now need to run it to
generate the profile. For example, a simple time profile can be
generated by running the program with ``+RTS -p`` (see :rts-flag:`-p`), which
generates a file named :file:`{prog}.prof` where ⟨prog⟩ is the name of your
program (without the ``.exe`` extension, if you are on Windows).
There are many different kinds of profile that can be generated,
selected by different RTS options. We will be describing the various
kinds of profile throughout the rest of this chapter. Some profiles
require further processing using additional tools after running the
program.
3. Examine the generated profiling information, use the information to
optimise your program, and repeat as necessary.
The time profiler measures the CPU time taken by the Haskell code in your application.
In particular, time taken by safe foreign calls is not tracked by the profiler (see :ref:`prof-foreign-calls`).
.. _cost-centres:
Cost centres and cost-centre stacks
-----------------------------------
GHC's profiling system assigns costs to cost centres. A cost is simply
the time or space (memory) required to evaluate an expression. Cost
centres are program annotations around expressions; all costs incurred
by the annotated expression are assigned to the enclosing cost centre.
Furthermore, GHC will remember the stack of enclosing cost centres for
any given expression at run-time and generate a call-tree of cost
attributions.
Let's take a look at an example: ::
main = print (fib 30)
fib n = if n < 2 then 1 else fib (n-1) + fib (n-2)
Compile and run this program as follows:
.. code-block:: none
$ ghc -prof -fprof-auto -rtsopts Main.hs
$ ./Main +RTS -p
1346269
$
When a GHC-compiled program is run with the :rts-flag:`-p` RTS option, it
generates a file called :file:`{prog}.prof` (here, :file:`Main.prof`). In this
case, the file will contain something like this:
.. code-block:: none
Wed Oct 12 16:14 2011 Time and Allocation Profiling Report (Final)
Main +RTS -p -RTS
total time = 0.68 secs (34 ticks @ 20 ms)
total alloc = 204,677,844 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
fib Main 100.0 100.0
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 102 0 0.0 0.0 100.0 100.0
CAF GHC.IO.Handle.FD 128 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 120 0 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal 110 0 0.0 0.0 0.0 0.0
CAF Main 108 0 0.0 0.0 100.0 100.0
main Main 204 1 0.0 0.0 100.0 100.0
fib Main 205 2692537 100.0 100.0 100.0 100.0
The first part of the file gives the program name and options, and the
total time and total memory allocation measured during the run of the
program (note that the total memory allocation figure isn't the same as
the amount of *live* memory needed by the program at any one time; the
latter can be determined using heap profiling, which we will describe
later in :ref:`prof-heap`).
The second part of the file is a break-down by cost centre of the most
costly functions in the program. In this case, there was only one
significant function in the program, namely ``fib``, and it was
responsible for 100% of both the time and allocation costs of the
program.
The third and final section of the file gives a profile break-down by
cost-centre stack. This is roughly a call-tree profile of the program.
In the example above, it is clear that the costly call to ``fib`` came
from ``main``.
The time and allocation incurred by a given part of the program are
displayed in two ways: “individual”, which is the cost incurred by the
code covered by this cost-centre stack alone, and “inherited”, which
also includes the costs incurred by all the children of this node.
The usefulness of cost-centre stacks is better demonstrated by modifying
the example slightly: ::
main = print (f 30 + g 30)
where
f n = fib n
g n = fib (n `div` 2)
fib n = if n < 2 then 1 else fib (n-1) + fib (n-2)
Compile and run this program as before, and take a look at the new
profiling results:
.. code-block:: none
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 102 0 0.0 0.0 100.0 100.0
CAF GHC.IO.Handle.FD 128 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 120 0 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal 110 0 0.0 0.0 0.0 0.0
CAF Main 108 0 0.0 0.0 100.0 100.0
main Main 204 1 0.0 0.0 100.0 100.0
main.g Main 207 1 0.0 0.0 0.0 0.1
fib Main 208 1973 0.0 0.1 0.0 0.1
main.f Main 205 1 0.0 0.0 100.0 99.9
fib Main 206 2692537 100.0 99.9 100.0 99.9
Now although we had two calls to ``fib`` in the program, it is
immediately clear that it was the call from ``f`` which took all the
time. The functions ``f`` and ``g`` which are defined in the ``where``
clause in ``main`` are given their own cost centres, ``main.f`` and
``main.g`` respectively.
The meaning of the various columns in the output is as follows:

``entries``
    The number of times this particular point in the call tree was
    entered.

``individual %time``
    The percentage of the total run time of the program spent at this
    point in the call tree.

``individual %alloc``
    The percentage of the total memory allocations (excluding profiling
    overheads) of the program made by this call.

``inherited %time``
    The percentage of the total run time of the program spent below this
    point in the call tree.

``inherited %alloc``
    The percentage of the total memory allocations (excluding profiling
    overheads) of the program made by this call and all of its
    sub-calls.
In addition you can use the :rts-flag:`-P` RTS option to get the
following additional information:
``ticks``
The raw number of time “ticks” which were attributed to this
cost-centre; from this, we get the ``%time`` figure mentioned above.
``bytes``
Number of bytes allocated in the heap while in this cost-centre;
again, this is the raw number from which we get the ``%alloc``
figure mentioned above.
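For example, the earlier program can be re-run with ``+RTS -P`` in place of
``+RTS -p`` to add these raw figures to the report (invocation sketch only;
the numbers themselves will vary from run to run):

.. code-block:: none

    $ ./Main +RTS -P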
What about recursive functions, and mutually recursive groups of
functions? Where are the costs attributed? Well, although GHC does keep
information about which groups of functions called each other
recursively, this information isn't displayed in the basic time and
allocation profile. Instead, the call-graph is flattened into a tree as
follows: a call to a function that occurs elsewhere on the current stack
does not push another entry on the stack, instead the costs for this
call are aggregated into the caller [2]_.
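For example, consider a mutually recursive pair (a minimal sketch; compile
with ``-fprof-auto`` as in the earlier examples): ::

    -- Each binding gets its own cost centre, but because the two functions
    -- call each other recursively, a call back to a function that is already
    -- on the stack does not push a new entry; its costs are aggregated into
    -- the caller, so the profile shows a small flattened tree rather than a
    -- stack a million entries deep.
    isEven, isOdd :: Int -> Bool
    isEven 0 = True
    isEven n = isOdd (n - 1)

    isOdd 0 = False
    isOdd n = isEven (n - 1)

    main :: IO ()
    main = print (isEven 1000000)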
.. _scc-pragma:
Inserting cost centres by hand
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Cost centres are just program annotations. When you say ``-fprof-auto``
to the compiler, it automatically inserts a cost centre annotation
around every binding not marked INLINE in your program, but you are
entirely free to add cost-centre annotations yourself. Be careful about adding
too many cost-centre annotations, however: the optimiser is careful not to move
them around or remove them, which can severely affect how your program is
optimised and hence its runtime performance!
The syntax of a cost centre annotation for expressions is ::
{-# SCC "name" #-}
where ``"name"`` is an arbitrary string, that will become the name of
your cost centre as it appears in the profiling output, and
```` is any Haskell expression. An ``SCC`` annotation extends as
far to the right as possible when parsing, having the same precedence as lambda
abstractions, let expressions, and conditionals. Additionally, an annotation
may not appear in a position where it would change the grouping of
subexpressions::
a = 1 / 2 / 2 -- accepted (a=0.25)
b = 1 / {-# SCC "name" #-} 2 / 2 -- rejected (instead of b=1.0)
This restriction is required to maintain the property that inserting a pragma,
just like inserting a comment, does not have unintended effects on the
semantics of the program, in accordance with `GHC Proposal #176
<https://github.com/ghc-proposals/ghc-proposals/blob/master/proposals/0176-scc-parsing.rst>`__.
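An annotation placed at the start of an expression therefore covers everything
to its right; for example (a minimal sketch): ::

    -- The annotation has the same precedence as a lambda, so it extends over
    -- the whole right-hand side f x + g y, not just over f x.
    h :: Int -> Int -> Int
    h x y = {-# SCC "whole_rhs" #-} f x + g y
      where
        f = (* 2)
        g = (+ 1)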
SCC stands for "Set Cost Centre". The double quotes can be omitted if ``name``
is a Haskell identifier starting with a lowercase letter, for example: ::
{-# SCC id #-}
Cost-centre annotations can also appear at the top level or in a
declaration context. In that case the annotation names a binding
defined in the same module or scope. Example: ::
f x y = ...
where
g z = ...
{-# SCC g #-}
{-# SCC f #-}
If you want to give a cost centre a different name than the function name,
you can pass a string to the annotation ::
f x y = ...
{-# SCC f "cost_centre_name" #-}
Here is an example of a program with a couple of SCCs: ::
main :: IO ()
main = do let xs = [1..1000000]
let ys = [1..2000000]
print $ {-# SCC last_xs #-} last xs
print $ {-# SCC last_init_xs #-} last (init xs)
print $ {-# SCC last_ys #-} last ys
print $ {-# SCC last_init_ys #-} last (init ys)
which gives this profile when run:
.. code-block:: none
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 102 0 0.0 0.0 100.0 100.0
CAF GHC.IO.Handle.FD 130 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 122 0 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal 111 0 0.0 0.0 0.0 0.0
CAF Main 108 0 0.0 0.0 100.0 100.0
main Main 204 1 0.0 0.0 100.0 100.0
last_init_ys Main 210 1 25.0 27.4 25.0 27.4
main.ys Main 209 1 25.0 39.2 25.0 39.2
last_ys Main 208 1 12.5 0.0 12.5 0.0
last_init_xs Main 207 1 12.5 13.7 12.5 13.7
main.xs Main 206 1 18.8 19.6 18.8 19.6
last_xs Main 205 1 6.2 0.0 6.2 0.0
.. _prof-rules:
Rules for attributing costs
~~~~~~~~~~~~~~~~~~~~~~~~~~~
While running a program with profiling turned on, GHC maintains a
cost-centre stack behind the scenes, and attributes any costs (memory
allocation and time) to whatever the current cost-centre stack is at the
time the cost is incurred.
The mechanism is simple: whenever the program evaluates an expression
with an SCC annotation, ``{-# SCC c #-} E``, the cost centre ``c`` is
pushed on the current stack, and the entry count for this stack is
incremented by one. The stack also sometimes has to be saved and
restored; in particular when the program creates a thunk (a lazy
suspension), the current cost-centre stack is stored in the thunk, and
restored when the thunk is evaluated. In this way, the cost-centre stack
is independent of the actual evaluation order used by GHC at runtime.
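For instance (a minimal sketch, with hypothetical cost-centre names): ::

    -- The thunk bound to xs saves the cost-centre stack that is current when
    -- it is created. When sum later forces it, that stack is restored and the
    -- "build" centre is pushed, so the cost of producing the list is
    -- attributed under build even though the demand comes from inside the
    -- consume annotation.
    main :: IO ()
    main = do
      let xs = {-# SCC build #-} map (* 2) [1 .. 1000000 :: Int]
      print ({-# SCC consume #-} sum xs)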
At a function call, GHC takes the stack stored in the function being
called (which for a top-level function will be empty), and *appends* it
to the current stack, ignoring any prefix that is identical to a prefix
of the current stack.
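To see how stacks are appended (a minimal sketch, with hypothetical names): ::

    -- work has its own cost centre, and its costs are attributed separately
    -- for each caller: under a stack ending in .../first/work for the first
    -- call and .../second/work for the second, with the common prefix of the
    -- two stacks shared rather than duplicated.
    work :: Int -> Int
    work n = {-# SCC work #-} sum [1 .. n]

    main :: IO ()
    main = do
      print ({-# SCC first #-} work 1000000)
      print ({-# SCC second #-} work 2000000)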
We mentioned earlier that lazy computations, i.e. thunks, capture the
current stack when they are created, and restore this stack when they
are evaluated. What about top-level thunks? They are "created" when the
program is compiled, so what stack should we give them? The technical
name for a top-level thunk is a CAF ("Constant Applicative Form"). GHC
assigns every CAF in a module a stack consisting of the single cost
centre ``M.CAF``, where ``M`` is the name of the module. It is also
possible to give each CAF a different stack, using the option
:ghc-flag:`-fprof-cafs`. This is especially useful when
compiling with :ghc-flag:`-ffull-laziness` (as is default with :ghc-flag:`-O`
and higher), as constants in function bodies will be lifted to the top-level
and become CAFs. You will probably need to consult the Core
(:ghc-flag:`-ddump-simpl`) in order to determine what these CAFs correspond to.
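As a small illustration (a sketch; the names are made up): ::

    -- table is a top-level thunk, i.e. a CAF. By default its evaluation costs
    -- are attributed to the module's single CAF cost centre; compiling with
    -- -fprof-cafs instead gives it a CAF cost centre of its own, making it
    -- easier to see which CAF the costs belong to.
    table :: [Int]
    table = map (^ (2 :: Int)) [1 .. 100000]

    main :: IO ()
    main = print (sum table)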
.. index::
single: -fprof-cafs
.. _prof-foreign-calls:
Profiling and foreign calls
---------------------------
Simply put, the profiler includes time spent in unsafe foreign
calls but ignores time taken in safe foreign calls. For example, time spent blocked on IO
operations (e.g. ``getLine``) is not accounted for in the profile as ``getLine`` is implemented
using a safe foreign call.
The profiler estimates CPU time for Haskell threads within the program only.
In particular, time "taken" by the program in blocking safe foreign calls
is not accounted for in time profiles. The runtime has the notion of a virtual
processor which is known as a "capability". Haskell threads are run on capabilities,
and the profiler samples the capabilities in order to determine what is being
executed at a certain time. When a safe foreign call is executed, it's run outside
the context of a capability; hence the sampling does not account for the time
taken. Whilst the safe call is being executed, other
Haskell threads are free to run on the capability, and their costs are
attributed to them by the profiler. When the safe call finishes, the
descheduled thread can be resumed and rescheduled.
However, the time taken by blocking on unsafe foreign calls is accounted for in the profile.
This happens because unsafe foreign calls are executed by the same capability
their calling Haskell thread is running on. Therefore, an unsafe foreign call will
block the entire capability whilst it is running, and any time the capability is
sampled the "cost" of the foreign call will be attributed to the calling cost-centre stack.
However, do note that unsafe foreign calls must not be used for operations
that block! Do not be tempted to replace your safe foreign calls with unsafe
calls just so that they appear in the profile: an unsafe call prevents garbage
collection from running until the foreign call returns, which can be
catastrophic for performance.
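To make the distinction concrete, here is a small, purely illustrative sketch
(POSIX-only; it imports ``sin`` and ``sleep`` from the C library): ::

    {-# LANGUAGE ForeignFunctionInterface #-}

    import Foreign.C.Types (CDouble, CUInt)

    -- Time spent in the unsafe call is attributed to the calling cost-centre
    -- stack, because it runs on the calling capability. Time spent blocked in
    -- the safe call does not show up in the time profile at all.
    foreign import ccall unsafe "math.h sin"
      c_sin :: CDouble -> CDouble

    foreign import ccall safe "unistd.h sleep"
      c_sleep :: CUInt -> IO CUInt

    main :: IO ()
    main = do
      print (sum (map c_sin [0 .. 1000000]))
      _ <- c_sleep 1   -- blocks for one second, invisible to the profiler
      pure ()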
.. _prof-compiler-options:
Compiler options for profiling
------------------------------
.. index::
single: profiling; options
single: options; for profiling
.. ghc-flag:: -prof
:shortdesc: Turn on profiling
:type: dynamic
:category:
To make use of the profiling system *all* modules must be compiled
and linked with the :ghc-flag:`-prof` option. Any ``SCC`` annotations you've
put in your source will spring to life.
Without a :ghc-flag:`-prof` option, your ``SCC``\ s are ignored; so you can
compile ``SCC``-laden code without changing it.
.. ghc-flag:: -fno-prof-count-entries
:shortdesc: Do not collect entry counts
:type: dynamic
:reverse: -fprof-count-entries
:category:
Tells GHC not to collect information about how often functions are
entered at runtime (the "entries" column of the time profile), for
this module. This tends to make the profiled code run faster, and
hence closer to the speed of the unprofiled code, because GHC is
able to optimise more aggressively if it doesn't have to maintain
correct entry counts. This option can be useful if you aren't
interested in the entry counts (for example, if you only intend to
do heap profiling).
There are a few other profiling-related compilation options. Use them
*in addition to* :ghc-flag:`-prof`. These do not have to be used consistently
for all modules in a program.
Automatically placing cost-centres
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
GHC has a number of flags for automatically inserting cost-centres into the
compiled program. Use these options carefully because inserting too many cost-centres
in the wrong places will mean the optimiser will be less effective and the runtime behaviour
of your profiled program will be different to that of the unprofiled one.
.. ghc-flag:: -fprof-callers=⟨name⟩
:shortdesc: Auto-add ``SCC``\\ s to all call-sites of the named function.
:type: dynamic
:category:
Automatically enclose all occurrences of the named function in an ``SCC``.
Note that these cost-centres are added late in compilation (after
simplification) and consequently the names may be slightly different than
they appear in the source program (e.g. a call to ``f`` may be inlined with
its wrapper, resulting in an occurrence of its worker, ``$wf``).
In addition to plain module-qualified names (e.g. ``GHC.Base.map``),
⟨name⟩ also accepts a small globbing language using ``*`` as a wildcard
symbol:
.. code-block:: none
pattern := <module> '.' <identifier>
module := '*'
        | <module name>
identifier :=
        ident
For instance, the following are all valid patterns: