Profiling Graphics

Purpose

This page describes a simple way (using Perl) to get a graphical view of the output of the R profiler.

Usual Profiling

Usually, the R profiler outputs numbers that let the user assess which functions are slow and could be improved. See here for an example of the use of the Rprof function. What is missing from that output is any notion of which function calls which, and that information would be useful to figure out why a given function is slow.

Obviously, the sharp R programmer can have a go at crunching the raw output of the profiler (generally the file Rprof.out) to get that sort of information. Each line after the header is one sampled call stack, with the innermost call first.

 > Rprof(); example(glm) ; Rprof(NULL)
 > cat(readLines("Rprof.out"), sep = "\n")
sample.interval=20000
"data.frame" "print" "eval.with.vis" "eval.with.vis" "source" "example" 
"linkinv" "glm.fit" "glm" "eval.with.vis" "eval.with.vis" "source" "example" 
"as.integer" "substr" "print.anova" "print" "source" "example" 
".readRDS" "<Anonymous>" "eval.with.vis" "eval.with.vis" "source" "example" 
"inherits" "is.factor" "match" "%in%" "deparse" "eval" "match.arg" "sort.int" "sort.default" "sort" 
"symnum" "printCoefmat" "print.summary.glm" "print" "source" "example" 
"pmax" "formatC" "paste" "quantile.default" "quantile" "print.summary.glm" "print" "source" "example" 
">" "switch" "residuals.glm" "residuals" "summary.glm" "summary" "eval.with.vis" "eval.with.vis" "source" "example" 

So, as an example, if I want to know which functions “source” is calling, I could torture this file as follows, but this is becoming really cryptic.

 > rl <- strsplit( gsub("\"", "",readLines("Rprof.out")[-1]), " ")
 > funs <- sapply( rl, function(x) {  x[ which(x == "source") - 1 ]   } )
 > table( funs )
funs
eval.with.vis         print 
            4             3 

Graphviz wants to play

The file generated by the profiler already contains all the information about which function calls which other function. We can almost use that information directly and supply it to dot from Graphviz. As an example, a simple dot file generated from the first call stack of the Rprof.out file would look like this:

digraph{ 
  graph [ rankdir = "LR"];
  "example" -> "source" -> "eval.with.vis" -> "eval.with.vis" -> "print" -> "data.frame"  
}

and can be processed by dot into several formats:

dot test.dot -Tsvg > test.svg

:tips:misc:test2.png
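
For readers who would rather stay in R for this first step, here is a minimal sketch of how one stack line could be turned into the dot code above. The helper name stack_to_dot is made up for the illustration and is not part of the Perl script.

```r
# One stack line as written by Rprof (innermost call first).
stack_line <- '"data.frame" "print" "eval.with.vis" "eval.with.vis" "source" "example" '

# Hypothetical helper (not part of Rprof2dot): one stack line -> dot code.
stack_to_dot <- function(line) {
  # scan() splits on whitespace and strips the double quotes.
  funs <- scan(text = line, what = "", quiet = TRUE)
  # Reverse so that the outermost caller comes first, then re-quote.
  chain <- paste(sprintf('"%s"', rev(funs)), collapse = " -> ")
  c("digraph{", '  graph [ rankdir = "LR"];', paste0("  ", chain), "}")
}

dot_code <- stack_to_dot(stack_line)
cat(dot_code, sep = "\n")
```

Calling writeLines(dot_code, "test.dot") instead of cat() would save the result so that it can be passed to the dot command shown earlier.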

Perl wants to play too

We just need something to convert the Rprof.out file into a suitable dot format. We could do that in R, but to add to the fun, let’s use a language that is built to play with text: Perl. A Perl script that crunches the Rprof.out file is already shipped with R, so I took my inspiration from there. The current version of the script can be found here and older versions under here.

Save the script as Rprof2dot in the bin directory of your R installation 1). You can then run it 2) with the following command to generate the dot file.

[romain@fedora tmp]$ R CMD Rprof2dot Rprof.out
digraph {
graph [ rankdir = "LR"]; 
"source" [shape=rect,fontsize=6,label="source\n(7)"] 
"example" [shape=rect,fontsize=6,label="example\n(7)"] 
"eval.with.vis" [shape=rect,fontsize=6,label="eval.with.vis\n(8)"] 
 "example" -> "source" [label=7,fontsize=6]
}

You can now save the output in a dot file, or directly pipe it to the dot command:

[romain@fedora tmp]$ R CMD Rprof2dot Rprof.out > test.dot
[romain@fedora tmp]$ R CMD Rprof2dot Rprof.out | dot -Tpng > test3.png

:tips:misc:test3.png
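
The format of that output is simple enough to reproduce by hand. The following R sketch rebuilds the same dot statements from the counts printed above; the variable names are made up, and the attribute choices are simply copied from the output, not taken from the Perl source.

```r
# Counts copied from the script output shown above.
node_counts <- c(source = 7L, example = 7L, eval.with.vis = 8L)
edges <- data.frame(from = "example", to = "source", n = 7L,
                    stringsAsFactors = FALSE)

# One dot statement per box; '\\n' is a literal backslash-n, which dot
# turns into a line break inside the label.
node_lines <- sprintf('"%s" [shape=rect,fontsize=6,label="%s\\n(%d)"]',
                      names(node_counts), names(node_counts), node_counts)

# One dot statement per arrow, labelled with its count.
edge_lines <- sprintf('"%s" -> "%s" [label=%d,fontsize=6]',
                      edges$from, edges$to, edges$n)

dot_code <- c("digraph {", 'graph [ rankdir = "LR"];',
              node_lines, edge_lines, "}")
cat(dot_code, sep = "\n")
```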

By default, the Perl script keeps only the boxes for functions that are called at least 5 times, but you can change that if you fancy by using the --cutoff flag. For this really simple example we want to see everything, so we set the cutoff to 0, which removes only the functions called fewer than 0 times, that is, none.

[romain@fedora tmp]$ R CMD Rprof2dot --cutoff=0 Rprof.out | dot -Tpng > test4.png

:tips:misc:test4.png
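
To make the effect of the cutoff concrete, here is a small R sketch that counts boxes and arrows for the example profile and applies a cutoff. It assumes that a box count is the number of times a name occurs in the sampled stacks and that an arrow count is the number of adjacent caller/callee pairs. This assumption reproduces the labels 7, 7 and 8 shown above, but it has not been checked against the Perl source.

```r
# The sampled stacks from the example Rprof.out, quotes removed
# (innermost call first on each line).
prof <- c(
  "data.frame print eval.with.vis eval.with.vis source example",
  "linkinv glm.fit glm eval.with.vis eval.with.vis source example",
  "as.integer substr print.anova print source example",
  ".readRDS <Anonymous> eval.with.vis eval.with.vis source example",
  "inherits is.factor match %in% deparse eval match.arg sort.int sort.default sort",
  "symnum printCoefmat print.summary.glm print source example",
  "pmax formatC paste quantile.default quantile print.summary.glm print source example",
  "> switch residuals.glm residuals summary.glm summary eval.with.vis eval.with.vis source example"
)
stacks <- strsplit(prof, " ")

# Box counts: how often each function name occurs in the stacks.
node_counts <- table(unlist(stacks))

# Arrow counts: caller -> callee pairs from adjacent stack entries.
pairs <- unlist(lapply(stacks, function(s) {
  s <- rev(s)                                  # outermost caller first
  if (length(s) < 2) return(character(0))
  paste(s[-length(s)], s[-1], sep = " -> ")
}))
edge_counts <- table(pairs)

cutoff <- 5
kept_nodes <- names(node_counts)[node_counts >= cutoff]
kept_edges <- names(edge_counts)[edge_counts >= cutoff]
print(kept_nodes)
print(kept_edges)
```

With cutoff set to 0, every box and every arrow passes the test, which is why the last graph shows the complete profile.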

Restrict the input

So far I’ve implemented three ways to restrict which data from the profiler’s output is used as input to produce the dot file:

  • The blacklist removes functions
  • The whitelist keeps only a set of functions
  • The restrict keeps a set of functions together with the functions that call them and the functions that they call

Blacklist

Sometimes, you’d like some functions not to appear in the profiling results, especially as the graph gets big. For example, when using anonymous functions such as in a sapply call, the function is reported as <Anonymous>, without any way to distinguish one anonymous function from another. In that case, it is quite useful to just say “I don’t want the <Anonymous> function to be a part of the graphic”.

The script accepts a --blacklist flag that names a file containing a set of undesirable functions or regular expressions. For example, this file:

eval.with.vis
<Anonymous>
sort\.[^"]*
print[^"]*

will result in removing the functions eval.with.vis and <Anonymous>, all functions that match sort\.[^"]* and all functions that match print[^"]* from the profiler output before the dot code is generated. One can then use this file to clean the profiler output:

$ R CMD Rprof2dot --cutoff=0 Rprof.out --blacklist=blacklist | dot -Tpng > test5.png

:tips:misc:test5.png

Because of the way the regex is used in the script, using the dot can be dangerous because it would match the " that ends the word. Therefore I’ve used [^"] here. I agree that this is not pretty, and I will try to find something better. Open for suggestions. One way is to use non-greedy matching, something like sort\..*?, but it would be nice to find something else. Also, this does not allow the use of $ or ^ to specify the beginning or end of the function name.
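
The problem is easier to see on a concrete line. The R sketch below assumes that each blacklist entry is wrapped in double quotes and that every match is deleted from the raw line; this is a guess based on the paragraph above, not taken from the Perl source, and the helper drop_matches is made up for the illustration.

```r
# End of one raw line from the example Rprof.out.
line <- '"match.arg" "sort.int" "sort.default" "sort" '

# Hypothetical helper: delete every quoted name matching one entry.
# perl = TRUE asks R for Perl-compatible regular expressions.
drop_matches <- function(line, entry) {
  gsub(paste0('"', entry, '" ?'), "", line, perl = TRUE)
}

# [^"]* cannot run past the closing quote of a name.
safe <- drop_matches(line, 'sort\\.[^"]*')
# A greedy .* runs to the last quote on the line and also eats "sort".
greedy <- drop_matches(line, 'sort\\..*')
# A non-greedy .*? stops at the first closing quote.
lazy <- drop_matches(line, 'sort\\..*?')

print(safe)
print(greedy)
print(lazy)
```

In this sketch the safe and non-greedy entries leave "match.arg" and "sort" in place, while the greedy one also removes the plain "sort" function, which was never meant to be blacklisted.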

Whitelist

At the other extreme, when someone is only interested in a subset of the functions that appear in the profiler’s output 3), the script allows a whitelist to be declared. The whitelist is a file where each line gives a regex that functions must match in order to be included in the graph.

As an example, we would like to profile the examples in the xyplot help page to know how the grid package is used by lattice (why not?). In that case, we would “whitelist” all the grid functions and print.trellis (which is the function heavily calling grid).

# in R: profile the examples of xyplot
require(lattice)
Rprof( )
example( xyplot, ask = FALSE )
Rprof( NULL )

# in the shell: list of functions in grid ( escaping the dots, skipping names that contain "-" )
cut -f1 /usr/local/lib/R/library/grid/help/AnIndex | sed "s/\./\\\./g" | grep -v "-" > whitelist

# manually add print.trellis ( and escaping the dot )
echo "print\\.trellis" >> whitelist

# making the graph
R CMD Rprof2dot Rprof.out --cutoff=0 --whitelist=whitelist  | dot -Tpng > xyplot.png

:tips:misc:xyplot.png
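
As a rough illustration of what a whitelist does to a single call stack, here is an R sketch. The stack and the helper keep_whitelisted are made up, and unlike the Perl script the sketch works on names that are already split, so it can anchor each regex to the whole function name.

```r
# Made-up call stack (outermost caller first), for illustration only.
stack <- c("example", "print", "print.trellis", "panel.xyplot",
           "grid.points", "grob")

# Whitelist entries, one regex per line of the whitelist file.
whitelist <- c("print\\.trellis", "grid\\..*", "grob")

# Hypothetical helper: keep only the names matching at least one entry.
keep_whitelisted <- function(funs, patterns) {
  # Combine the entries into one alternation and require a full match.
  regex <- paste0("^(", paste(patterns, collapse = "|"), ")$")
  funs[grepl(regex, funs)]
}

kept <- keep_whitelisted(stack, whitelist)
print(kept)
```

The functions that survive stay in their original order, so in the resulting graph print.trellis points directly at the first grid function that is left below it in the stack.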

Restrict

The third way to restrict the data is a bit more complicated, but can be useful. It keeps only a subset of functions and:

  • a given number of functions that are calling them (before them in the call stack). That number can be set to 0, in which case no function before the current function is kept. It can also be set to *, in which case all functions before it are kept
  • a given number of functions that are called by them. This number can also be set to 0 or *.

An example restrict file is given below. It means “keep the grob function and up to 3 functions before and after it”:

grob,3,3

:tips:misc:grob.png
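
Here is a minimal R sketch of how such a rule could be applied to one call stack. The stack is made up, apply_restrict is a hypothetical helper, and dropping stacks that do not contain the function is an assumption of the sketch and not a documented behaviour of the script.

```r
# Made-up call stack (outermost caller first), for illustration only.
stack <- c("example", "source", "print", "print.trellis", "plot.trellis",
           "draw.panel", "grob", "validGrob", "checkNames", "unit", "is.unit")

# Hypothetical helper: apply one "name,before,after" rule to one stack.
apply_restrict <- function(stack, rule) {
  parts <- strsplit(rule, ",", fixed = TRUE)[[1]]
  pos <- match(parts[1], stack)          # first occurrence of the function
  if (is.na(pos)) return(character(0))   # assumption: drop stacks without it
  # "*" means no limit; otherwise keep at most that many neighbours.
  n_before <- if (parts[2] == "*") pos - 1 else as.integer(parts[2])
  n_after  <- if (parts[3] == "*") length(stack) - pos else as.integer(parts[3])
  first <- max(1, pos - n_before)
  last  <- min(length(stack), pos + n_after)
  stack[first:last]
}

print(apply_restrict(stack, "grob,3,3"))   # grob plus 3 callers and 3 callees
print(apply_restrict(stack, "grob,0,*"))   # grob and everything it calls
```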

Combination

restrict can be combined with whitelist or blacklist. In that case, the blacklist or whitelist is applied before the restrict.

What's next

  • Maybe some sort of blacklist: functions that should be removed from Rprof.out before the dot code is computed (<Anonymous> for example). Done.
  • Use a different arrow style (dotted?) when some functions have been removed?
  • Have more options, maybe move to an XML kind of option file.
  • A way to say “I’m only interested in profiling where these functions are involved”: only the children of a given function, only its parents, both, etc. Done in restrict.
  • Hyperlinks to the source code of the function if the output format is svg, or maybe more usefully to the documentation of the function. Do I need to call R to find that information?
  • More information in the boxes. Maybe a sort of graphical insight on the number of times a function is called, self calls, ... See this example on the Graphviz gallery where color is used to show hot spots.
  • A different color (or some other characteristic) for each package. Do I need R to get involved, or can I just crunch data from library/*/help/AnIndex to do that?
  • The same sort of Perl script could generate some XUL code to represent the profiling as a tree in a user interface. The two combined could look quite fancy.
  • Output as a tag cloud as well ?
  • Generate the same sort of graph, not for profiling but for actual dependencies in the source code. That can be a bit tricky ( things like do.call can get in the way). Maybe there could be a way to use Doxygen for that.
  • Distribute it as an R package ?
  • Make subroutines, clean the code
  • You tell me ...

References

1) that is, /usr/local/lib/R/bin/Rprof2dot for a typical Linux install
2) obviously you need Perl to be installed on your machine; you should have it anyway, it’s a fantastic language
3) Typically when building a package, the focus tends to be on the functions of the package
 
tips/misc/profiling.txt · Last modified: 2007/07/04
 