Introduction to R

Vincent Zoonekynd 2005-08-28

Features

Statistics versus Signal processing

It is a statistical software: contrary to other numerical computation software (Scilab, Octave), it already provides functions to perform non trivial statistical operations, be they classic (regression, logistic regression, analysis of variance (anova), decision trees, principal component analysis, etc.) or more modern (neural networks, bootstrap, generalized additive models (GAM), mixed models, etc.).

However, sometimes, in statistics, we use a lot of signal processing algorithms (Fourrier transform, wavelet transforms, etc.); if this is your case, you might find Scilab or Matlab more appropriate.

GUI -- or lack thereof

It is a real programming language, not a point-and-click software (in French I have a good neologism to describe those “point-and-click” programs: “cliquodromes” – if someone knows of a good English translation, he is welcome): we are not limited by the software designers’ imagination, we can use it to any means. (why don’t you just use “cliquodrome”? everybody speaking English understands it!)

If you really need a simple and powerful GUI, have a look at R Commander, mentionned later in this document. For a list of GUIs, check the using-gui tips.

Speed issues

R is an interpreted language: the advantage, is that we spend less time writing code, the drawback is that the computations are slower – but unless you wish to do real-time computations on the National Insurance files, taking into account the whole British population, it is sufficiently fast.

If speed is really an issue, you can think to write loop intensive code of your custom function in C and then link the library to R by an easy interface function: .C Actually R is as fast as Matlab an SAS in many application, it just need some trick (like in every other language) to achieve such high performance.

May be you’d like to have a look at other efficient software like SAS (commercial, expensive, dating back to the mainframe era), DAP (free but far from complete: it was initially designed as a free replacement for SAS, but turned out to be a C library to perform statistical computations) or you can program everything yourself (in C or C++), with the help of a few libraries.

TODO list of (C, C++, Java) libraries for linear algebra and statistical computations.

or (it is the approach I would advise) you can start to program in R, until you have a slow but correct program, profile your program (it means: find where it spends most of its time), try to change the algorithm so that it be faster and, if it fails, implement in C the most computation-intensive parts. If you try to optimize your program too early, you run the risk of either losing your time by optimizing parts of the implementation that have no real importance on the total time and getting a program with an awkward structure, very difficult to modify or extend – or simply giving up your project before its completion.

Memory issues

There is another problem in R: it tends to load all the data into memory (this is slowly changing, starting with R version 2). For very large data, R might not be the best choice: I recently spoke to a statistician working with genetic data: their data were too large for R, too large for SAS and they had to resort to home-made programs. Yet, I said above, R might be a platform of choice for prototyping, i.e., to write the first versions of an application, while we know neither which algorithms to choose, nor how long the computations will take. Furthermore, as R can access most databases (some, such as PostgreSQL, even allow you to write stored procedures in R), you can store your data there and only retrieve the bits you need, when you need them. Personnaly, the first time I ran into memory problems once, I was playing with a time series containing several minutes of music.

Graphics

R can produce graphics, but it is not a presentation software: for informative pictures, for graphical data exploration, it is fine, but if you want to use them in an advert, to impress or deceive your customers, you might have to process them through other software – for instance, the title image of this document was created with R, PovRay and Gimp.

Freedom

Finally, R is a free software (”free” as in “free speech” – and also, incidentally, as in “free beer”). But beware: some (rare) libraries are not free but just “freely useable for non commercial purposes”: be sure to read the licence.

Examples

Installing R

Windows

You should really consider Linux instead of Windows. People using R on Windows sometimes have problems: especially memory problems (Windows specialists tell me that this operating system becomes unstable when a process wants to use more than 1Gb of memory – and I confirm the problems) and installation problems (if you want to install a package, you will have to find a binary version of it, for the very version of R you have installed – alternatively, you can try to compile the packages from source, but this is very tricky, because Windows lacks all the tools for this: you end up installing Unix utilities on your Windows machine, but you will also run into version problems), anyway there are Windows build of library updated very often.

Not convinced?

 > library(fortunes)
 > fortune("install")
 
 Benjamin Lloyd-Hughes: Has anyone had any joy getting the rgdal package to
 compile under windows?
 Roger Bivand: The closest anyone has got so far is Hisaji Ono, who used MSYS
 (http://www.mingw.org/) to build PROJ.4 and GDAL (GDAL depends on PROJ.4,
 PROJ.4 needs a PATH to metadata files for projection and transformation), and
 then hand-pasted the paths to the GDAL headers and library into src/Makevars,
 running Rcmd INSTALL rgdal at the Windows command prompt as usual. All of this
 can be repeated, but is not portable, and does not suit the very valuable
 standard binary package build system for Windows. Roughly: [points 1 to 5 etc
 omitted]
 
 Barry Rowlingson: At some point the complexity of installing things like this
 for Windows will cross the complexity of installing Linux... (PS excepting
 live-Linux installs like Knoppix)
 
    -- Benjamin Lloyd-Hughes, Roger Bivand and Barry Rowlingson
       R-help (August 2004)

Still not convinced? Well, good luck (you will need it). You can download a Windows version of R from http://cran.r-project.org/bin/windows/base/

[This must have been written a long time ago since not only are there no significant problems with the Windows version of R of the sort described but its particularly easy to install since there is a GUI-based professional installer and once you install it it has a nice multi-window user interface. If you are writing packages such as those that appear on CRAN you will have to download some additional utilities but that is just a one time thing and if you are just developing code to source() which is what most people do, not packages, then you don’t even need to do that. -gg]

Linux: installing a binary version of R

I am using Mandriva Linux 2005 LE (formerly known as Mandrake 10.2), but it should be the same with most distributions.

To install a binary version, just type

urpmi R-base R-bioconductor

That is all. No need to roam the web to find where you can download the software, no need to answer lengthy questions about where to install the software, no need to do anything yourself.

Linux: installing R from the sources

I proceed as follows:

# Installing R, from source
urpmi gcc-g77 readline-devel libgraphviz7-devel libgmp3-devel
urpmi perl-libxml-perl liblapack3-devel lapack-devel
urpmi xorg-x11-100dpi-fonts; service xfs restart

wget http://cran.r-project.org/src/base/R-2/R-2.1.1.tar.gz
tar zxvf R-*.tar.gz
cd R-*/
./configure
make
make test
make install

# Installing the shared library libRmath (needed to compile some
# third-party programs, such as JAGS)
cd src/nmath/standalone
make
make install
echo /usr/local/lib >> /etc/ld.so.conf
ldconfig
cd ../../../../

# Installing ALL the CRAN packages
wget -r -l 1 -np -nc -x -k -E -U Mozilla -p -R zip www.stats.bris.ac.uk/R/src/contrib/
for i in www.stats.bris.ac.uk/R/src/contrib/*.tar.gz
do
  R CMD INSTALL $i
done

# Installing a few BioConductor packages
urpmi graphviz libglib1.2-devel libgraphviz7-devel
wget -r -l 2 http://www.bioconductor.org/packages/bioc/stable/src/contrib/html/
wget http://www.bioconductor.org/repository/release1.5/package/Source/Rgraphviz_1.5.0.tar.gz
wget http://www.bioconductor.org/repository/release1.5/package/Source/graph_1.5.1.tar.gz
wget http://www.bioconductor.org/repository/release1.5/package/Source/Ruuid_1.5.0.tar.gz
wget http://www.bioconductor.org/repository/release1.5/package/Source/Biobase_1.5.0.tar.gz
R CMD INSTALL Ruuid_1.5.0.tar.gz
R CMD INSTALL Biobase_1.5.0.tar.gz
R CMD INSTALL graph_1.5.1.tar.gz
R CMD INSTALL Rgraphviz_1.5.0.tar.gz
echo /usr/lib/graphviz/ >> /etc/ld.so.conf
ldconfig
ln -s /usr/lib/graphviz/*.so* /usr/local/lib/ # Should not be necessary, but...

# Still a few more...
wget http://www.math.csi.cuny.edu/Statistics/R/simpleR/Simple_0.5.tar.gz
R CMD INSTALL Simple_0.5.tar.gz

Mac OS

R: Documentation

First steps

The following document explains how to use R, progressively, at an elementary level (if the expressions “standard deviation” and “gaussian distribution” do not frighten you, it will be fine).

Another elementary document, with exercises, for German-speaking people.

RTFM

The reference manual is available under R: just type the name of the command whose reference you want prefixed by an interrogation mark.

 > ?round

For the reserved words or the non-alphanumeric commands, use quotes.

 > ?"for"
 > ?"[["

If you do not know the command name (this happens very often), use help.search() or apropos().

 > help.search("stem")
 > apropos("stem")

The first one, help.search(), looks everywhere, especially in all the packages that are not loaded, while the second one, apropos(), looks in the search path, i.e., in all the functions and variables that are currently available.

Some of you might prefer to read HTML (R launches your default web browser – in my case, konqueror).

 > help.start()

If you are connected to the internet, you can also have a look at the R-help mailing list

You can search through it from within R with the RSiteSearch() function.

 > RSiteSearch("density estimation")
 A search query has been submitted to http://search.r-project.org
 The results page should open in your browser shortly

There are also a few PDF files in /usr/lib/R/librarydoc*.tar.gz do

echo Installing $i >> nohup.out
echo Installing $i
nohup R CMD INSTALL $i

done </code>

It might crash for a few libraries: quite often, the problem comes from a missing development package (as for readline, above): the required libraries are there, but only with the files required to run programs using them (the lib*.so files), not those required to compiled programs (the *.h files).

Then, we compute the pictures of the documentation of all the packages.

#! perl -w
use strict;
my $n = 0;
  
# Writing R files
mkdir "Rdoc" || die "Cannot mkdir Rdoc/: $!";
my @libraries= `ls lib/R/library/`;
foreach my $lib (@libraries) {
  chomp($lib);
  print STDERR "Processing library \"$lib\"\n";
  print STDERR `pwd`;
  my @pages = grep /\.R$/, `ls lib/R/library/$lib/R-ex/`;
  chdir "Rdoc" || die "Cannot chdir to Rdoc: $!";
  mkdir "$lib" || die "Cannot mkdir $lib: $!";
  chdir "$lib" || die "Cannot chdir to $lib: $!";
  open(M, '>', "Makefile") || die "Cannot open Makefile for writing: $!";
  print M "all:\n";
  foreach my $page (@pages) {
    chomp($page);
    print STDERR "  Processing man page \"$page\" in library \"$lib\"\n";
    my $res = "";
    $res .= "library($lib)\n";
    $res .= "library(ts)##VZ##\n";
    $res .= "library(lattice)##VZ##\n";
    $res .= "library(nls)##VZ##\n";
    $res .= "library(nlme)##VZ##\n";
    $res .= "library(MASS)##VZ##\n";
    $res .= "identify <- function (...) {}##VZ##\n";
    $res .= "x11()##VZ##\n";
    open(P, '<', "../../lib/R/library/$lib/R-ex/$page") || 
      die "Cannot open lib/R/library/$lib/R-ex/$page for reading: $!";
    # Tag the lines where we must copy the screen (between two commands)
    while(<P>) {
      s/^([^ #}])/try(invisible(dev.print(png,width=600,height=600,bg="white",filename="doc$n.png")))##VZ##\n$1/
        && $n++;
      $res .= $_;
    }
    $res .= "try(invisible(dev.print(png,width=600,height=600,bg=\"white\",filename=\"doc$n.png\")))##VZ##\n"; $n++;
    close P;
    # We discard the line in the following cases:
    #   The previous line ends with a comma, an opening bracket, an "equal" sign
    $res =~ s/[,(=+]\s*\n.*##VZ##.*?\n/,\n/g;
    # The previous line is empty
    $res =~ s/^\s*\n.*##VZ##.*\n/\n/gm;
    # The previous line only contains a comment
    $res =~ s/^(\s*#.*)\n.*##VZ##.*\n/$1\n/gm;
    # The nest line starts with a {    TODO: check (boot / abc.ci)
    $res =~ s/^.*##VZ##.*\n\s*\{/\{/gm;
    # We write the corresponding number
    $res =~ s/doc([0-9]).png/doc00000$1.png/g;
    $res =~ s/doc([0-9][0-9]).png/doc0000$1.png/g;
    $res =~ s/doc([0-9][0-9][0-9]).png/doc000$1.png/g;
    $res =~ s/doc([0-9][0-9][0-9][0-9]).png/doc00$1.png/g;
    $res =~ s/doc([0-9][0-9][0-9][0-9][0-9]).png/doc0$1.png/g;
    open(W, ">", "${lib}_${page}") || die "Cannot open ${lib}_${page} for writing: $!";
    print W $res;
    close W;
    print M "\tR --vanilla <${lib}_${page} >${lib}_${page}.out\n";
    my $p = $page; 
    $p =~ s/\.R$//;
    system 'cp', "../../lib/R/library/$lib/html/$p.html", "${lib}_$p.html";
  }
  print M "\ttouch all\n";
  close(M);
  chdir "../../" || die "Cannot chdir to ../../: $!";
}

we compile them (it should take a few hours: for some packages, it may crash).

cd Rdoc/
for i in *
do
  (
    cd $i
    make
  )
done

I prefer to parallelize all this:

killall xscreensaver
\ls -d */ | perl -p -e 's/(.*)/cd $1; make/' | perl fork.pl 5

where fork.pl allows you to launch several processes at the same time, but not too many:

#! /usr/bin/perl -w
use strict;
my $MAX_PROCESSES = shift || 10;
use Parallel::ForkManager;
my $pm = new Parallel::ForkManager($MAX_PROCESSES);
while(<>){
  my $pid = $pm->start and next; 
  system($_);
  $pm->finish; # Terminates the child process
}

We clean the PNG files thus generated and we write the HTML files,

for i in */
do
  (
    cd $i
    perl ../do.it_2.pl
  )
done

Where do.it_2.pl contains:

#! perl -w
use strict;
  
# Delete the empty or duplicated PNG files
print STDERR "Computing checksums\n";
use Digest::MD5 qw(md5);
my %checksum;
foreach my $f (sort(<*.png>)) {
  if( -z $f ) {
    unlink $f;
    next;
  }
  local $/;
  open(F, '<', $f) || warn "Cannot open $f for reading: $!";
  my $m = md5(<F>);
  close F;
  if( exists $checksum{$m} ){
    unlink $f;
  } else {
    $checksum{$m} = $f;
  }
}
  
# Turn all this into HTML
print STDERR "Converting to HTML\n";
open(HTML, '>', "ALL.html") || warn "Cannot open ALL.html for writing: *!";
select(HTML);
print "<html>\n";
print "<head><title>R</title></head>\n";
print "<body>\n";
foreach my $f (<*.R.out>) {
  my $page = $f;
  $page =~ s/\.R.out$//;
  # Read the initial HTML file
  if( open(H, '<', "$page.html") ){
    my $a = join '', <H>;
    close H;
    $a =~ s#^.*?<body>##gsi;
    $a =~ s#<hr>.*?$##gsi;
    print $a;
  } else {
    warn "Cannot open $page.html for reading: $!";
  }
  open(F, '<', $f) || warn "Cannot open $f for reading: $!";
  #print "<h1>$f</h1>\n";
  print "<h2>Worked out examples</h2>\n";
  print "<pre>\n";
  my $header=1;
  while(<F>) {
    if($header) {
      $header=0 if m/to quit R/;
      next;
    }
    if( m/(doc.*png).*##VZ##/ ){
      my $png = $1;
      next unless -f $png;
      print "</pre>\n";
      print "<img width=600 height=600 src=\"$png\">\n";
      print "<pre>\n";
    }
    next if m/##VZ##/;
    next if m/^>\s*###---/;
    next if m/^>\s*##\s*___/;
    next if m/^>\s*##\s*(alias|keywords)/i;
    s/\&/\&amp;/g;
    s/</&lt;/g;
    print;
  }
  close F;
  print "</pre>\n";
  print "<hr>\n";
}
print "</body>\n";
print "</html>\n";
close HTML;

For an unknown reason, the PNG files have a transparent background: I turn it into a white background with ImageMagick (the white.png file is a white PNG file, of the same size, 600×600, created with The Gimp) – here as well, it is pretty long...

for i in */*png
do
  echo $i
  composite $i white.png $i
done

I do not take into account the potential links in the HTML files (not very clean, but...).

perl -p -i.bak -e 's#<a\s.*?>##gi; s#</a>##gi' **/ALL.html

The HTML files should rather be called index.html:

rename 's/ALL.html/index.html/' */ALL.html

where “rename” is the program:

#!/usr/bin/perl -w
use strict;
my $reg = shift @ARGV;
foreach (@ARGV) {
  my $old = $_;
  eval "$reg;";
  die $@ if $@;
  rename $old, $_;
}

here is a very small piece of the result (2.7Mb, 268 pictures, while the whole documentation reaches 93Mb and 2078 images – remember, this was R 1.6 – it is probably much more now): Rdoc/index.html

In particular, one might be interested in the packages whose documentation contains the highest number of graphics.

% find -name "*.png"  | perl -p -e 's#^\./##;s#/.*##' | sort | uniq -c | sort -n | tail -20
   26 sm
   27 cluster
   29 ade4
   29 nls
   29 spatial
   31 mgcv
   34 car
   38 cobs
   41 MPV
   44 mclust
   45 MASS
   45 vcd
   49 grid
   60 gregmisc
   64 pastecs
   70 qtl
   77 splancs
   88 strucchange
   90 waveslim
  113 spdep

Editor, Graphical interface -- R for non-statisticians and non-programmers

ESS

A page of ESS tips.

It is the statistical mode of Emacs (it may be automatically installed with (X)Emacs: under Mandriva Linux (formerly Mandrake), it is with XEmacs, it is not with Emacs). One may then edit code with automatic indentication and syntax highlighting (Emacs recognises the files from their extension).

You can even run R under Emacs (M+XR).

Windows-specific stuff

Under Windows, R has a graphical interface – it is just a makeshift replacement to accomodate for the lack of a decent terminal under Windows: it is not a menu-driven interface.

You might still need a text editor, though. Some people advise Tinn-R, an R-aware simple text editor: http://www.sciviews.org/Tinn-R/.

R Commander

This really is a menu-oriented interface to R, that allows you to navigate through your data sets, to perform simple statistical analyses,

If you do not need anything beyond regression (but if you need all the graphical diagnostics that should be performed after a regression) and simple tests, or simply, if you have to teach statistics to non-statisticians and non-programmers, it seems to me an excellent choice. Users who later want to unleash the full power of R as a real programming language will face a gentle transition: R Commander displays all the commands that are run in the background to perform the tests and regressions and to produce the plots.

TODO more screenshots...

Zelig

One big problem with R, and one big difference with menu-driven software, is that you never know where the feature you need is – and this is even more true if you do not know the feature in question (you cannot wander through the menus). The typical example is “logistic regression”: the main function to perform logistic regression is “glm”. This stands for “Generalized Linear Models” (do not worry, you are not supposed to know what it is) and the manual page does not even give an example of a logistic regression. If you know the theory behond logistic regression, if you know that it is a special case of GLM, you might be fine – but most users do not know, or do not want to.

The Zelig library fulfils the needs of such people by providing a simplified and uniform interface to most statistical procedures: http://gking.harvard.edu/zelig/.

Basically, you only need to know five functions: zelig() (to fit a model), setx() to change the data (for predictions), sim() to simulate new values (i.e., to predict the values), summary() and plot().

TODO Give an example...

library(Zelig)
d <- data.frame(y = crabs$sp, x1 = crabs$FL, x2 = crabs$RW)
r <- zelig(y ~ x1 + x2, model = "probit", data = d)
summary(r)
op <- par(mfrow = c(2,2), mar = c(4,4,3,2))
plot(r)
par(op)

N <- dim(d)[1]
new.x1 <- rnorm(N, mean(d$x1), sd(d$x1))
new.x2 <- rnorm(N, mean(d$x2), sd(d$x2))
s <- sim(r, setx(r, data = data.frame(x1 = new.x1, x2 = new.x2)))
plot(s)

The model can be:

To predict a quantitative variable:
  ls      Least Squares
  normal  (almost the same)
  gamma   Gamma regression

To predict a binary variable:  
  logit   Logistic regression
  relogit Logistic regressioni for rare events
  probit  Probit regression

To predict two binary variables: 
  blogit  Bivariate logistic regression
  bprobit Bivariate probit regression

To predict a qualitative variable
  mlogit  Multinomial logistic
  
To predict an ordered qualitative variable: 
  ologit  Ordinal logistic regression
  oprobit Ordinal probit regression
 
To predict count data:
  poisson Poisson regression
  negbin  Negative binomial regression (as Poisson regression, but
          with more dispersed observations)

To predict survival data:
  exp      Exponential model (the hazard rate is constant)
  weibull  Weibull model (the hazard rate increases with time)
  lognorm  Log-normal model (the hazard rate increases and then decreases)

The unification of the various statistical models is a great thing, but this library does not put enough emphasis on the graphics to be looked at before, during and after the analysis – as opposed to R Commander.

To do

There are other user-friendly interfaces to R, which I have not had time to investigate (yet).

R: Some elementary functions

In this part, we present the simplest functions of R, to allow you to read data (or to simulate them – it is so easy, you end up doing it all the time, to check if your algorithms are correct (i.e., if they behave as expected if the assumptions you made on your data are satisfied) before applying them to real data) and to numerically or graphically explore them.

In the following, the “>” at the begining of the lines is R’s prompt – you should not type it.

Statistics, computations:

 > # 20 numbers, between 0 and 20
 > # rounded at 1e-1
 > x <- round(runif(20, 0, 20), digits = 1)
 > x
  [1] 10.0  1.6  2.5 15.2  3.1 12.6 19.4  6.1  9.2 10.9  9.5 14.1 14.3 14.3 12.8
 [16] 15.9  0.1 13.1  8.5  8.7
 
 > min(x)
 [1] 0.1
 
 > max(x)
 [1] 19.4
 
 > median(x)  # median
 [1] 10.45
 
 > mean(x)    # mean
 [1] 10.095
 
 > var(x)     # variance
 [1] 27.43734
 
 > sd(x)      # standard deviation 
 [1] 5.238067
 
 > sqrt(var(x))
 [1] 5.238067
 
 > rank(x)    # rank
  [1] 10.0  2.0  3.0 18.0  4.0 12.0 20.0  5.0  8.0 11.0  9.0 15.0 16.5 16.5 13.0
 [16] 19.0  1.0 14.0  6.0  7.0
 
 > sum(x)
 [1] 201.9
 
 > length(x)
 [1] 20
 
 > round(x)
  [1] 10  2  2 15  3 13 19  6  9 11 10 14 14 14 13 16  0 13  8  9
 
 > fivenum(x)     # quantiles
 [1]  0.10  7.30 10.45 14.20 19.40
 
 > quantile(x)    # quantiles (different convention)
    0%   25%   50%   75%  100%
  0.10  7.90 10.45 14.15 19.40
 
 > quantile(x, c(0,.33,.66,1))
     0%    33%    66%   100%
  0.100  8.835 12.962 19.400
 
 > mad(x)         # normalized mean deviation to the median ("median average distance")
 [1] 5.55975
 
 > cummax(x)
  [1] 10.0 10.0 10.0 15.2 15.2 15.2 19.4 19.4 19.4 19.4 19.4 19.4 19.4 19.4 19.4
 [16] 19.4 19.4 19.4 19.4 19.4
 
 > cummin(x)
  [1] 10.0  1.6  1.6  1.6  1.6  1.6  1.6  1.6  1.6  1.6  1.6  1.6  1.6  1.6  1.6
 [16]  1.6  0.1  0.1  0.1  0.1
 
 > cor(x, sin(x/20))  # correlation
 [1] 0.997286

Plot a histogram

x <- rnorm(100)
hist(x, col = "light blue")

Display a scatter plot of two variables

n <- 100
x <- rnorm(n)
y <- x + rnorm(n)
plot(y ~ x)

Add a “regression line” to that scatter plot.

n <- 100
x <- rnorm(n)
y <- x + rnorm(n)
plot(y ~ x)
abline(lm(y ~ x), col = "red")

Print a message or a variable

print("Hello World!")

Concatenate character strings

paste("min:        ", min(x$DS1, na.rm = TRUE)))

Write in a file (or to the screen)

cat("\\end{document}\n", file = "RESULT.tex", append = TRUE)
 
guides/stats-with-r/01intro-to-r.txt · Last modified: 2009/02/20
 
Recent changes RSS feed R Wiki powered by Driven by DokuWiki and optimized for Firefox Creative Commons License