— Vincent Zoonekynd 2005-08-28
It is a statistical software: contrary to other numerical computation software (Scilab, Octave), it already provides functions to perform non trivial statistical operations, be they classic (regression, logistic regression, analysis of variance (anova), decision trees, principal component analysis, etc.) or more modern (neural networks, bootstrap, generalized additive models (GAM), mixed models, etc.).
However, sometimes, in statistics, we use a lot of signal processing algorithms (Fourrier transform, wavelet transforms, etc.); if this is your case, you might find Scilab or Matlab more appropriate.
It is a real programming language, not a point-and-click software (in French I have a good neologism to describe those “point-and-click” programs: “cliquodromes” – if someone knows of a good English translation, he is welcome): we are not limited by the software designers’ imagination, we can use it to any means. (why don’t you just use “cliquodrome”? everybody speaking English understands it!)
If you really need a simple and powerful GUI, have a look at R Commander, mentionned later in this document. For a list of GUIs, check the using-gui tips.
R is an interpreted language: the advantage, is that we spend less time writing code, the drawback is that the computations are slower – but unless you wish to do real-time computations on the National Insurance files, taking into account the whole British population, it is sufficiently fast.
If speed is really an issue, you can think to write loop intensive code of your custom function in C and then link the library to R by an easy interface function: .C Actually R is as fast as Matlab an SAS in many application, it just need some trick (like in every other language) to achieve such high performance.
May be you’d like to have a look at other efficient software like SAS (commercial, expensive, dating back to the mainframe era), DAP (free but far from complete: it was initially designed as a free replacement for SAS, but turned out to be a C library to perform statistical computations) or you can program everything yourself (in C or C++), with the help of a few libraries.
list of (C, C++, Java) libraries for linear algebra and statistical computations.
or (it is the approach I would advise) you can start to program in R, until you have a slow but correct program, profile your program (it means: find where it spends most of its time), try to change the algorithm so that it be faster and, if it fails, implement in C the most computation-intensive parts. If you try to optimize your program too early, you run the risk of either losing your time by optimizing parts of the implementation that have no real importance on the total time and getting a program with an awkward structure, very difficult to modify or extend – or simply giving up your project before its completion.
There is another problem in R: it tends to load all the data into memory (this is slowly changing, starting with R version 2). For very large data, R might not be the best choice: I recently spoke to a statistician working with genetic data: their data were too large for R, too large for SAS and they had to resort to home-made programs. Yet, I said above, R might be a platform of choice for prototyping, i.e., to write the first versions of an application, while we know neither which algorithms to choose, nor how long the computations will take. Furthermore, as R can access most databases (some, such as PostgreSQL, even allow you to write stored procedures in R), you can store your data there and only retrieve the bits you need, when you need them. Personnaly, the first time I ran into memory problems once, I was playing with a time series containing several minutes of music.
R can produce graphics, but it is not a presentation software: for informative pictures, for graphical data exploration, it is fine, but if you want to use them in an advert, to impress or deceive your customers, you might have to process them through other software – for instance, the title image of this document was created with R, PovRay and Gimp.
Finally, R is a free software (”free” as in “free speech” – and also, incidentally, as in “free beer”). But beware: some (rare) libraries are not free but just “freely useable for non commercial purposes”: be sure to read the licence.
Some people have started to set up R galleries:
You should really consider Linux instead of Windows. People using R on Windows sometimes have problems: especially memory problems (Windows specialists tell me that this operating system becomes unstable when a process wants to use more than 1Gb of memory – and I confirm the problems) and installation problems (if you want to install a package, you will have to find a binary version of it, for the very version of R you have installed – alternatively, you can try to compile the packages from source, but this is very tricky, because Windows lacks all the tools for this: you end up installing Unix utilities on your Windows machine, but you will also run into version problems), anyway there are Windows build of library updated very often.
Not convinced?
> library(fortunes) > fortune("install") Benjamin Lloyd-Hughes: Has anyone had any joy getting the rgdal package to compile under windows? Roger Bivand: The closest anyone has got so far is Hisaji Ono, who used MSYS (http://www.mingw.org/) to build PROJ.4 and GDAL (GDAL depends on PROJ.4, PROJ.4 needs a PATH to metadata files for projection and transformation), and then hand-pasted the paths to the GDAL headers and library into src/Makevars, running Rcmd INSTALL rgdal at the Windows command prompt as usual. All of this can be repeated, but is not portable, and does not suit the very valuable standard binary package build system for Windows. Roughly: [points 1 to 5 etc omitted] Barry Rowlingson: At some point the complexity of installing things like this for Windows will cross the complexity of installing Linux... (PS excepting live-Linux installs like Knoppix) -- Benjamin Lloyd-Hughes, Roger Bivand and Barry Rowlingson R-help (August 2004)
Still not convinced? Well, good luck (you will need it). You can download a Windows version of R from http://cran.r-project.org/bin/windows/base/
[This must have been written a long time ago since not only are there no significant problems with the Windows version of R of the sort described but its particularly easy to install since there is a GUI-based professional installer and once you install it it has a nice multi-window user interface. If you are writing packages such as those that appear on CRAN you will have to download some additional utilities but that is just a one time thing and if you are just developing code to source() which is what most people do, not packages, then you don’t even need to do that. -gg]
I am using Mandriva Linux 2005 LE (formerly known as Mandrake 10.2), but it should be the same with most distributions.
To install a binary version, just type
urpmi R-base R-bioconductor
That is all. No need to roam the web to find where you can download the software, no need to answer lengthy questions about where to install the software, no need to do anything yourself.
I proceed as follows:
# Installing R, from source urpmi gcc-g77 readline-devel libgraphviz7-devel libgmp3-devel urpmi perl-libxml-perl liblapack3-devel lapack-devel urpmi xorg-x11-100dpi-fonts; service xfs restart wget http://cran.r-project.org/src/base/R-2/R-2.1.1.tar.gz tar zxvf R-*.tar.gz cd R-*/ ./configure make make test make install # Installing the shared library libRmath (needed to compile some # third-party programs, such as JAGS) cd src/nmath/standalone make make install echo /usr/local/lib >> /etc/ld.so.conf ldconfig cd ../../../../ # Installing ALL the CRAN packages wget -r -l 1 -np -nc -x -k -E -U Mozilla -p -R zip www.stats.bris.ac.uk/R/src/contrib/ for i in www.stats.bris.ac.uk/R/src/contrib/*.tar.gz do R CMD INSTALL $i done # Installing a few BioConductor packages urpmi graphviz libglib1.2-devel libgraphviz7-devel wget -r -l 2 http://www.bioconductor.org/packages/bioc/stable/src/contrib/html/ wget http://www.bioconductor.org/repository/release1.5/package/Source/Rgraphviz_1.5.0.tar.gz wget http://www.bioconductor.org/repository/release1.5/package/Source/graph_1.5.1.tar.gz wget http://www.bioconductor.org/repository/release1.5/package/Source/Ruuid_1.5.0.tar.gz wget http://www.bioconductor.org/repository/release1.5/package/Source/Biobase_1.5.0.tar.gz R CMD INSTALL Ruuid_1.5.0.tar.gz R CMD INSTALL Biobase_1.5.0.tar.gz R CMD INSTALL graph_1.5.1.tar.gz R CMD INSTALL Rgraphviz_1.5.0.tar.gz echo /usr/lib/graphviz/ >> /etc/ld.so.conf ldconfig ln -s /usr/lib/graphviz/*.so* /usr/local/lib/ # Should not be necessary, but... # Still a few more... wget http://www.math.csi.cuny.edu/Statistics/R/simpleR/Simple_0.5.tar.gz R CMD INSTALL Simple_0.5.tar.gz
The following document explains how to use R, progressively, at an elementary level (if the expressions “standard deviation” and “gaussian distribution” do not frighten you, it will be fine).
Another elementary document, with exercises, for German-speaking people.
The reference manual is available under R: just type the name of the command whose reference you want prefixed by an interrogation mark.
> ?round
For the reserved words or the non-alphanumeric commands, use quotes.
> ?"for" > ?"[["
If you do not know the command name (this happens very often), use help.search() or apropos().
> help.search("stem") > apropos("stem")
The first one, help.search(), looks everywhere, especially in all the packages that are not loaded, while the second one, apropos(), looks in the search path, i.e., in all the functions and variables that are currently available.
Some of you might prefer to read HTML (R launches your default web browser – in my case, konqueror).
> help.start()
If you are connected to the internet, you can also have a look at the R-help mailing list
You can search through it from within R with the RSiteSearch() function.
> RSiteSearch("density estimation") A search query has been submitted to http://search.r-project.org The results page should open in your browser shortly
There are also a few PDF files in /usr/lib/R/librarydoc*.tar.gz do
echo Installing $i >> nohup.out echo Installing $i nohup R CMD INSTALL $i
done </code>
It might crash for a few libraries: quite often, the problem comes from a missing development package (as for readline, above): the required libraries are there, but only with the files required to run programs using them (the lib*.so files), not those required to compiled programs (the *.h files).
Then, we compute the pictures of the documentation of all the packages.
#! perl -w use strict; my $n = 0; # Writing R files mkdir "Rdoc" || die "Cannot mkdir Rdoc/: $!"; my @libraries= `ls lib/R/library/`; foreach my $lib (@libraries) { chomp($lib); print STDERR "Processing library \"$lib\"\n"; print STDERR `pwd`; my @pages = grep /\.R$/, `ls lib/R/library/$lib/R-ex/`; chdir "Rdoc" || die "Cannot chdir to Rdoc: $!"; mkdir "$lib" || die "Cannot mkdir $lib: $!"; chdir "$lib" || die "Cannot chdir to $lib: $!"; open(M, '>', "Makefile") || die "Cannot open Makefile for writing: $!"; print M "all:\n"; foreach my $page (@pages) { chomp($page); print STDERR " Processing man page \"$page\" in library \"$lib\"\n"; my $res = ""; $res .= "library($lib)\n"; $res .= "library(ts)##VZ##\n"; $res .= "library(lattice)##VZ##\n"; $res .= "library(nls)##VZ##\n"; $res .= "library(nlme)##VZ##\n"; $res .= "library(MASS)##VZ##\n"; $res .= "identify <- function (...) {}##VZ##\n"; $res .= "x11()##VZ##\n"; open(P, '<', "../../lib/R/library/$lib/R-ex/$page") || die "Cannot open lib/R/library/$lib/R-ex/$page for reading: $!"; # Tag the lines where we must copy the screen (between two commands) while(<P>) { s/^([^ #}])/try(invisible(dev.print(png,width=600,height=600,bg="white",filename="doc$n.png")))##VZ##\n$1/ && $n++; $res .= $_; } $res .= "try(invisible(dev.print(png,width=600,height=600,bg=\"white\",filename=\"doc$n.png\")))##VZ##\n"; $n++; close P; # We discard the line in the following cases: # The previous line ends with a comma, an opening bracket, an "equal" sign $res =~ s/[,(=+]\s*\n.*##VZ##.*?\n/,\n/g; # The previous line is empty $res =~ s/^\s*\n.*##VZ##.*\n/\n/gm; # The previous line only contains a comment $res =~ s/^(\s*#.*)\n.*##VZ##.*\n/$1\n/gm; # The nest line starts with a { TODO: check (boot / abc.ci) $res =~ s/^.*##VZ##.*\n\s*\{/\{/gm; # We write the corresponding number $res =~ s/doc([0-9]).png/doc00000$1.png/g; $res =~ s/doc([0-9][0-9]).png/doc0000$1.png/g; $res =~ s/doc([0-9][0-9][0-9]).png/doc000$1.png/g; $res =~ s/doc([0-9][0-9][0-9][0-9]).png/doc00$1.png/g; $res =~ s/doc([0-9][0-9][0-9][0-9][0-9]).png/doc0$1.png/g; open(W, ">", "${lib}_${page}") || die "Cannot open ${lib}_${page} for writing: $!"; print W $res; close W; print M "\tR --vanilla <${lib}_${page} >${lib}_${page}.out\n"; my $p = $page; $p =~ s/\.R$//; system 'cp', "../../lib/R/library/$lib/html/$p.html", "${lib}_$p.html"; } print M "\ttouch all\n"; close(M); chdir "../../" || die "Cannot chdir to ../../: $!"; }
we compile them (it should take a few hours: for some packages, it may crash).
cd Rdoc/
for i in *
do
(
cd $i
make
)
done
I prefer to parallelize all this:
killall xscreensaver \ls -d */ | perl -p -e 's/(.*)/cd $1; make/' | perl fork.pl 5
where fork.pl allows you to launch several processes at the same time, but not too many:
#! /usr/bin/perl -w use strict; my $MAX_PROCESSES = shift || 10; use Parallel::ForkManager; my $pm = new Parallel::ForkManager($MAX_PROCESSES); while(<>){ my $pid = $pm->start and next; system($_); $pm->finish; # Terminates the child process }
We clean the PNG files thus generated and we write the HTML files,
for i in */
do
(
cd $i
perl ../do.it_2.pl
)
done
Where do.it_2.pl contains:
#! perl -w use strict; # Delete the empty or duplicated PNG files print STDERR "Computing checksums\n"; use Digest::MD5 qw(md5); my %checksum; foreach my $f (sort(<*.png>)) { if( -z $f ) { unlink $f; next; } local $/; open(F, '<', $f) || warn "Cannot open $f for reading: $!"; my $m = md5(<F>); close F; if( exists $checksum{$m} ){ unlink $f; } else { $checksum{$m} = $f; } } # Turn all this into HTML print STDERR "Converting to HTML\n"; open(HTML, '>', "ALL.html") || warn "Cannot open ALL.html for writing: *!"; select(HTML); print "<html>\n"; print "<head><title>R</title></head>\n"; print "<body>\n"; foreach my $f (<*.R.out>) { my $page = $f; $page =~ s/\.R.out$//; # Read the initial HTML file if( open(H, '<', "$page.html") ){ my $a = join '', <H>; close H; $a =~ s#^.*?<body>##gsi; $a =~ s#<hr>.*?$##gsi; print $a; } else { warn "Cannot open $page.html for reading: $!"; } open(F, '<', $f) || warn "Cannot open $f for reading: $!"; #print "<h1>$f</h1>\n"; print "<h2>Worked out examples</h2>\n"; print "<pre>\n"; my $header=1; while(<F>) { if($header) { $header=0 if m/to quit R/; next; } if( m/(doc.*png).*##VZ##/ ){ my $png = $1; next unless -f $png; print "</pre>\n"; print "<img width=600 height=600 src=\"$png\">\n"; print "<pre>\n"; } next if m/##VZ##/; next if m/^>\s*###---/; next if m/^>\s*##\s*___/; next if m/^>\s*##\s*(alias|keywords)/i; s/\&/\&/g; s/</</g; print; } close F; print "</pre>\n"; print "<hr>\n"; } print "</body>\n"; print "</html>\n"; close HTML;
For an unknown reason, the PNG files have a transparent background: I turn it into a white background with ImageMagick (the white.png file is a white PNG file, of the same size, 600×600, created with The Gimp) – here as well, it is pretty long...
for i in */*png do echo $i composite $i white.png $i done
I do not take into account the potential links in the HTML files (not very clean, but...).
perl -p -i.bak -e 's#<a\s.*?>##gi; s#</a>##gi' **/ALL.html
The HTML files should rather be called index.html:
rename 's/ALL.html/index.html/' */ALL.html
where “rename” is the program:
#!/usr/bin/perl -w
use strict;
my $reg = shift @ARGV;
foreach (@ARGV) {
my $old = $_;
eval "$reg;";
die $@ if $@;
rename $old, $_;
}
here is a very small piece of the result (2.7Mb, 268 pictures, while the whole documentation reaches 93Mb and 2078 images – remember, this was R 1.6 – it is probably much more now): Rdoc/index.html
In particular, one might be interested in the packages whose documentation contains the highest number of graphics.
% find -name "*.png" | perl -p -e 's#^\./##;s#/.*##' | sort | uniq -c | sort -n | tail -20 26 sm 27 cluster 29 ade4 29 nls 29 spatial 31 mgcv 34 car 38 cobs 41 MPV 44 mclust 45 MASS 45 vcd 49 grid 60 gregmisc 64 pastecs 70 qtl 77 splancs 88 strucchange 90 waveslim 113 spdep
A page of ESS tips.
It is the statistical mode of Emacs (it may be automatically installed with (X)Emacs: under Mandriva Linux (formerly Mandrake), it is with XEmacs, it is not with Emacs). One may then edit code with automatic indentication and syntax highlighting (Emacs recognises the files from their extension).
You can even run R under Emacs (M+XR).
Under Windows, R has a graphical interface – it is just a makeshift replacement to accomodate for the lack of a decent terminal under Windows: it is not a menu-driven interface.
You might still need a text editor, though. Some people advise Tinn-R, an R-aware simple text editor: http://www.sciviews.org/Tinn-R/.
This really is a menu-oriented interface to R, that allows you to navigate through your data sets, to perform simple statistical analyses,
If you do not need anything beyond regression (but if you need all the graphical diagnostics that should be performed after a regression) and simple tests, or simply, if you have to teach statistics to non-statisticians and non-programmers, it seems to me an excellent choice. Users who later want to unleash the full power of R as a real programming language will face a gentle transition: R Commander displays all the commands that are run in the background to perform the tests and regressions and to produce the plots.
more screenshots...
One big problem with R, and one big difference with menu-driven software, is that you never know where the feature you need is – and this is even more true if you do not know the feature in question (you cannot wander through the menus). The typical example is “logistic regression”: the main function to perform logistic regression is “glm”. This stands for “Generalized Linear Models” (do not worry, you are not supposed to know what it is) and the manual page does not even give an example of a logistic regression. If you know the theory behond logistic regression, if you know that it is a special case of GLM, you might be fine – but most users do not know, or do not want to.
The Zelig library fulfils the needs of such people by providing a simplified and uniform interface to most statistical procedures: http://gking.harvard.edu/zelig/.
Basically, you only need to know five functions: zelig() (to fit a model), setx() to change the data (for predictions), sim() to simulate new values (i.e., to predict the values), summary() and plot().
Give an example...
library(Zelig) d <- data.frame(y = crabs$sp, x1 = crabs$FL, x2 = crabs$RW) r <- zelig(y ~ x1 + x2, model = "probit", data = d) summary(r) op <- par(mfrow = c(2,2), mar = c(4,4,3,2)) plot(r) par(op)
N <- dim(d)[1] new.x1 <- rnorm(N, mean(d$x1), sd(d$x1)) new.x2 <- rnorm(N, mean(d$x2), sd(d$x2)) s <- sim(r, setx(r, data = data.frame(x1 = new.x1, x2 = new.x2))) plot(s)
The model can be:
To predict a quantitative variable:
ls Least Squares
normal (almost the same)
gamma Gamma regression
To predict a binary variable:
logit Logistic regression
relogit Logistic regressioni for rare events
probit Probit regression
To predict two binary variables:
blogit Bivariate logistic regression
bprobit Bivariate probit regression
To predict a qualitative variable
mlogit Multinomial logistic
To predict an ordered qualitative variable:
ologit Ordinal logistic regression
oprobit Ordinal probit regression
To predict count data:
poisson Poisson regression
negbin Negative binomial regression (as Poisson regression, but
with more dispersed observations)
To predict survival data:
exp Exponential model (the hazard rate is constant)
weibull Weibull model (the hazard rate increases with time)
lognorm Log-normal model (the hazard rate increases and then decreases)
The unification of the various statistical models is a great thing, but this library does not put enough emphasis on the graphics to be looked at before, during and after the analysis – as opposed to R Commander.
There are other user-friendly interfaces to R, which I have not had time to investigate (yet).
In this part, we present the simplest functions of R, to allow you to read data (or to simulate them – it is so easy, you end up doing it all the time, to check if your algorithms are correct (i.e., if they behave as expected if the assumptions you made on your data are satisfied) before applying them to real data) and to numerically or graphically explore them.
In the following, the “>” at the begining of the lines is R’s prompt – you should not type it.
Statistics, computations:
> # 20 numbers, between 0 and 20 > # rounded at 1e-1 > x <- round(runif(20, 0, 20), digits = 1) > x [1] 10.0 1.6 2.5 15.2 3.1 12.6 19.4 6.1 9.2 10.9 9.5 14.1 14.3 14.3 12.8 [16] 15.9 0.1 13.1 8.5 8.7 > min(x) [1] 0.1 > max(x) [1] 19.4 > median(x) # median [1] 10.45 > mean(x) # mean [1] 10.095 > var(x) # variance [1] 27.43734 > sd(x) # standard deviation [1] 5.238067 > sqrt(var(x)) [1] 5.238067 > rank(x) # rank [1] 10.0 2.0 3.0 18.0 4.0 12.0 20.0 5.0 8.0 11.0 9.0 15.0 16.5 16.5 13.0 [16] 19.0 1.0 14.0 6.0 7.0 > sum(x) [1] 201.9 > length(x) [1] 20 > round(x) [1] 10 2 2 15 3 13 19 6 9 11 10 14 14 14 13 16 0 13 8 9 > fivenum(x) # quantiles [1] 0.10 7.30 10.45 14.20 19.40 > quantile(x) # quantiles (different convention) 0% 25% 50% 75% 100% 0.10 7.90 10.45 14.15 19.40 > quantile(x, c(0,.33,.66,1)) 0% 33% 66% 100% 0.100 8.835 12.962 19.400 > mad(x) # normalized mean deviation to the median ("median average distance") [1] 5.55975 > cummax(x) [1] 10.0 10.0 10.0 15.2 15.2 15.2 19.4 19.4 19.4 19.4 19.4 19.4 19.4 19.4 19.4 [16] 19.4 19.4 19.4 19.4 19.4 > cummin(x) [1] 10.0 1.6 1.6 1.6 1.6 1.6 1.6 1.6 1.6 1.6 1.6 1.6 1.6 1.6 1.6 [16] 1.6 0.1 0.1 0.1 0.1 > cor(x, sin(x/20)) # correlation [1] 0.997286
Plot a histogram
x <- rnorm(100) hist(x, col = "light blue")
Display a scatter plot of two variables
n <- 100 x <- rnorm(n) y <- x + rnorm(n) plot(y ~ x)
Add a “regression line” to that scatter plot.
n <- 100 x <- rnorm(n) y <- x + rnorm(n) plot(y ~ x) abline(lm(y ~ x), col = "red")
Print a message or a variable
print("Hello World!")
Concatenate character strings
paste("min: ", min(x$DS1, na.rm = TRUE)))
Write in a file (or to the screen)
cat("\\end{document}\n", file = "RESULT.tex", append = TRUE)