— Keith Chamberlain 2010/04/26
On the R-Help listserv, questions about working with large files are common. A "large file" may be smaller than many would expect, and an amount of data that one operation handles comfortably may be too much for another.
There is a reason why some files or analyses become tricky once the data exceed some unspecified size. R handles data differently than other statistical packages, such as SAS and SPSS. Rather than loading objects (or parts of objects) into memory as they are needed, R loads entire objects (e.g. a whole data file) into the workspace at once. This memory architecture was a deliberate design choice for the software, but it is a disadvantage when a data set exceeds system memory, as in the case of the so-called "out of memory regression." Normally, one starts with a question to ask of the data. The entire data set is loaded into the R workspace, perhaps after a successful call to read.table(). Models are fit, and hopefully the question is answered.
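A minimal sketch of this behavior (the file here is a small stand-in created on the fly; with real data it could be gigabytes): read.table() materializes the entire table in the workspace, and object.size() reports the cost.

```r
## Create a small demonstration file; the point generalizes to large ones.
tf <- tempfile(fileext = ".csv")
write.csv(data.frame(x = rnorm(1000), y = rnorm(1000)), tf, row.names = FALSE)

## read.table() loads the entire table into memory at once --
## there is no lazy, record-at-a-time access as in SAS or SPSS.
dat <- read.table(tf, header = TRUE, sep = ",")

## object.size() reports the in-memory footprint of the whole object.
print(object.size(dat), units = "Kb")
```

If the file were larger than available memory, the read would fail outright rather than degrade gracefully.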
With large scale data there are more steps involved, and the question being asked of the data must be restructured. In some cases, small modifications to the R environment are all that is needed to compensate for the extra data. For some analysis problems a 'divide and conquer' approach works; for others, 'divide and conquer' is not straightforward or practical, and in some cases it is not yet possible. The purpose of this resource is to present the R user with some options for handling large scale data, and to provide a clearinghouse for instruction and examples related to data I/O and large files in R, databases, and the many add-on packages available. The aim is to focus on topics that yield cross-platform solutions, though there are exceptions.
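One example of a small environment-level adjustment (the file and column layout here are illustrative): supplying colClasses to read.table() skips its type-guessing pass over the data, and nrows (a slight overestimate is fine) lets R allocate storage once up front, both of which reduce memory churn on large reads.

```r
## Write a small file for demonstration; the technique matters for large ones.
tf <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:500, score = runif(500)), tf, row.names = FALSE)

## colClasses avoids the type-guessing pass; nrows pre-sizes the result.
## Overestimating nrows is harmless -- read.table() stops at end of file.
dat <- read.table(tf, header = TRUE, sep = ",",
                  colClasses = c("integer", "numeric"),
                  nrows = 600)
```

These arguments change nothing about the result; they only spare R the work of discovering the file's structure for itself.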
Large scale data. Any amount of data that, by its size alone, requires the R user to reformulate how questions of the data are answered in order to compensate.
It is important to have read the manuals R Data Import/Export and An Introduction to R. A basic familiarity with connections and with the read/write functions of the base and utils packages is strongly recommended. Prior experience with large files, however, is not required.
Readers looking to read and write file formats from applications other than R, such as SPSS, SAS, Stata, and others, should consult the documentation for the foreign package, UCLA: Academic Technology Services, Statistical Consulting Group, and these wiki namespaces:
A basic approach to handling large problems is to break them down into smaller parts.
The same basic principle applies to many (but not all) statistical problems with large scale data, though applying it may not be straightforward. As a result, some details that were handled automatically when all of the data fit into memory may need to be treated explicitly.
Pre-processing, such as sorting, may be needed to set up the problem, while post-processing may be required to update model fits based on new information.
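A minimal sketch of the divide-and-conquer idea, assuming a plain one-column text file: the grand mean is recovered from per-chunk sums and counts, bookkeeping that mean() would have handled implicitly had the data fit in memory.

```r
## Set up a one-column file of numbers (a stand-in for a file too big for RAM).
tf <- tempfile()
writeLines(as.character(1:10000), tf)

## Read the file in fixed-size chunks through an open connection,
## accumulating the sum and count explicitly across chunks.
con <- file(tf, open = "r")
total <- 0; n <- 0
repeat {
  chunk <- scan(con, what = numeric(), n = 2500, quiet = TRUE)
  if (length(chunk) == 0) break   # end of file reached
  total <- total + sum(chunk)
  n     <- n + length(chunk)
}
close(con)

## Post-processing: combine the partial results into the final answer.
grand.mean <- total / n   # equals mean(1:10000)
```

Only one chunk resides in memory at a time; the connection, opened explicitly, keeps its place in the file between calls to scan().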
The processing may include:
There are several useful add-on packages for large scale data and out-of-memory regression. Hopefully, over time, this tutorial will expand enough to do those topics justice. The coverage of scan() provides a common language and conceptual reference for working with the add-on packages and databases.
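As a preview of scan() (the file layout here is hypothetical), the what argument describes the record structure, and list components set to NULL tell scan() to read and discard those fields, so unneeded columns never occupy workspace memory.

```r
## A two-column file; suppose only the second column is of interest.
tf <- tempfile()
writeLines(c("a 1", "b 2", "c 3"), tf)

## NULL entries in 'what' cause the matching fields to be skipped,
## so the discarded column never takes up space in the workspace.
vals <- scan(tf, what = list(NULL, numeric()), quiet = TRUE)[[2]]
```

This selective reading is one of the simplest ways scan() outperforms read.table() on wide files where only a few columns are needed.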