Working with Large Scale Data in R

Keith Chamberlain 2010/04/26

Introduction

On the R-Help listserv, questions about working with large files are common. A “large file” may be smaller than many would expect, and an amount of data that some operations handle without trouble may be too much for others.

There is a reason why some files or analyses become tricky to work with once the data exceed some unspecified size. R handles data differently from other statistical packages, such as SAS and SPSS. Rather than loading objects (or parts of objects) into memory when needed, R loads entire objects (e.g. a whole data file) into the workspace at once. This memory architecture was a deliberate design choice, but it is a disadvantage when working with data sets that exceed system memory, as in the case of the so-called “out of memory regression.”

Normally, one starts with a question to ask of the data. The entire data set is loaded into the R workspace, perhaps after a successful call to read.table(). Models are fit, and hopefully the question is answered.
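A rough way to see how much memory a fully loaded object occupies is object.size(). The sketch below is illustrative only: the data frame is simulated rather than read from a real file, and the 50 million row target is a hypothetical figure chosen for the extrapolation.

```r
# Simulated data standing in for a file loaded with read.table()
n <- 1e5
dat <- data.frame(id = seq_len(n), x = rnorm(n), y = rnorm(n))

# Size of the whole object as held in the workspace
size_bytes <- as.numeric(object.size(dat))
cat("In memory:", format(object.size(dat), units = "Mb"), "\n")

# Rough extrapolation to a hypothetical 50 million row version of the data
bytes_per_row <- size_bytes / n
est_gb <- bytes_per_row * 5e7 / 1024^3
cat("Estimated size at 5e7 rows:", round(est_gb, 2), "Gb\n")
```

The estimate covers only the stored object. Intermediate copies made while fitting a model can require considerably more memory than this.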

With large scale data there are more steps involved. The question being asked of the data must be restructured. In some cases, small modifications to the R environment may be all that is needed to compensate for the extra data. For some analysis problems, a 'divide and conquer' approach may work. For other analysis problems, 'divide and conquer' may not be straightforward or practical, and may not yet be possible. The purpose of this resource is to present some options to the R user for handling large scale data, and to provide a clearinghouse for instruction and examples related to data I/O and large files in R, databases, and the fantastic add-on packages available. The aim is to focus on topics that result in cross-platform solutions, though there are exceptions.

Large scale data. Any amount of data that, because of its size, requires an R user to reformulate how questions of the data are answered.

Requisites

It is important to have read the manuals R Data Import/Export and An Introduction to R. A basic familiarity with connections and with the read/write functions of the base and utils packages is strongly recommended. Large files, however, are not required.

?connection
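As a reminder of why connections matter here, the sketch below opens a file connection once and reads it a few lines at a time. It is only an illustration: the ten-line temporary file and the chunk size of 4 are made-up values standing in for a genuinely large file.

```r
# Write a small example file (a stand-in for a genuinely large one)
path <- tempfile(fileext = ".txt")
writeLines(sprintf("record %d", 1:10), path)

# Open the connection explicitly so the read position persists between calls
con <- file(path, open = "r")
chunk_sizes <- integer(0)
repeat {
  lines <- readLines(con, n = 4)   # read at most 4 lines per pass
  if (length(lines) == 0) break    # nothing left: end of file
  chunk_sizes <- c(chunk_sizes, length(lines))
}
close(con)
unlink(path)

print(chunk_sizes)
```

Because the connection stays open, each call to readLines() continues where the previous one stopped, so only one chunk is held in memory at a time.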

What is not covered

Readers who need to read and write file formats from applications other than R, such as SPSS, SAS, Stata, and others, should consult the documentation for the foreign package, the UCLA Academic Technology Services Statistical Consulting Group, and these wiki namespaces:

  • Exchanging data between R and MS Windows apps (ms_windows)
  • Translations between software packages (translations)

Approach

A basic approach to handling large problems is to break them down into smaller parts.

The same basic principle applies to many (but not all) statistical problems with large scale data, though it may not be straightforward. As a result, some details that were handled automatically when all of the data fit into memory may need to be treated explicitly.

Pre-processing, such as sorting, may be needed to set up the problem, while post-processing may be required to update model fits based on new information.
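One way to picture updating a fit as new chunks arrive is a least-squares regression built from accumulated cross-products. The sketch below is an assumption-laden illustration, not a recommended implementation: the data are simulated, the chunk size of 250 rows is arbitrary, and solving the normal equations directly is less numerically stable than the QR-based approach lm() uses.

```r
# Simulated data written to disk, standing in for a file too large to load
set.seed(1)
n <- 1000
full <- data.frame(x = rnorm(n))
full$y <- 2 + 3 * full$x + rnorm(n)
path <- tempfile(fileext = ".csv")
write.csv(full, path, row.names = FALSE)

con <- file(path, open = "r")
# Read the header line once and strip the quotes added by write.csv()
header <- gsub('"', "", strsplit(readLines(con, n = 1), ",")[[1]])

XtX <- matrix(0, 2, 2)   # running sum of X'X
Xty <- matrix(0, 2, 1)   # running sum of X'y
repeat {
  lines <- readLines(con, n = 250)            # one chunk of rows
  if (length(lines) == 0) break
  chunk <- read.csv(text = lines, header = FALSE, col.names = header)
  X <- cbind(1, chunk$x)                      # intercept and predictor
  XtX <- XtX + crossprod(X)                   # update with this chunk
  Xty <- Xty + crossprod(X, chunk$y)
}
close(con)
unlink(path)

# Coefficients from the accumulated sums (intercept, slope)
beta <- drop(solve(XtX, Xty))
print(beta)
```

Only the two small matrices of running sums persist between chunks, so memory use depends on the number of predictors rather than the number of rows.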

The processing may include:

  • sequential windows of a data file
  • overlapped windows
  • zero-padded windows to enable fast computations (e.g. FFT)
  • sorting via quicksort (in memory)
  • sorting via mergesort (out of memory)
  • importing chunks based on levels of grouping present in the data
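Two of the items above, overlapped windows and zero padding for the FFT, can be sketched together. The signal, the window width of 100, and the overlap of 50 are made-up values for illustration, and the series is held in memory here only to keep the example short.

```r
# Example signal standing in for a series read piecewise from a file
x <- sin(2 * pi * (0:249) / 10)

width   <- 100   # samples per window
overlap <- 50    # samples shared by consecutive windows
starts  <- seq(1, length(x) - width + 1, by = width - overlap)

# Next power of two at or above the window width, for a fast FFT
n_pad <- nextn(width, factors = 2)

# One zero-padded FFT per window; columns correspond to windows
spectra <- sapply(starts, function(s) {
  w <- x[s:(s + width - 1)]
  fft(c(w, numeric(n_pad - width)))   # pad with trailing zeros
})

print(starts)
print(dim(spectra))
```

With real data, each window would be read from the file as needed, with the overlapping samples carried over from the previous read.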

There are several useful add-on packages for large scale data and out of memory regression. Hopefully, over time, this tutorial will expand sufficiently to do justice to those topics. The coverage of scan() provides a common language and conceptual reference for working with add-on packages and databases.
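As a small taste of the scan() idiom, the sketch below reads sequential windows from an open connection and keeps only running totals. The ten-value temporary file and the window of 3 lines are hypothetical values chosen to keep the example short.

```r
# Small numeric file, one value per line (a stand-in for a large file)
path <- tempfile(fileext = ".txt")
write(1:10, path, ncolumns = 1)

con <- file(path, open = "r")
total  <- 0
n_read <- 0
repeat {
  # Read the next window of up to 3 lines from the current position
  vals <- scan(con, what = numeric(), nlines = 3, quiet = TRUE)
  if (length(vals) == 0) break     # end of file reached
  total  <- total + sum(vals)      # keep only the running summaries
  n_read <- n_read + length(vals)
}
close(con)
unlink(path)

running_mean <- total / n_read
print(running_mean)
```

The same loop structure carries over to other summaries, as long as each one can be updated from one window at a time.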

Topics

Memory

Output (Data export)

Input (Data import)

SQL in R

NoSQL in R

Package/package combinations (beyond 'just fits')

Reviews/comparisons

Table & topics likely to change frequently.

Contributors.

Acknowledgments

 
large_scale_data.txt · Last modified: 2012/02/27 by kchamberln
 