Merge data frames

Paul Johnson 2004/04/23, adapted to R Wiki with permission by Nick Drew 2006/02/11 :N:

Purpose – Examples showing how to merge data frames. Demonstrate how to use the merge function: Merge two data frames by common columns or row names, or do other versions of database “join” operations (from help manual). Or use other means, e.g. subscripting, to do the same.

Appending data frames

x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- rnorm(100)
x4 <- rnorm(100)
ds1 <- data.frame(city = rep(1, 100), x1 = x1, x2 = x2)
ds2 <- data.frame(city = rep(2, 100), x1 = x1, x3 = x3, x4 = x4)
merge(ds1, ds2, by.x = c("city", "x1"), by.y = c("city", "x1"), all = TRUE)

by.y = c("city", "x1") helps R understand which are the “common variables” in the two datasets so it lines them up, and then all = TRUE is needed to say that you don’t want to throw away the variables that are only in one set or the other. The help page for merge provides more details.

Note that this could be simplified to:

merge(ds1, ds2, by = c("city", "x1"), all = TRUE)

Use a data frame as a look up table

experiment <- data.frame(times = c(0, 0, 10, 10, 20, 20, 30, 30),
    expval = c(1, 1, 2, 2, 3, 3, 4, 4))
simul <- data.frame(times = c(0, 10, 20, 30), simul = c(3, 4, 5, 6))
merge(experiment, simul)

produces

  times expval simul
1     0      1     3
2     0      1     3
3    10      2     4
4    10      2     4
5    20      3     5
6    20      3     5
7    30      4     6
8    30      4     6

(from Brian D. Ripley)

An example using subscripting

exp.sim <- data.frame(experiment, simul =
    simul$simul[match(experiment$times, simul$times)])

(from Jim Lemon)

Another example using subscripting

Nick Drew 2006/02/08, copied from R-help Jeffrey Moore asks (2006-02-10):

Is there a way to relate 2 different arrays in R? Goal is to fill column z in data2 (below) with appropriate z-values from data1 that correspond to a given ID. I’m looking for something akin to a relational database, or a lookup table in Excel.

data1:

ID	z
1	100
2	250
3	75
4	12
5	89

data2:

ID	z
1	
1	
1	
1	
2	
3	
4	
3	
4	
5	
5	
5	

John Fox replies:

Assuming that the column named z in the matrix data2 already exists and has arbitrary content (such as 0’s or NA’s), how about the following?
 > data2[ ,"z"] <- data1[data2[ ,"ID"], "z"]
 > data2
 ID   z
 1  100
 1  100
 1  100
 1  100
 2  250
 3   75
 4   12
 3   75
 4   12
 5   89
 5   89
 5   89

Data Frame with matrix lookup (multi-dimensional lookup)

> popassign.mat
        Individual
Selection  IND0071 IND0072 IND0073 IND0074 IND0075 IND0076 IND0077 IND0078
  SELECT12    TRUE    TRUE    TRUE    TRUE   FALSE   FALSE   FALSE    TRUE
  SELECT13   FALSE   FALSE    TRUE   FALSE   FALSE   FALSE    TRUE   FALSE
  SELECT14   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE
  SELECT15    TRUE    TRUE    TRUE    TRUE    TRUE    TRUE   FALSE    TRUE
  SELECT16   FALSE    TRUE   FALSE    TRUE    TRUE    TRUE   FALSE   FALSE
  SELECT17    TRUE   FALSE    TRUE   FALSE   FALSE    TRUE   FALSE   FALSE
> rselect.df
  Individual Selection
1    IND0072  SELECT12
2    IND0072  SELECT15
3    IND0076  SELECT14
4    IND0078  SELECT12
5    IND0076  SELECT17
6    IND0071  SELECT16
> rselect.df$inPop1 <- popassign.mat[cbind(rselect.df$Selection,rselect.df$Individual)]

Merge multiple data.frames

FIXME This section may contain inaccurate information.

Aim: you want to merge more than 2 data.frames with the same call to merge.

Four options: a for() loop (or explicit iteration); using Reduce; using a recursive call with Recall() or the merge_recurse() from package reshape.

  1. for loop (and explicit version)
DF1 <- data.frame(var1 = letters[1:5], a = rnorm(5))
DF2 <- data.frame(var1 = letters[3:7], b = rnorm(5))
DF3 <- data.frame(var1 = letters[6:10], c = rnorm(5))
DF4 <- data.frame(var1 = letters[8:12], d = rnorm(5))
 
g <- merge(DF1, DF2, by.x="var1", by.y="var1", all=T)
g <- merge(g, DF3, by.x="var1", by.y="var1", all=T)
g <- merge(g, DF4, by.x="var1", by.y="var1", all=T)
 
# or 
my.list <- list(DF1, DF2, DF3, DF4)
 
DF <- DF1
for ( .df in my.list ) {
  DF <-merge(DF,.df,by.x="var1", by.y="var1", all=T)
 }
  1. Reduce
 
Reduce(function(x, y) merge(x, y, all=T,by.x="var1", by.y="var1"), 
my.list, accumulate=F)
  1. Recall
merge.rec <- function(.list, ...){
	if(length(.list)==1) return(.list[[1]])
	Recall(c(list(merge(.list[[1]], .list[[2]], ...)), .list[-(1:2)]), ...)
}
merge.rec(my.list, by.x="var1", by.y="var1", all=T, suffixes=c("", ""))
  1. merge_recurse() from reshape

The reshape package provides the functions merge_recurse and merge_all to perform this task (essentially a wrapper around the recall solution above).

merge_recurse(my.list)

Speed comparison: the following test shows that merge_all() is the most efficient technique proposed here.

df1 <- data.frame(x=rnorm(1000), y=rnorm(1000), z=factor(letters[2:5]))
 
ldf <- lapply(seq(1, 100), function(.) df1)
 
 
system.time(Reduce(function(x, y) merge(x, y, all=T), ldf, accumulate=F))
system.time(merge_all(ldf))
system.time(merge.rec(ldf, all=T))

The zoo package contains merge.zoo which can perform multi-way merges of zoo and ts time series objects.

 
tips/data-frames/merge.txt · Last modified: 2011/08/23 by matifou
 
Recent changes RSS feed R Wiki powered by Driven by DokuWiki and optimized for Firefox Creative Commons License