— Paul Johnson 2004/04/23, adapted to R Wiki with permission by Nick Drew 2006/02/11
Purpose – Examples showing how to merge data frames. Demonstrate how to use the merge function: Merge two data frames by common columns or row names, or do other versions of database “join” operations (from help manual). Or use other means, e.g. subscripting, to do the same.
x1 <- rnorm(100) x2 <- rnorm(100) x3 <- rnorm(100) x4 <- rnorm(100) ds1 <- data.frame(city = rep(1, 100), x1 = x1, x2 = x2) ds2 <- data.frame(city = rep(2, 100), x1 = x1, x3 = x3, x4 = x4) merge(ds1, ds2, by.x = c("city", "x1"), by.y = c("city", "x1"), all = TRUE)
by.y = c("city", "x1") helps R understand which are the “common variables” in the two datasets so it lines them up, and then all = TRUE is needed to say that you don’t want to throw away the variables that are only in one set or the other. The help page for merge provides more details.
Note that this could be simplified to:
merge(ds1, ds2, by = c("city", "x1"), all = TRUE)
experiment <- data.frame(times = c(0, 0, 10, 10, 20, 20, 30, 30), expval = c(1, 1, 2, 2, 3, 3, 4, 4)) simul <- data.frame(times = c(0, 10, 20, 30), simul = c(3, 4, 5, 6)) merge(experiment, simul)
produces
times expval simul 1 0 1 3 2 0 1 3 3 10 2 4 4 10 2 4 5 20 3 5 6 20 3 5 7 30 4 6 8 30 4 6
(from Brian D. Ripley)
exp.sim <- data.frame(experiment, simul = simul$simul[match(experiment$times, simul$times)])
(from Jim Lemon)
— Nick Drew 2006/02/08, copied from R-help Jeffrey Moore asks (2006-02-10):
Is there a way to relate 2 different arrays in R? Goal is to fill columnzindata2(below) with appropriate z-values fromdata1that correspond to a given ID. I’m looking for something akin to a relational database, or a lookup table in Excel.
data1:
ID z 1 100 2 250 3 75 4 12 5 89
data2:
ID z 1 1 1 1 2 3 4 3 4 5 5 5
John Fox replies:
Assuming that the column namedzin the matrixdata2already exists and has arbitrary content (such as 0’s or NA’s), how about the following?
> data2[ ,"z"] <- data1[data2[ ,"ID"], "z"] > data2 ID z 1 100 1 100 1 100 1 100 2 250 3 75 4 12 3 75 4 12 5 89 5 89 5 89
> popassign.mat
Individual
Selection IND0071 IND0072 IND0073 IND0074 IND0075 IND0076 IND0077 IND0078
SELECT12 TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE
SELECT13 FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE
SELECT14 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
SELECT15 TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE
SELECT16 FALSE TRUE FALSE TRUE TRUE TRUE FALSE FALSE
SELECT17 TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE
> rselect.df Individual Selection 1 IND0072 SELECT12 2 IND0072 SELECT15 3 IND0076 SELECT14 4 IND0078 SELECT12 5 IND0076 SELECT17 6 IND0071 SELECT16
> rselect.df$inPop1 <- popassign.mat[cbind(rselect.df$Selection,rselect.df$Individual)]
This section may contain inaccurate information.
Aim: you want to merge more than 2 data.frames with the same call to merge.
Four options: a for() loop (or explicit iteration); using Reduce; using a recursive call with Recall() or the merge_recurse() from package reshape.
DF1 <- data.frame(var1 = letters[1:5], a = rnorm(5)) DF2 <- data.frame(var1 = letters[3:7], b = rnorm(5)) DF3 <- data.frame(var1 = letters[6:10], c = rnorm(5)) DF4 <- data.frame(var1 = letters[8:12], d = rnorm(5)) g <- merge(DF1, DF2, by.x="var1", by.y="var1", all=T) g <- merge(g, DF3, by.x="var1", by.y="var1", all=T) g <- merge(g, DF4, by.x="var1", by.y="var1", all=T) # or my.list <- list(DF1, DF2, DF3, DF4) DF <- DF1 for ( .df in my.list ) { DF <-merge(DF,.df,by.x="var1", by.y="var1", all=T) }
Reduce(function(x, y) merge(x, y, all=T,by.x="var1", by.y="var1"), my.list, accumulate=F)
merge.rec <- function(.list, ...){ if(length(.list)==1) return(.list[[1]]) Recall(c(list(merge(.list[[1]], .list[[2]], ...)), .list[-(1:2)]), ...) } merge.rec(my.list, by.x="var1", by.y="var1", all=T, suffixes=c("", ""))
The reshape package provides the functions merge_recurse and merge_all to perform this task (essentially a wrapper around the recall solution above).
merge_recurse(my.list)
Speed comparison: the following test shows that merge_all() is the most efficient technique proposed here.
df1 <- data.frame(x=rnorm(1000), y=rnorm(1000), z=factor(letters[2:5])) ldf <- lapply(seq(1, 100), function(.) df1) system.time(Reduce(function(x, y) merge(x, y, all=T), ldf, accumulate=F)) system.time(merge_all(ldf)) system.time(merge.rec(ldf, all=T))
The zoo package contains merge.zoo which can perform multi-way merges of zoo and ts time series objects.