How to convert a three-column data frame, array, or table to a matrix?

From Hans-Jörg Bibiko

Given the following data frame df:

df
id feature value
A 2 4
B 4 8
A 3 1
D 1 7
B 2 5

Two columns of df give the row and the column of m, and the remaining column of df supplies the cell values of the matrix m.

The output matrix m should be:

m 1 2 3 4
A NA 4 1 NA
B NA 5 NA 8
D 7 NA NA NA

To get m simply type

 > m <- tapply(df[,3] , df[, c(1,2)] , c)

or for named columns

 > m <- tapply( df[,"value"] , df[, c("id","feature")] , c )
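To try the call end to end, the example data frame from above can be rebuilt by hand. This is only a small sketch: the data.frame() construction is an assumption about how df was created, since the page shows just its printed form.

```r
# Rebuild the example data frame shown at the top of the page
df <- data.frame(id      = c("A", "B", "A", "D", "B"),
                 feature = c(2, 4, 3, 1, 2),
                 value   = c(4, 8, 1, 7, 5))

# Rows come from "id", columns from "feature", cells from "value"
m <- tapply(df[, "value"], df[, c("id", "feature")], c)
print(m)
```

Row and column names of m are the sorted levels of id and feature, and combinations that never occur in df show up as NA.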

or, if empty cells may be 0 instead of NA and duplicate entries may be summed, just use xtabs

 > xtabs(value ~ id + feature, df)
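The xtabs result differs from the tapply result in two ways that are easy to overlook, so here is a short sketch of both. The object names xt, dup and xt_dup, and the extra duplicate row, are made up for illustration.

```r
df <- data.frame(id      = c("A", "B", "A", "D", "B"),
                 feature = c(2, 4, 3, 1, 2),
                 value   = c(4, 8, 1, 7, 5))

xt <- xtabs(value ~ id + feature, df)
print(xt["A", "1"])      # a cell with no data in df holds 0, not NA

# Add a second entry for the cell (A, 2) to see how duplicates are handled
dup <- rbind(df, data.frame(id = "A", feature = 2, value = 9))
xt_dup <- xtabs(value ~ id + feature, dup)
print(xt_dup["A", "2"])  # the two values 4 and 9 are summed

# If a plain matrix is needed instead of a table object
m0 <- matrix(as.vector(xt), nrow = nrow(xt), dimnames = dimnames(xt))
```

Whether 0 is an acceptable stand-in for "no data" depends on the data, since a real value of 0 can no longer be told apart from an empty cell.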

The reshape package also provides facilities for this and for more complicated reshaping.

Explanation:

R-Help:

tapply

Description

Apply a function to each cell of a ragged array, that is to each (non-empty) group of values given by a unique combination of the levels of certain factors.

Usage

tapply(X, INDEX, FUN = NULL, ..., simplify = TRUE)

Arguments

X an atomic object, typically a vector.
INDEX list of factors, each of same length as X.
FUN the function to be applied. In the case of functions like +, %*%, etc., the function name must be quoted. If FUN is NULL, tapply returns a vector which can be used to subscript the multi-way array tapply normally produces.
... ...
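As a minimal illustration of the help text quoted above, the following sketch applies a function to each group defined by a single index vector. The vectors values and group are invented for the example.

```r
values <- c(1, 2, 3, 4, 5)
group  <- c("a", "b", "a", "b", "a")

# One result per group: the groups are "ragged" (3 values in a, 2 in b)
sums   <- tapply(values, group, sum)
counts <- tapply(values, group, length)

print(sums)
print(counts)
```

With two index vectors instead of one, the same call produces a two-way array, which is what the conversion on this page relies on.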

For that conversion tapply(X, INDEX, FUN) expects the following arguments:

X represents the vector of the cell values of m.
INDEX the pair of vectors (ROWINDEX, COLUMNINDEX) that specifies the row and column of m for each value. (Hint: rows and columns are sorted.)
FUN what should be done with the values of X that fall into each cell.

In that example, FUN = c forces tapply() to concatenate the values of m["id","feature"] if more than one is listed in df for the same cell.
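What concatenation means in practice can be seen with a small data frame that lists the same cell twice. This is a sketch with made-up data (dup is not part of the original example): once any cell holds more than one value, tapply can no longer simplify the result, and m becomes a matrix of mode list.

```r
# (A, 2) occurs twice, with the values 4 and 9
dup <- data.frame(id      = c("A", "A", "B"),
                  feature = c(2, 2, 4),
                  value   = c(4, 9, 8))

m <- tapply(dup[, "value"], dup[, c("id", "feature")], c)

print(is.list(m))        # m is now a list with dimensions, not a numeric matrix
print(m["A", "2"][[1]])  # both values of the duplicated cell
print(m["B", "4"][[1]])  # a cell with a single value
```

Such a list matrix is awkward to compute with, which is why a FUN that returns exactly one value per cell is usually preferable when duplicates can occur.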

If you need only the last value of m["id","feature"], you can type for example:

 > m <- tapply( df[,"value"] , df[, c("id","feature")] , function(x) {return(x[length(x)])} )
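The same pattern works for other ways of reducing duplicates to one value. The sketch below uses an invented data frame dup in which the cell (A, 2) is listed twice, and compares keeping the last value, keeping the first value, and taking the mean.

```r
dup <- data.frame(id      = c("A", "A", "B"),
                  feature = c(2, 2, 4),
                  value   = c(4, 9, 8))
idx <- dup[, c("id", "feature")]

last_m  <- tapply(dup$value, idx, function(x) x[length(x)])  # keep the last value
first_m <- tapply(dup$value, idx, function(x) x[1])          # keep the first value
mean_m  <- tapply(dup$value, idx, mean)                      # average the duplicates

print(last_m)
print(first_m)
print(mean_m)
```

Because each function returns a single number per cell, all three results are ordinary numeric matrices.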

As documented for tapply(), each cell of m that has no data in df is filled with NA.

Because FUN is applied only to non-empty groups, it cannot change these empty cells. It can, however, replace an NA that is stored in df itself, for example:

 > m <- tapply(df[,3] , df[, c(1,2)] , function(x) {ifelse(is.na(x), 0, x)})

to set an NA in df to 0 in m.
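For cells that have no data in df at all, one simple option is to overwrite the NA entries after the matrix has been built. This is a sketch rather than the author's method, and the name m0 is made up. Note that it does not distinguish between empty cells and NA values that came from df.

```r
df <- data.frame(id      = c("A", "B", "A", "D", "B"),
                 feature = c(2, 4, 3, 1, 2),
                 value   = c(4, 8, 1, 7, 5))

m <- tapply(df[, 3], df[, c(1, 2)], c)

# Replace every NA in the finished matrix by 0
m0 <- m
m0[is.na(m0)] <- 0
print(m0)
```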

If you want full control over the conversion, you can try the following function.

It is not optimized; the code is deliberately straightforward, for clarity only.

 
matify<-function(df, na.value = NA, srcRow = 1, srcCol = 2, sortRow = TRUE, sortCol = TRUE){
	
	################
	#
	# df         matrix or data.frame of the dimension n x 3
	# na.value   default value for all cells (NULL will be replaced by NA)
	# srcRow     which column of df specifies the output rows; number or dimname
	# srcCol     which column of df specifies the output columns; number or dimname
	# sortRow    if TRUE sorts the output row names
	# sortCol    if TRUE sorts the output column names
	#
	#
	# date       14 / 07 / 2006
	# author     Hans-Joerg Bibiko
	# mail       bibiko@eva.mpg.de
	#
	#
	# example:
	#   df :=    lgs     feat     value
	#             A       3         4
	#             B       3         1
	#             A       2         5
	#             C       2         3
	#             C       3         9
	#             D       1         3
	#
	#   matify(df, srcRow = "lgs", srcCol = "feat")
	#
	#       1   2   3
	#   A   NA  5   4
	#   B   NA  NA  1
	#   C   NA  3   9
	#   D   3   NA  NA
	#
	################
	
	
	# Error handling: df must have exactly three columns
	if(ncol(df) != 3) stop("df must have three columns!")
	if(srcCol == srcRow) stop("srcCol == srcRow!")
	
	# Get the missing dimension for filling the cells
	if(is.null(colnames(df))) { # df has no column names
		srcVal <- (1:3)[-c(srcRow, srcCol)]
	} else {                    # df has column names
		srcVal <- colnames(df)[
			colnames(df) != ifelse(is.numeric(srcRow), colnames(df)[srcRow], srcRow) & 
			colnames(df) != ifelse(is.numeric(srcCol), colnames(df)[srcCol], srcCol)]
	}
 
	# If na.value is set to NULL, replace it with NA for convenience
	if(is.null(na.value)) {
		na.value <- NA
		warning("Set na.value to NA for convenience!")
	}
 
	# Warning if na.value already occurs in the df column used for filling the cells
	# (any() tests the values directly; indexing df with a row-length logical fails for data frames)
	if(is.na(na.value)) {
		if(any(is.na(df[, srcVal]))) warning(paste("na.value occurs in df[, ",srcVal,"]!", sep=""))
	} else {
		if(any(df[, srcVal] == na.value, na.rm = TRUE)) warning(paste("na.value occurs in df[, ",srcVal,"]!", sep=""))
	}
 
	# Create unique row names
	mrow <- unique(df[, srcRow])
	# Sort the row names?
	if(sortRow) mrow <- sort(mrow)
	
	# Create unique column names
	mcol <- unique(df[, srcCol])
	# Sort the column names?
	if(sortCol) mcol <- sort(mcol)
 
	# Create the output matrix filled with na.value
	m <- matrix(na.value, nrow=length(mrow), ncol=length(mcol), dimnames=list(mrow, mcol))
	
	# Loop through df to fill the cells of the output matrix
	# (seq_len() also behaves correctly when df has zero rows)
	for(i in seq_len(nrow(df))) m[as.character(df[i, srcRow]), as.character(df[i, srcCol])] <- df[i, srcVal]
	
	return(m)
}
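Since the function above is explicitly not optimized, here is a sketch of how the fill loop could be replaced by a single vectorized assignment. The helper fill_matrix and its arguments are hypothetical and not part of the original code; it assumes the row keys, column keys, and values are passed as three vectors of equal length. The data are taken from the example in the function's header comment.

```r
# Hypothetical helper, not part of matify(): builds the matrix without a loop
fill_matrix <- function(rows, cols, values, na.value = NA) {
  mrow <- sort(unique(rows))
  mcol <- sort(unique(cols))
  m <- matrix(na.value, nrow = length(mrow), ncol = length(mcol),
              dimnames = list(as.character(mrow), as.character(mcol)))
  # A two-column index matrix addresses one (row, column) cell per input row
  m[cbind(match(rows, mrow), match(cols, mcol))] <- values
  m
}

df <- data.frame(lgs   = c("A", "B", "A", "C", "C", "D"),
                 feat  = c(3, 3, 2, 2, 3, 1),
                 value = c(4, 1, 5, 3, 9, 3),
                 stringsAsFactors = FALSE)

m <- fill_matrix(df$lgs, df$feat, df$value)
print(m)
```

The sorting, the na.value default, and the dimnames follow the same idea as in the function above. The argument checks and warnings are left out to keep the sketch short.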
 
tips/data-matrices/convert_table_to_matrix.txt · Last modified: 2009/10/15
 