Factors

Factors are R’s data type for storing categorical variables; they are typically unordered but can be ordered too. A factor is stored as a character vector of unique levels plus an integer vector of indices (codes) into those levels. For instance:

 > x <- factor(letters[1:5])
 > x
 [1] a b c d e
 Levels: a b c d e
 > str(x)
 Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
 > as.numeric(x)
 [1] 1 2 3 4 5
 > levels(x)
 [1] "a" "b" "c" "d" "e"
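
As a minimal sketch of the two ideas above (storage as codes plus levels, and ordered factors), the following uses made-up size labels that are not part of the original text:

```r
# Hypothetical example data: sizes with a natural order
sizes <- c("small", "large", "medium", "small")
f <- factor(sizes, levels = c("small", "medium", "large"), ordered = TRUE)

# A factor is an integer vector of codes plus a "levels" attribute
codes <- as.integer(f)        # 1 3 2 1
labels <- levels(f)[codes]    # recovers the original strings

# Ordered factors support comparisons based on the level order
is_big <- f >= "medium"       # FALSE TRUE TRUE FALSE
```

Indexing levels(f) by the codes is essentially what as.character(f) does.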

The levels are stored in a definite order, which determines the meaning of the integer codes. By default, the levels of a factor created from a character vector are sorted alphabetically, while those of a factor created from a numeric vector are sorted numerically.

f <- factor(c("one", "two", "three", "two", "three"))
levels(f) # [1] "one"   "three" "two"
as.integer(f) # [1] 1 3 2 3 2

The levels and their order may be specified explicitly:

x <- c("one", "two", "three", "two", "three")
f <- factor(x, levels=c("one", "two", "three", "four"))
levels(f) # [1] "one"   "two"   "three" "four"
as.integer(f) # [1] 1 2 3 2 3

A factor with unused levels can be useful as input to table(), which then reports a zero count for each unused level. Continuing the previous example, we obtain:

f
# [1] one   two   three two   three
# Levels: one two three four
table(f)
# f
#   one   two three  four 
#     1     2     2     0 
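
Conversely, when the unused levels are not wanted, base R's droplevels() removes them. A minimal sketch, rebuilding the same factor so that the block runs on its own (the names g and counts are only illustrative):

```r
x <- c("one", "two", "three", "two", "three")
f <- factor(x, levels = c("one", "two", "three", "four"))

# Drop levels that do not occur in the data
g <- droplevels(f)
levels(g)           # "one" "two" "three"

# table() no longer reports the empty level "four"
counts <- table(g)
```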

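Compare the following: a factor created from a numeric vector sorts its levels numerically, while a factor created from the corresponding character vector sorts them as strings.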

x <- c(1, 2, 3, 11, 12, 13, 101, 102, 103)
 
factor(x)
#[1] 1   2   3   11  12  13  101 102 103
#Levels: 1 2 3 11 12 13 101 102 103
 
factor(as.character(x))
#[1] 1   2   3   11  12  13  101 102 103
#Levels: 1 101 102 103 11 12 13 2 3

To create the levels in the order in which they first occur in the data, use unique():

x <- c("one", "two", "three", "two", "three")
f <- factor(x, levels=unique(x))
levels(f) # [1] "one"   "two"   "three"
as.integer(f) # [1] 1 2 3 2 3

Some of the tips below are based on suggestions from R-help. See also section 8.2 (“Chimeras”, pp. 68-71) of Patrick Burns’ R Inferno for useful information on dealing with factors.

  • Combine some levels of a factor together.
f <- factor(rep(c("Engineer", "Doctor", "Teacher"), times=2))
f
# [1] Engineer Doctor   Teacher  Engineer Doctor   Teacher 
# Levels: Doctor Engineer Teacher
combine <- c("Doctor", "Teacher")
levels(f)[levels(f) %in% combine] <- paste(abbreviate(combine, 5), collapse = "&")
f
# [1] Engineer    Doctr&Techr Doctr&Techr Engineer    Doctr&Techr Doctr&Techr
# Levels: Doctr&Techr Engineer
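  • Convert a factor with numeric labels back to numbers: as.numeric(f) returns the integer codes, so convert via as.character() first.
 > f <- factor(4:6)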
 > f
[1] 4 5 6
Levels: 4 5 6
 > as.numeric(f)
[1] 1 2 3
 > as.character(f)
[1] "4" "5" "6"
 > as.numeric(as.character(f))
[1] 4 5 6
  • Reorder factor levels: factor(f, levels = ...). For example:
 > f <- factor(c("A","B","C"))
 > f
[1] A B C
Levels: A B C
 > f <- factor(f, levels = c("B","A","C"))
 > f
[1] A B C
Levels: B A C

Warning: don’t try to do this by assigning to levels(f) directly. That relabels the existing levels instead of reordering them, so the data values change:

 > f <- factor(c("A","B","C"))
 > levels(f) <- c("B","A","C")
 > f
[1] B A C
Levels: B A C
  • Use factor() to make table() report every value in a contiguous range, including values that do not occur in the data:
x <- c(1, 2, 3, 6, 7, 7, 9, 10, 10, 11)
table(x)
#x
# 1  2  3  6  7  9 10 11
# 1  1  1  1  2  1  2  1
 
table(factor(x, levels=1:11)) 
# 1  2  3  4  5  6  7  8  9 10 11
# 1  1  1  0  0  1  2  0  1  2  1
  • Reorder levels according to the values of another variable: see ?reorder.
  • Recode factors: recode() from the car package; from ?recode:
library(car)  # provides recode()
x <- rep(1:3, 3)
x
## [1] 1 2 3 1 2 3 1 2 3
recode(x, "c(1,2)='A'; else='B'")
## [1] "A" "A" "B" "A" "A" "B" "A" "A" "B"
recode(x, "1:2='A'; 3='B'")
## [1] "A" "A" "B" "A" "A" "B" "A" "A" "B"

Note that recode() parses the recode specification as a string, so recoding to or from levels whose names contain semicolons is difficult.

  • Relabel factors: levels(f) <- newlabels
 > f <- factor(c("ALL","NOTHING"))
 > f
[1] ALL     NOTHING
Levels: ALL NOTHING
 > levels(f) <- c("All","Nothing")
 > f
[1] All     Nothing
Levels: All Nothing
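
For the reorder() tip in the list above, a minimal sketch with made-up data (the names group and score are hypothetical): the levels of a factor are sorted by a summary of a second variable computed within each level.

```r
# Hypothetical data: a grouping factor and a numeric score per observation
group <- factor(c("a", "b", "c", "a", "b", "c"))
score <- c(5, 1, 3, 7, 2, 4)

# Reorder the levels of `group` by the mean score within each level
g <- reorder(group, score, FUN = mean)
levels(g)   # "b" "c" "a"  (group means 1.5, 3.5, 6)
```

Only the level order changes; the data values themselves stay the same.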

Note: lattice’s shingle generalises the concept of a factor to continuous variables (an alternative to cut()).
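
For comparison, a minimal sketch of cut() from base R, which bins a continuous variable into a factor with non-overlapping intervals. The data and break points are made up for illustration:

```r
# Hypothetical continuous measurements
x <- c(1, 5, 12, 18, 25)

# Bin into right-closed intervals (0,10], (10,20], (20,30]
bins <- cut(x, breaks = c(0, 10, 20, 30))
levels(bins)       # "(0,10]"  "(10,20]" "(20,30]"
as.integer(bins)   # 1 1 2 2 3
```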

Note: relevel() (method stats:::relevel.factor) moves one chosen level to the first position and leaves the order of the others unchanged. It can be generalised to an arbitrary number of levels (see http://markmail.org/thread/v2qbtwkyu7pi4aia).
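
A minimal sketch of relevel() with made-up data:

```r
f <- factor(c("A", "B", "C", "B"))

# Make "C" the first (reference) level; the remaining order is kept
g <- relevel(f, ref = "C")
levels(g)         # "C" "A" "B"
as.character(g)   # "A" "B" "C" "B"  (values unchanged)
```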

Factor versus character type

In older versions of R, a factor was more space efficient than a character vector, since each distinct value, however often repeated, was stored only once in the levels attribute. The current implementation of character vectors stores only pointers into a global cache of the strings used in a session, so repeated strings are stored only once in character vectors as well.

Manipulating character vectors with functions such as c(), strsplit() and as.numeric() is simpler than the corresponding operations on factors. Applied to a factor directly, such functions may operate on the integer codes rather than on the labels: as.numeric() always does, and c() did so in R versions before 4.1.0 (since 4.1.0, c() on factors returns a factor). To operate on the labels, convert explicitly with as.character(), for example

lev <- letters[1:3]  # avoid naming a variable "levels", which masks the function
x <- factor(c("b", "a", "a"), levels=lev)
y <- factor(c("a", "c", "a"), levels=lev)
c(x, y) # R < 4.1.0: [1] 2 1 1 1 3 1 (integer codes)
factor(c(as.character(x), as.character(y)), levels=lev)
#[1] b a a a c a
#Levels: a b c
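
The label-based combination above can be wrapped in a small helper that also handles factors with different level sets. This is only a sketch; combine_factors() is a hypothetical name, not a base R function:

```r
# Hypothetical helper: combine two factors via their labels, not their codes
combine_factors <- function(a, b) {
  lev <- union(levels(a), levels(b))   # keep the levels of both inputs
  factor(c(as.character(a), as.character(b)), levels = lev)
}

x <- factor(c("b", "a"))
y <- factor(c("c", "a"))
z <- combine_factors(x, y)
levels(z)         # "a" "b" "c"
as.character(z)   # "b" "a" "c" "a"
```

Going through as.character() gives the same result regardless of how c() treats factors in the installed R version.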
 
tips/data-factors/factors.txt · Last modified: 2011/01/18 by psavicky
 