Factors are R’s data type for storing categorical variables; they are typically unordered but can also be ordered. A factor is stored as a character vector of unique levels plus an integer vector of indices into those levels. For instance:

    > x <- factor(letters[1:5])
    > x
    [1] a b c d e
    Levels: a b c d e
    > str(x)
     Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
    > as.numeric(x)
    [1] 1 2 3 4 5
    > levels(x)
    [1] "a" "b" "c" "d" "e"
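The two components can also be inspected directly. The following is a minimal sketch in base R; `unclass()` strips the factor class so that the integer codes and the `levels` attribute become visible (the variable names `codes` and `labels` are chosen here for illustration):

```r
x <- factor(letters[1:5])

# Remove the class: what remains is an integer vector with a "levels" attribute
codes <- unclass(x)
typeof(codes)
attr(codes, "levels")

# Indexing the levels by the integer codes reconstructs the original labels
labels <- levels(x)[as.integer(x)]
identical(labels, as.character(x))
```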

The vector of levels stores the levels in a definite order, which determines the meaning of the integer indices. By default, the levels of a factor created from a character vector are in alphabetical order, while those of a factor created from a numeric vector are in increasing numeric order.

    f <- factor(c("one", "two", "three", "two", "three"))
    levels(f)
    # [1] "one"   "three" "two"
    as.integer(f)
    # [1] 1 3 2 3 2

The levels and their order may be specified explicitly:

    x <- c("one", "two", "three", "two", "three")
    f <- factor(x, levels=c("one", "two", "three", "four"))
    levels(f)
    # [1] "one"   "two"   "three" "four"
    as.integer(f)
    # [1] 1 2 3 2 3

A factor with unused levels can be useful as input to `table()`, because the unused levels are reported with a count of zero. Continuing the previous example, we obtain:

    f
    # [1] one   two   three two   three
    # Levels: one two three four
    table(f)
    # f
    #   one   two three  four
    #     1     2     2     0

Compare the level order obtained from a numeric vector with the order obtained after converting the same values to character:

    x <- c(1, 2, 3, 11, 12, 13, 101, 102, 103)
    factor(x)
    # [1] 1   2   3   11  12  13  101 102 103
    # Levels: 1 2 3 11 12 13 101 102 103
    factor(as.character(x))
    # [1] 1   2   3   11  12  13  101 102 103
    # Levels: 1 101 102 103 11 12 13 2 3

To create the levels in the order in which they first occur in the data, pass `unique(x)` as the levels:

    x <- c("one", "two", "three", "two", "three")
    f <- factor(x, levels=unique(x))
    levels(f)
    # [1] "one"   "two"   "three"
    as.integer(f)
    # [1] 1 2 3 2 3
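An explicit level order matters most for the ordered factors mentioned at the start. The sketch below uses made-up size labels to show how `ordered = TRUE` makes comparisons between values meaningful:

```r
sizes <- c("small", "large", "medium", "small")

# Levels are listed from lowest to highest
f <- factor(sizes, levels = c("small", "medium", "large"), ordered = TRUE)

is.ordered(f)
below_large <- f < "large"   # element-wise comparison against a level
largest <- max(f)            # highest level present in the data
```

For an unordered factor, the same comparison is not meaningful, which is why the order has to be declared explicitly.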

Some of the tips below are based on suggestions from R-help. See also section 8.2 ("Chimeras", pp. 68-71) of Patrick Burns’ R Inferno for useful information on dealing with factors.

- Combine some levels of a factor together.

    f <- factor(rep(c("Engineer", "Doctor", "Teacher"), times=2))
    f
    # [1] Engineer Doctor   Teacher  Engineer Doctor   Teacher
    # Levels: Doctor Engineer Teacher
    combine <- c("Doctor", "Teacher")
    levels(f)[levels(f) %in% combine] <- paste(abbreviate(combine, 5), collapse = "&")
    f
    # [1] Engineer    Doctr&Techr Doctr&Techr Engineer    Doctr&Techr Doctr&Techr
    # Levels: Doctr&Techr Engineer
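Levels can also be merged by assigning a named list to `levels(f)`: each name is a new level, and each element lists the old levels it absorbs. This is a sketch of the same merge as above; the label `"Doctor&Teacher"` is chosen here for illustration:

```r
f <- factor(rep(c("Engineer", "Doctor", "Teacher"), times = 2))

# New level name = old levels to be merged into it
levels(f) <- list(Engineer = "Engineer",
                  "Doctor&Teacher" = c("Doctor", "Teacher"))
levels(f)
as.character(f)
```

With this form the new levels appear in the order given in the list, so merging and reordering happen in one step.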

- Safely convert factors containing integers back to numeric: `as.numeric(as.character(f))` (**not** `as.numeric(f)`). For example:

    > f <- factor(4:6)
    > f
    [1] 4 5 6
    Levels: 4 5 6
    > as.numeric(f)
    [1] 1 2 3
    > as.character(f)
    [1] "4" "5" "6"
    > as.numeric(as.character(f))
    [1] 4 5 6
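A second idiom converts only the levels and then indexes them by the factor. It gives the same result and may be somewhat faster for long factors, since each distinct label is parsed only once. A sketch with made-up values:

```r
f <- factor(c(6, 4, 5, 4))

via_character <- as.numeric(as.character(f))  # convert every element
via_levels    <- as.numeric(levels(f))[f]     # convert each level once, then index

identical(via_character, via_levels)
```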

- Reorder factor levels: `factor(f, levels = ...)`

For example:

    > f <- factor(c("A","B","C"))
    > f
    [1] A B C
    Levels: A B C
    > f <- factor(f, levels = c("B","A","C"))
    > f
    [1] A B C
    Levels: B A C

**Don’t try to do it by directly manipulating the levels.** Assigning to `levels(f)` relabels the existing levels in place, so the data values change as well:

    > f <- factor(c("A","B","C"))
    > levels(f) <- c("B","A","C")
    > f
    [1] B A C
    Levels: B A C

- Use `factor()` with explicit levels to make `table()` report a contiguous range of values, including those that do not occur in the data:

    x <- c(1, 2, 3, 6, 7, 7, 9, 10, 10, 11)
    table(x)
    # x
    #  1  2  3  6  7  9 10 11
    #  1  1  1  1  2  1  2  1
    table(factor(x, levels=1:11))
    #  1  2  3  4  5  6  7  8  9 10 11
    #  1  1  1  0  0  1  2  0  1  2  1

- Reorder levels according to the value of another variable: see `?reorder`
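A minimal sketch of `reorder()`, using a made-up data frame `d`: the levels of `group` are sorted by the median of `value` within each level.

```r
d <- data.frame(
  group = factor(c("a", "b", "c", "a", "b", "c")),
  value = c(10, 1, 5, 12, 3, 7)
)

# Medians per level are a = 11, b = 2, c = 6, so the new order is b, c, a
g <- reorder(d$group, d$value, FUN = median)
levels(g)
```

Only the level order changes; the data values of `g` are the same as those of `d$group`. This is typically useful for plots in which categories should appear sorted by a summary statistic.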

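- Recode values into new categories: `recode()`. The specification syntax used below is that of `recode()` from the car package, which is assumed to be installed:

    library(car)   # assumption: recode() here is car::recode()
    x <- rep(1:3, 3)
    x
    ## [1] 1 2 3 1 2 3 1 2 3
    recode(x, "c(1,2)='A'; else='B'")
    ## [1] "A" "A" "B" "A" "A" "B" "A" "A" "B"
    recode(x, "1:2='A'; 3='B'")
    ## [1] "A" "A" "B" "A" "A" "B" "A" "A" "B"

Note that `recode()` does string processing on the `recodes` string, so recoding to or from values that contain semicolons is difficult.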
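When labels do contain semicolons, one way around the parsing problem is to recode in plain base R, where no specification string is involved. A sketch with made-up labels:

```r
x <- rep(1:3, 3)

# Semicolons in the labels are harmless because nothing is parsed
labels <- c("A;1", "B;2")
recoded <- ifelse(x %in% 1:2, labels[1], labels[2])
f <- factor(recoded, levels = labels)
```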

- Relabel factors: `levels(f) <- newlabels`

    > f <- factor(c("ALL","NOTHING"))
    > f
    [1] ALL     NOTHING
    Levels: ALL NOTHING
    > levels(f) <- c("All","Nothing")
    > f
    [1] All     Nothing
    Levels: All Nothing

- Drop unused levels:
`f[i, drop = TRUE]`

- Frequency of each level: `table(f)`
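The last two tips can be seen together in a short sketch. Subsetting keeps unused levels unless they are dropped, and `table()` reports them with zero counts; `droplevels()` is an alternative for a factor that has already been subset:

```r
f <- factor(c("a", "b", "c", "a"))
keep <- f != "c"

kept    <- f[keep]               # level "c" is unused but still recorded
dropped <- f[keep, drop = TRUE]  # the idiom from the list above
same    <- droplevels(kept)      # drops unused levels of an existing factor

table(kept)      # includes a zero count for "c"
table(dropped)   # only "a" and "b"
```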

The `shingle` class in the lattice package generalises the concept of a factor to continuous variables (an alternative to `cut()`).
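For comparison, here is a sketch of the `cut()` alternative in base R, with made-up values and break points. `cut()` assigns each value to exactly one of a set of non-overlapping intervals, whereas the intervals of a shingle are allowed to overlap:

```r
x <- c(1, 5, 10, 2.5, 7)

# Right-closed intervals (0,3], (3,7], (7,10]
f <- cut(x, breaks = c(0, 3, 7, 10))
levels(f)
table(f)
```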

`relevel()` (whose factor method is `stats:::relevel.factor`) moves one chosen level to the first position and leaves the order of the remaining levels unchanged. It can be generalised to an arbitrary number of levels (see http://markmail.org/thread/v2qbtwkyu7pi4aia).
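A sketch of both cases follows. `move_to_front()` is a hypothetical helper written for this illustration; it is not part of R and is not necessarily the approach taken in the linked thread:

```r
f <- factor(c("A", "B", "C", "D"))

# relevel() moves a single reference level to the front
one_first <- relevel(f, ref = "C")
levels(one_first)

# Hypothetical helper (not part of R): move several levels to the front,
# keeping the remaining levels in their existing order
move_to_front <- function(f, first) {
  stopifnot(all(first %in% levels(f)))
  factor(f, levels = c(first, setdiff(levels(f), first)))
}
several_first <- move_to_front(f, c("D", "B"))
levels(several_first)
```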

In older versions of R, a factor was more space-efficient than a character vector, since every distinct value, however often it is repeated, is stored only once in the levels attribute. The current implementation of character vectors stores only pointers into a global hash table of the strings used in a session, so repeated strings are stored only once in character vectors as well.

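Manipulating character vectors with functions like `c()`, `strsplit()` and `as.numeric()` is simpler than performing similar operations on factors. Applied to a factor directly, such functions do not see the labels: `as.numeric()` returns the integer codes, `strsplit()` rejects non-character input, and `c()` returned the integer codes in versions of R before 4.1.0 (since then it returns a factor whose levels are the union of the levels of its arguments). To apply these functions to the labels, convert explicitly with `as.character()`, for example:

    levels <- letters[1:3]
    x <- factor(c("b", "a", "a"), levels=levels)
    y <- factor(c("a", "c", "a"), levels=levels)
    c(x, y)    # output shown is for R < 4.1.0; newer versions return a factor
    # [1] 2 1 1 1 3 1
    factor(c(as.character(x), as.character(y)), levels=levels)
    # [1] b a a a c a
    # Levels: a b c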