— Thomas Wutzler 2006/09/14
Using the built-in dataset airquality, first load the data and check what variables it contains:
data(airquality)
names(airquality)
There are two usual ways to drop rows from a data frame in the S language. Either you keep every other row by indexing with a vector of logical values, or you remove the unwanted rows with a vector of negative indices.
Assume that the airquality measurements for May 5 and May 7 are outliers and you want to repeat an analysis without these rows. Using subset(), which takes a logical condition, you write:
length(airquality$Day)
airquality2 <- subset(airquality, !(Day %in% c(5, 7) & Month == 5))
length(airquality2$Day)
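The same rows can also be dropped with bracket indexing and an explicit logical vector, which is the first of the two approaches described above. This is only a minimal sketch; the names keep and airquality2b are made up for illustration.

```r
data(airquality)

# TRUE for every row that should stay, FALSE for May 5 and May 7
keep <- !(airquality$Day %in% c(5, 7) & airquality$Month == 5)

# Logical vector in the row position, nothing in the column position
airquality2b <- airquality[keep, ]

nrow(airquality)    # number of rows before
nrow(airquality2b)  # number of rows after, two fewer
```

Unlike subset(), bracket indexing does not silently drop rows where the condition is NA. Day and Month contain no missing values, so it makes no difference here.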
Similarly, you can delete rows by position. To delete rows 2 and 7, you write:
length(airquality$Day)
airquality3 <- airquality[-c(2, 7), ]
length(airquality3$Day)
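Negative indices refer to row positions, not to row names. The following sketch illustrates the difference, relying on the default row names of airquality; airquality3 is redefined here so that the snippet runs on its own.

```r
data(airquality)
airquality3 <- airquality[-c(2, 7), ]

# The original row names are kept, so the gaps show which rows were removed
head(rownames(airquality3))

# Position 2 of the reduced data frame is now the row originally named "3"
rownames(airquality3)[2]
```

This matters when you delete rows in several steps: a position that was correct for the original data frame may point at a different row afterwards.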
— Claudia Beleites 2008/01/02
Be careful with logical versus numeric index vectors:
new.data <- data[!outliers, ]  # logical indices
new.data <- data[-outliers, ]  # numeric indices
Using the numeric form with a logical index vector coerces TRUE to -1 and FALSE to 0, so whenever any element is TRUE, only the first row is deleted:
> data <- 1:10
> outliers <- data %in% 3:7
> new.data <- data[!outliers]  # (desired effect)
> new.data
[1]  1  2  8  9 10
> new.data <- data[-outliers]  # (wrong code)
> new.data
[1]  2  3  4  5  6  7  8  9 10
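To see why the second form misbehaves, it helps to look at what unary minus does to a logical vector. The sketch below also shows which() as one way to turn a logical vector into positions that can be negated. The variable names follow the example above; idx, none and the empty-case check are additions for illustration.

```r
data <- 1:10
outliers <- data %in% 3:7

# Unary minus turns the logical vector into numbers: TRUE -> -1, FALSE -> 0
-outliers

# which() converts the logical vector to the positions of its TRUE elements
idx <- which(outliers)
new.data <- data[-idx]
new.data

# Caveat: if nothing is flagged, -which() is empty and selects nothing,
# so guard that case or simply use the logical form data[!none]
none <- rep(FALSE, length(data))
length(data[-which(none)])
length(data[!none])
```

The logical form with ! needs no such guard, which is a reason to prefer it whenever the outliers are already available as a logical vector.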