Statistical Analysis: an Introduction using R/R/Factors

Categorical variables in R are stored as a special vector object known as a factor. This is not the same as a character vector filled with a set of names (don't get the two mixed up). In particular, R has to be told that each element can only be one of a number of known levels (e.g. Male or Female). If you try to place a data point with a different, unknown level into the factor, R will complain. When you print a factor to the screen, R will also list the possible levels that factor can take (this may include ones that aren't present)

The factor() function creates a factor and defines the available levels. By default the levels are taken from the ones in the vector***. Actually, you don't often need to use factor(), because when reading data in from a file, R assumes by default that text should be converted to factors (see Statistical Analysis: an Introduction using R/R/R/Data frames). You may need to use as.factor(). Internally, R stores the levels as numbers from 1 upwards, but it is not always obvious which number corresponds to which level, and it should not normally be necessary to know.

Ordinal variables, that is factors in which the levels have a natural order, are known to R as ordered factors. They can be created in the normal way a factor is created, but in addition specifying ordered=TRUE.
Input:
state.region #An example of a factor: note that the levels are printed out

state.name #this is *NOT* a factor state.name[1] <- "Any text" #you can replace text in a character vector state.region[1] <- "Any text" #but you can't in a factor state.region[1] <- "South" #this is OK state.abb #this is not a factor, just a character vector

character.vector <- c("Female", "Female", "Male", "Male", "Male", "Female", "Female", "Male", "Male", "Male", "Male", "Male", "Female", "Female" , "Male", "Female", "Female", "Male", "Male", "Male", "Male", "Female", "Female", "Female", "Female", "Male", "Male", "Male", "Female" , "Male", "Female", "Male", "Male", "Male", "Male", "Male", "Female", "Male", "Male", "Male", "Male", "Female", "Female", "Female") #a bit tedious to do all that typing
  1. might be easier to use codes, e.g. 1 for female and 2 for male

Coded <- factor(c(1, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 1, 1)) Gender <- factor(Coded, labels=c("Female", "Male")) #we can then convert this to named levels

Result: