[R] Scatter plot - using colour to group points?

Ian Robertson igr at stanford.edu
Wed Nov 23 18:08:30 CET 2011


Hello all,

Yesterday I wrote to Michael Weylandt to ask for some help in understanding 
a line of code he used in responding to SarahH's query about controlling 
colours in scatter plots. He wrote an excellent explanation that 
deserves to be shared here. Below I include the code I wrote while 
experimenting with the problem (indicating the specific line of code I 
asked him about) followed by Michael's thoughtful reply.

Saludos - Ian

-- 
Ian G. Robertson
Department of Anthropology
Building 50, 450 Serra Mall
Stanford University, CA 94305-2034
e:    igr at stanford.edu

#the code:
##########################################
x1 <- rnorm(13)
y1 <- rnorm(13)

#these two lines from R. Michael Weylandt
X = letters[c(1,2,3,3,1,2,1,3,3,1,2,2,1)]
colX = c("red","green","blue")[as.factor(X)] #?? How does this work? Ask RMW

table(colX)
plot(x1, y1, col=colX, pch=20, cex=2)
##########################################
#Michael Weylandt's explanation:

In short, there are two key bits to follow:

1) What happens when you "factorize" something -- R stores factors
internally as integers with special labels and a few special behaviors
for some calculations that won't come up here: the labels aren't so
important for our purpose, but the key is that each unique value of X
gets assigned to its own level. The integer each level gets follows the
sorted order of the unique values (alphabetical for strings), not the
order in which they first appear in X, and not their "real" values (if
they were already integers or doubles). As a side point this means that
floating point trouble can sometimes show up, so if you want to bin
real numbers, it's safer to use cut() for the factoring step.

2) What happens when you use a factor to subset -- R simply tosses out
the "factor"-ness and only uses the internal integer representation.
If we wanted to be more explicit, we'd write
colVec[as.integer(as.factor(X))] but the as.integer happens
automatically.
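
The two points above can be checked directly at the console. This is
only a small sketch: the names f, g, palette3, z and bins are made up
for illustration, and the break points passed to cut() are arbitrary.

```r
X <- letters[c(1, 2, 3, 3, 1, 2, 1, 3, 3, 1, 2, 2, 1)]
f <- as.factor(X)

levels(f)       # the unique values of X, in sorted order
as.integer(f)   # the integer code stored for each element of X

palette3 <- c("red", "green", "blue")   # illustrative name, not from the thread

# Indexing with the factor and with its integer codes gives the same result
colX <- palette3[f]
colX_explicit <- palette3[as.integer(f)]
identical(colX, colX_explicit)

# Codes follow the sorted levels, not the order of appearance
g <- as.factor(c("c", "a", "b"))
as.integer(g)

# Binning real numbers with cut() rather than factoring them directly
z <- c(0.1, 0.45, 0.8)
bins <- cut(z, breaks = c(0, 1/3, 2/3, 1))
palette3[bins]
```

Printing levels(f) next to as.integer(f) is usually the quickest way to
see which colour each group will end up with.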

So the whole path is: assign integers to each unique value of X and
subset by those integers. If there are as many unique values as there
are elements of the color vector, the end result is a direct matching.
If there are too many unique values, the extra ones index past the end
of the color vector and come back as NA (R does not throw an error
here). If there are too few, some colors go unused.

Something like:

c("red","green","blue")[as.factor(letters[1:4])] ## 4th element is NA

c("red","green","blue")[as.factor(letters[1:2])] ## blue not used.
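
Running both cases shows what comes back when the number of unique
values and the number of colours differ. Again just a sketch: cols,
too_many, too_few and enough_colours are illustrative names, and the
final check is one possible guard, not something from the original
thread.

```r
cols <- c("red", "green", "blue")

# Four levels but only three colours: the fourth index is out of range
too_many <- cols[as.factor(letters[1:4])]
too_many
anyNA(too_many)

# Two levels and three colours: the third colour is never selected
too_few <- cols[as.factor(letters[1:2])]
too_few

# One way to catch the mismatch before plotting (an assumed convention)
f <- as.factor(letters[1:4])
enough_colours <- nlevels(f) <= length(cols)
enough_colours
```

Since the out-of-range case yields NA rather than an error, a check
like the last one is worth doing before passing the result to plot().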

Hope this helps,

Michael


