[R] Smart Indexing

Thaler, Thorn, LAUSANNE, Applied Mathematics Thorn.Thaler at rdls.nestle.com
Mon Aug 9 11:01:14 CEST 2010


Hi all,

Suppose I have two data frames, 'a' and 'b', both containing a column
'id'. While data frame 'a' contains multiple rows sharing the same id,
data frame 'b' contains just one entry per id (i.e. a 1-to-n
relationship). For ease of modeling I now want to generate a new
data frame 'c', which is basically a copy of data frame 'a' augmented by
the values of 'b'. If I have

a <- data.frame(id = rep(1:3, each=3), val=rnorm(9))
b <- data.frame(id=1:3, set1=LETTERS[1:3], set2=5:7)

the resulting data frame should look like:

c <- data.frame(id = rep(1:3, each=3), val = a$val,
set1=rep(LETTERS[1:3], each=3), set2 = rep(5:7, each = 3))
       
While this task is just an application of some 'rep's and 'c's for
structured data frames, it is somewhat cumbersome (and error-prone) to
construct 'c' explicitly for less structured data. Thus, I was thinking
of using R's smart indexing with an index vector, i.e.:

ind <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
c.prime <- cbind(a, b[ind,-1])
rownames(c.prime) <- NULL
all.equal(c.prime , c) # TRUE

At the moment I generate the index vector ind as follows:

tmp <- seq_along(b$id)
names(tmp) <- b$id
ind <- tmp[as.character(a$id)]  # look up by name; an integer index would select by position

However, I think there should be a smarter way of doing that
without having to define a temporary variable. Some combination of
match, which, %in% maybe? Any hints?
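
A minimal sketch of one such combination, using base R's match():
match(x, table) returns, for each element of x, the position of its
first exact match in table, which is the index vector wanted here. The
data frames are the ones defined above; the set.seed() call is an
addition made only so the sketch is reproducible.

```r
set.seed(1)  # assumption: added only for reproducibility
a <- data.frame(id = rep(1:3, each = 3), val = rnorm(9))
b <- data.frame(id = 1:3, set1 = LETTERS[1:3], set2 = 5:7)

# Position in b$id of each element of a$id; NA where an id has no entry in b
ind <- match(a$id, b$id)

# Same construction as above, without a temporary lookup vector
c.prime <- cbind(a, b[ind, -1])
rownames(c.prime) <- NULL
```

Because match() compares values rather than positions, this should not
depend on the ids being 1, 2, 3 or on 'b' being sorted by id.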

While writing these lines, it occurred to me that

ind <- pmatch(a$id, b$id, duplicates.ok = TRUE)

could do the job. Or would I run into trouble with the "partial
matching" involved in pmatch?
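
To make the partial-matching concern concrete, here is a small sketch
comparing match() and pmatch(). The id vectors a_id and b_id are
hypothetical and chosen so that one id (1) has no entry in the table
but is a prefix of another id (10); pmatch() coerces both arguments to
character before matching.

```r
a_id <- c(1, 2, 1)  # hypothetical ids; 1 has no exact counterpart in b_id
b_id <- c(10, 2)    # hypothetical table; "10" starts with "1"

# Exact matching only: ids without an entry come back as NA
ind_exact <- match(a_id, b_id)

# Partial matching: "1" is a unique partial match for "10"
ind_partial <- pmatch(a_id, b_id, duplicates.ok = TRUE)
```

With these inputs match() flags the missing id with NA, whereas
pmatch() silently maps it to the row for id 10, so match() looks like
the safer choice whenever some ids might be absent from 'b'.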

BTW, is there a way to prevent R from assigning [row|col]names? In the
example above I had to remove the rownames generated by cbind
explicitly; is there a one-liner?
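
One possible one-liner, sketched under the assumption that calling
data.frame() directly is acceptable in place of cbind(): according to
the data.frame() documentation, supplying row.names = NULL yields the
automatic integer sequence as row names. The 'val' column is simplified
to 1:9 here purely for illustration.

```r
a <- data.frame(id = rep(1:3, each = 3), val = 1:9)  # val simplified for illustration
b <- data.frame(id = 1:3, set1 = LETTERS[1:3], set2 = 5:7)

# Combine and reset the row names in a single call
c.prime <- data.frame(a, b[match(a$id, b$id), -1], row.names = NULL)

rownames(c.prime)
```

This only addresses row names; column names are still taken from the
inputs.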

Thanks for your input + BR

Thorn


