[R] Advice needed on awkward tables

Tue May 11 08:53:40 CEST 2010

Greg Orm wrote:
> Dear r-help list members,
>
> I am quite new to R, and hope to seek advice from you about a problem I have
> been cracking my head over. Apologies if this seems like a simple problem.
>
>   
I would not call it exactly simple... but mostly because your data
representation is so obscure.

> I have essentially two tables. The first (Table A) is a standard patient
> clinicopathological data table, where rows correspond to patient IDs and
> columns correspond to clinical features. Records in this table are stored as
> 1 or 0 (denoting presence). An example is provided below.
>
> The second (Table B) is a table that represents a 'key' to Table A. This
> Table B has a category field, as well as a feature field which links to the
> Table B. Unfortunately, this is a one-to-many relationship, and the numbers
> in the feature field represent the respective columns in Table A, delimited
> by semicolons. So in the example below, I need to collapse the data in Table
> B into a table with nrow equivalent to the number of categories and ncol =
> number of patients. The collapsing of each categoriy, will be based on a
> Boolean OR, or the equivalent ANY in R (so long as 1 of the features is
> true, the resulting outcome will be true)
>
> data.table.a <-
> matrix(data=round(runif(100)),nrow=10,ncol=10,dimnames=list(paste("Patient",1:10),paste("Feature",1:10)))
> data.table.b <- data.frame
> (ID=c(1,2,3,4,5,6,7),CATEGORY=c(1,2,3,3,4,5,5),FEATURE=c("9","3;5","7","4","6;10","1;2","8"))
>
> In the example tables above, we hope to collapse the features by category -
> so the final desired output will be a total of 10 patients as rows, and a
> total of 5 categories as columns. (after collapsing the features by a
> Boolean OR). (i.e. if any of the features in the category are present, it
> will be a TRUE).
>
> I apologize for the apparently awkward table, but this is what I had to
> start with. I tried expanding data.table.b$FEATURE using strsplit, which
> resulted in a list, and then I got stuck there for a long time.
>
> Thanks for any help.
>
> Greg
>   
Replacing your bizarre representation of features with something sane...

library(reshape)

data.table.b2 <- 
data.frame(ID=c(1,2,2,3,4,5,5,6,6,7),FEATURE=c(9,3,5,7,4,6,10,1,2,8))

data.table.a.melt <- melt( data.table.a )
names(data.table.a.melt) <- c("Patient", "FEATURE", "value" )

data.table.a.melt$FEATURE <- 
as.numeric(sub("([^0-9]*)(\\d+)$","\\2",data.table.a.melt$FEATURE,perl=TRUE))

data.table.merge1 <- merge( data.table.a.melt, data.table.b2 )
data.table.merge2 <- merge( data.table.merge1, 
data.table.b[c("ID","CATEGORY")] )
data.table.merge2$value <- 0!=data.table.merge2$value
result <- cast(data.table.merge2, Patient ~ CATEGORY, any )