[R] data frame pointers?

Thu Oct 24 02:39:16 CEST 2013

On Oct 23, 2013, at 5:24 PM, David Winsemius wrote:

> 
> On Oct 23, 2013, at 4:36 PM, Jon BR wrote:
> 
>> Hello,
>>   I've been running several programs in the unix shell, and it's time to
>> combine results from several different pipelines.  I've been writing shell
>> scripts with heavy use of awk and grep to make big text files, but I'm
>> thinking it would be better to have all my data in one big structure in R
>> so that I can query whatever attributes I like, and print several
>> corresponding tables to separate files.
>> 
>> I haven't used R in years, so I was hoping somebody might be able to
>> suggest a solution or combination of functions that could help me get
>> oriented..
>> 
>> Right now, I can import my data into a data frame that looks like this:
>> 
>> df <-
>> data.frame(case=c("case_1","case_1","case_2","case_3"),gene=c("gene1","gene1","gene1","gene2"),issue=c("nsyn","amp","del","UTR"))
>>> df
>>   case  gene issue
>> 1 case_1 gene1  nsyn
>> 2 case_1 gene1   amp
>> 3 case_2 gene1   del
>> 4 case_3 gene2   UTR
>> 
>> 
>> I'd like to cook up some combination of functions/scripting that can
>> convert a table like df to produce a list or a data frame/ matrix that
>> looks like df2:
>> 
>>> df2
>>       case_1 case_2 case_3
>> gene1 nsyn,amp    del      0
>> gene2        0      0    UTR
>> 
>> I can build df2 manually, like this:
>> df2
>> <-data.frame(case_1=c("nsyn,amp","0"),case_2=c("del","0"),case_3=c("0","UTR"))
>> rownames(df2)<-c("gene1","gene2")
> 
> Factors will be a hassle:
> 
> df <-
> data.frame(case=c("case_1","case_1","case_2","case_3"), gene=c("gene1","gene1","gene1","gene2"), issue=c("nsyn","amp","del","UTR"), stringsAsFactors=FALSE)

Note also that stringsAsFactors can be set globally with options(), or per call in any of the cousins of read.table when reading the data in.
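
As a minimal sketch of the per-call route, assuming the table arrives as whitespace-delimited text with a header row (the inline `txt` string and the name `df_in` are only stand-ins for a real file and are not from the original post):

```r
# Hypothetical input: the same four rows as df, as whitespace-delimited text.
txt <- "case gene issue
case_1 gene1 nsyn
case_1 gene1 amp
case_2 gene1 del
case_3 gene2 UTR"

# Keep strings as character vectors at read time instead of factors.
df_in <- read.table(text = txt, header = TRUE, stringsAsFactors = FALSE)

# Record the class of each column so the effect can be checked.
col_classes <- sapply(df_in, class)
print(col_classes)
```

With a real file the `text =` argument would be replaced by the file name. In R 4.0.0 and later the default is already stringsAsFactors=FALSE, so the explicit argument mainly matters on older installations.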

> df
> 
> dmat <- with( df, matrix( tapply(issue, list(gene, case), list) ,
>                   nrow=length(unique(gene)),ncol=length(unique(case)) )
>      )
> dmat
> 
>     [,1]        [,2]  [,3] 
> [1,] Character,2 "del" NA   
> [2,] NA          NA    "UTR"
> 
>> dmat[1,1]
> [[1]]
> [1] "nsyn" "amp" 
> 
>> as.data.frame(dmat)
>         V1  V2  V3
> 1 nsyn, amp del  NA
> 2        NA  NA UTR
> 
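
If the goal is exactly the df2 layout from the question (comma-joined strings, "0" for empty cells, genes as row names), one possible base-R sketch is to let tapply paste the issues together rather than collect them in a list. This is a variation on the approach above, not the only way to do it, and the name `wide` is just illustrative:

```r
df <- data.frame(case  = c("case_1", "case_1", "case_2", "case_3"),
                 gene  = c("gene1", "gene1", "gene1", "gene2"),
                 issue = c("nsyn", "amp", "del", "UTR"),
                 stringsAsFactors = FALSE)

# One comma-separated string per gene/case cell; empty cells come back as NA.
wide <- with(df, tapply(issue, list(gene, case), paste, collapse = ","))

# Replace the empty cells with "0", as in the hand-built df2.
wide[is.na(wide)] <- "0"

# The dimnames of the matrix become the row and column names.
df2 <- as.data.frame(wide, stringsAsFactors = FALSE)
print(df2)
```

Because tapply is called without the matrix() wrapper, the gene and case labels are kept as dimnames instead of being dropped.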

It's possible that, coming back to R after many years, you are not familiar with data.table. It's particularly well suited to large text files. Its syntax, with arguments to "[", is quite different.

> library(data.table)
> dt <- data.table(df)
# To make a list in each category you would need to supply a "doubly `list`-ed" argument to "j".

> dt[ , list(list(issue)), by=c("gene", 'case') ]
    gene   case       V1
1: gene1 case_1 nsyn,amp
2: gene1 case_2      del
3: gene2 case_3      UTR

> dt[ , list(issue), by=c("gene", 'case') ]
    gene   case issue
1: gene1 case_1  nsyn
2: gene1 case_1   amp
3: gene1 case_2   del
4: gene2 case_3   UTR
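
If plain comma-separated strings are preferred to list columns, a sketch along the same lines is to paste inside "j". It assumes the data.table package is installed; the column name `issues` and the object name `collapsed` are made up for the example:

```r
library(data.table)

df <- data.frame(case  = c("case_1", "case_1", "case_2", "case_3"),
                 gene  = c("gene1", "gene1", "gene1", "gene2"),
                 issue = c("nsyn", "amp", "del", "UTR"),
                 stringsAsFactors = FALSE)
dt <- data.table(df)

# One row per gene/case group, with that group's issues joined into one string.
collapsed <- dt[ , list(issues = paste(issue, collapse = ",")),
                 by = c("gene", "case") ]
print(collapsed)
```

The result is still in long form, one row per gene/case pair that actually occurs; getting to the wide df2 layout would need a further reshaping step.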

> 
>> 
>> but obviously do not want to do this by hand; I want R to generate df2 from
>> df.
>> 
>> Any pointers/ideas would be most welcome!
>> 
>> Thanks,
>> Jonathan
>> 
>> 	[[alternative HTML version deleted]]
> 
> R-help is a plain text mailing list. Old school, admittedly, but much better for coding questions. Surely an awk user can appreciate the wisdom of that request?
> 
> -- 
> David Winsemius
> Alameda, CA, USA
> 

David Winsemius
Alameda, CA, USA