[BioC] Deleting object rows while looping

Mon Apr 29 18:31:45 CEST 2013

Greetings,

On Mon, Apr 29, 2013 at 6:39 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> On 04/29/2013 01:28 AM, Daniel Berner wrote:
>>
>> Hi
>> Can someone help me with this question: I have a large data frame (say
>> 'dat') with 2 columns, one is genomic loci (chromosome-by-position, e.g.
>> 'chr1_1253454'), the other is Illumina sequences. Now I want to perform some
>> operations on each UNIQUE locus. I thus derive the unique loci:
>>
>> u.loc<-unique(dat[,1])
>>
>> … and build a loop allowing me to access the relevant data for each unique
>> locus, and to perform my operations:
>>
>> for(i in 1:length(u.loc)){
>>              subdat<-subset(dat, dat[,1]==u.loc[i])
>>              # now the relevant sequence data are accessible for my
>> operations...
>> }
>
>
> Hi Daniel --
>
> One possibility is to use split
>
>   lst = split(dat[,2], dat[,1])
>
> (it would be very expensive to create, say, 1 million data.frames in a call
> like split(dat, dat[,1]); stick with vectors only if possible) and then
> l/s/mapply
>
>   result = lapply(lst, doWork)
>
> but probably better is to think about how to implement 'doWork' so that it
> operates on the entire vector of sequences, to avoid the cost of invoking
> doWork on each unique value of dat[,1]. Hints about what is in 'doWork'
> might lead to some suggestions on how to make it vectorized (or functions
> that already implement this efficiently).

These are the situations where the data.table package really shines,
if you'd like to give it a shot.

Say the colnames of your data.frame are "locus" and "sequence", you
would do something like so:

dt <- data.table(dat, key="locus")

result <- [, {
  ## The sequences that are in the current subset
  ## are injected into the scope of this expression
  ## by the name of their column (`sequence`), say
  ## you wanted to count the number of GATACA
  ## motifs in this subset (or something):
  list(n.motifs=length(grep('GATACA', sequence))
}, by='locus']

I'd probably store this data with chromosome and position split, ie.
with column names such as "chr", "pos", "sequence", then convert this
to a data.table and set the key to be c("chr", "pos"), eg:

dt <- data.table(dat, key=c("chr", "pos"))
result <- [, {
  list(n.motifs=length(grep('GATACA', sequence))
}, by=c("chr", "pos")]

You should find a large difference (for the better) in terms of speed
and memory use as compared to other approaches (split, ddply, etc.)
given the size of your data -- of course sometimes ||-ization is the
right way to go, but you can try both and see.

HTH,
-steve

--
Steve Lianoglou
Computational Biologist
Department of Bioinformatics and Computational Biology
Genentech