[BioC] Deleting object rows while looping

Martin Morgan mtmorgan at fhcrc.org
Mon Apr 29 15:39:39 CEST 2013


On 04/29/2013 01:28 AM, Daniel Berner wrote:
> Hi
> Can someone help me with this question: I have a large data frame (say 'dat') with 2 columns, one is genomic loci (chromosome-by-position, e.g. 'chr1_1253454'), the other is Illumina sequences. Now I want to perform some operations on each UNIQUE locus. I thus derive the unique loci:
>
> u.loc<-unique(dat[,1])
>
> … and build a loop allowing me to access the relevant data for each unique locus, and to perform my operations:
>
> for(i in 1:length(u.loc)){
>              subdat<-subset(dat, dat[,1]==u.loc[i])
>              # now the relevant sequence data are accessible for my operations...
> }

Hi Daniel --

One possibility is to use split

   lst = split(dat[,2], dat[,1])

(creating, say, 1 million data.frames with a call like split(dat, dat[,1])
would be very expensive; stick with plain vectors if possible) and then use
lapply, sapply or mapply

   result = lapply(lst, doWork)

but it is probably better to think about how to implement 'doWork' so that it 
operates on the entire vector of sequences at once, avoiding the cost of 
invoking doWork separately for each unique value of dat[,1]. Hints about what 
'doWork' does might lead to suggestions on how to vectorize it (or to existing 
functions that already implement it efficiently).
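
As a minimal sketch of both ideas, assuming a toy 'dat' with made-up values and 
a hypothetical 'doWork' that simply counts the distinct sequences at a locus 
(the real 'doWork' is not shown in this thread, so the vectorized form below 
applies only to this stand-in):

```r
# Toy stand-in for the real 10-million-row data frame (made-up values).
dat <- data.frame(
    locus = c("chr1_100", "chr1_100", "chr2_200", "chr1_100", "chr2_200"),
    seq   = c("ACGT", "ACGT", "TTGA", "ACGA", "TTGA"),
    stringsAsFactors = FALSE
)

# Hypothetical per-locus operation: number of distinct sequences.
doWork <- function(seqs) length(unique(seqs))

# Per-group approach: split the sequence vector by locus, then apply.
lst <- split(dat[, 2], dat[, 1])
result <- sapply(lst, doWork)

# Vectorized alternative for this particular doWork: drop duplicated
# (locus, sequence) rows once, then count the remaining rows per locus.
uniq <- unique(dat)
result2 <- table(uniq[, 1])
```

Both 'result' and 'result2' hold one count per locus; the second form calls no 
R-level function per locus, which is where the savings would come from with 
many unique loci.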

Martin


>
> This works fine. But since I have some 10 million rows in the dat object, the subset() call takes time, hence the whole code is slow. I would therefore like to get rid of the rows already processed within the loop, which would speed up the code as it progresses. I therefore thought about adding this as the last line within the loop:
>
> dat<-dat[-as.integer(row.names(subdat)),]
>
> This should eliminate the processed lines and continuously reduce the size of the dat object. However, the output I get with this line is incorrect: it does not agree with the output I get without row deletion. It seems the deletion does not work correctly. Any idea why this is, and how I could do the row elimination properly?
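
A likely explanation: after rows are removed, the remaining rows keep their 
original row names, whereas a negative integer index removes rows by current 
position. From the second iteration on, as.integer(row.names(subdat)) therefore 
no longer points at the rows of subdat. The sketch below illustrates this on 
toy data; the values and the helper names countPerLocus, byPosition and byName 
are made up for illustration.

```r
# Toy data (made-up values); row names default to "1".."6".
dat <- data.frame(
    locus = c("c", "a", "b", "a", "b", "c"),
    seq   = c("s1", "s2", "s3", "s4", "s5", "s6"),
    stringsAsFactors = FALSE
)
u.loc <- unique(dat[, 1])

# Count the rows found per locus while shrinking 'dat' with 'dropRows',
# a function taking (dat, subdat) and returning the reduced data frame.
countPerLocus <- function(dat, u.loc, dropRows) {
    n <- integer(length(u.loc))
    names(n) <- u.loc
    for (i in seq_along(u.loc)) {
        subdat <- subset(dat, dat[, 1] == u.loc[i])
        n[i] <- nrow(subdat)
        dat <- dropRows(dat, subdat)
    }
    n
}

# Deletion as in the question: row names are used as positions.
byPosition <- function(dat, subdat) dat[-as.integer(row.names(subdat)), ]

# One possible fix: match on row names instead of positions.
byName <- function(dat, subdat) dat[!(row.names(dat) %in% row.names(subdat)), ]

broken <- countPerLocus(dat, u.loc, byPosition)  # c = 2, a = 2, b = 0
fixed  <- countPerLocus(dat, u.loc, byName)      # c = 2, a = 2, b = 2
```

In the 'broken' run, processing locus "a" deletes both "b" rows, because the 
row names "2" and "4" are applied as positions to an already shortened data 
frame. Even with the fix, each deletion copies the remaining data frame, so 
this approach may not be faster than splitting once.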
>
> Thanks!
>
> Daniel Berner
> Zoological Institute
> University of Basel
> Vesalgasse 1
> 4051 Basel
> Switzerland
> +41 (0)61 267 0328
> daniel.berner at unibas.ch


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


