[R] speed up subsetting with certain conditions

Martin Morgan mtmorgan at fhcrc.org
Thu Jan 13 00:12:19 CET 2011


On 1/12/2011 2:52 PM, Duke wrote:
> Hi folks,
>
> I am working on a project that requires subsetting a found file
> based on a known file. The known file contains lines like the
> following:
>
> chr1    3237546    3237547    rs52310428    0    +
> chr1    3237549    3237550    rs52097582    0    +
> chr2    4513326    4513327    rs29769280    0    +
> chr2    4513337    4513338    rs33286009    0    +
>
> where the first column can be chr2, chr1, chr12, etc. The second and
> third columns are numbers (coordinates). The found file contains
> lines like:
>
> chr1    3213435    G    C
> chr1    3237547    T    C
> chr1    3237549    G    T
> chr2    4513326    A    G
> chr2    4513337    C    G
>
> where the first column, again, can be chr1, chr2, chr12, etc., and
> the second is a number. What I have to do is separate the found file
> into two files: one (foundY) containing the lines whose first column
> matches that of some line in the known file and whose second column
> falls between that line's columns 2 and 3, and one (foundN)
> containing the lines that do not meet that condition. For the two
> examples above, foundN will be the first line, and foundY will be
> the next 4 lines.
>
> What I came up with is this algorithm:
>
> * get the unique items in the first column of the found file (chr1,
> chr2, chr12, chr13, etc.)
> * for each unique item, take the subsets of the known and found
> files that have the same first column, then scan each line of the
> found subset against the known subset to see whether it meets the
> condition
>
> The code is below:
>
> ########## CODE START###########
> # import known and found files to data frames
> known <- read.table( "known.txt", sep="\t", header=FALSE )
> found <- read.table( "found.txt", sep="\t", header=FALSE, fill=TRUE )
>
> # get the unique items in the first column of the found file
> found.Chr <- unique(as.character(found[[1]]))
>
> # create two empty result data frames
> foundN <- found[0,]
> foundY <- found[0,]
>
> # scan each unique chromosome
> for ( iChr in found.Chr ) {
>   # subsets of known and found for this chromosome
>   found.iChr <- found[found[[1]]==iChr,]
>   known.iChr <- known[known[[1]]==iChr,]
>
>   if ( nrow(known.iChr)>0 ) {
>     # scan through all found subset items
>     for ( i in 1:nrow(found.iChr) ) {
>       # no known interval contains this position -> foundN
>       if ( nrow(known.iChr[known.iChr[[3]]>=found.iChr[i,2] &
> known.iChr[[2]]<=found.iChr[i,2],])==0 ) {
>           foundN <- rbind( foundN, found.iChr[i,] )
>       } else {
>           foundY <- rbind( foundY, found.iChr[i,] )
>       }
>     }
>   } else {
>     # no known lines for this chromosome: everything goes to foundN
>     foundN <- rbind( foundN, found.iChr )
>   }
> }
>
> ########## CODE END###########
>
> The code works well, but I have tested it only with small known and
> found files. When trying larger files (the known file can contain
> ~15 million lines, the found file ~15k lines), it takes hours to run.
>
> I want to speed up the process, and I believe there must be a better
> algorithm to do this in R. My questions are:
>
> * does anybody have a better algorithm, or any comments or
> suggestions?

The Bioconductor project has many tools for dealing with 
sequence-related data. With the data

k <- read.table(textConnection(
"chr1    3237546    3237547    rs52310428    0    +
chr1    3237549    3237550    rs52097582    0    +
chr2    4513326    4513327    rs29769280    0    +
chr2    4513337    4513338    rs33286009    0    +"))

f <- read.table(textConnection(
"chr1    3213435    G    C
chr1    3237547    T    C
chr1    3237549    G    T
chr2    4513326    A    G
chr2    4513337    C    G"))

One might use the GenomicRanges package as follows:

library(GenomicRanges)
# known intervals: chromosome, start, end, with the rs id as names
kgr <- with(k, GRanges(V1, IRanges(V2, V3, names=V4), V6, score=V5))
# found positions as width-1 ranges
fgr <- with(f, GRanges(V1, IRanges(V2, width=1), V3=V3, V4=V4))
# the matching (found, known) pairs, if those are wanted later
olaps <- findOverlaps(fgr, kgr)
# TRUE where a found position falls in at least one known interval
idx <- countOverlaps(fgr, kgr) != 0

resulting in

 > idx
[1] FALSE  TRUE  TRUE  TRUE  TRUE

This will be fast: the overlap functions use an interval tree rather 
than a line-by-line scan, so the comparison scales well to millions of 
ranges.

One could write foundY with as.data.frame(fgr[idx]) (perhaps with a 
little editing), but more likely one would want to stay in R / 
Bioconductor and do something more interesting...
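
For completeness, a minimal sketch of writing out both files (the file 
names and write.table() options are illustrative assumptions, not part 
of the original post):

# subset by the overlap mask and convert back to data frames
foundY <- as.data.frame(fgr[idx])
foundN <- as.data.frame(fgr[!idx])
# write tab-separated, without quotes or row/column names, mirroring
# the input format
write.table(foundY, "foundY.txt", sep="\t", quote=FALSE,
            row.names=FALSE, col.names=FALSE)
write.table(foundN, "foundN.txt", sep="\t", quote=FALSE,
            row.names=FALSE, col.names=FALSE)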

See

http://bioconductor.org/install/index.html
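
A sketch of the installation step, using the biocLite() installer (see 
the page above for details and current instructions):

# install GenomicRanges through the Bioconductor installer
source("http://bioconductor.org/biocLite.R")
biocLite("GenomicRanges")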

Martin


> * I read (via Google) that matrices work faster than data frames.
> Can I use matrices in this case? (are matrices for numbers only?)
> * I read (via Google) that I should avoid rbind and preallocate the
> data frame for faster speed. How would I do that in this case?
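
On the last two questions: a matrix holds a single type, so mixing 
chromosome names with numeric coordinates would coerce everything to 
character; a data frame is the right structure here. To avoid rbind(), 
one can preallocate a logical vector with one entry per line of the 
found file, fill it in the loop, and subset once at the end. A minimal 
sketch of that pattern ('hit' is an illustrative name; the double loop 
is kept from the original code, so the GenomicRanges approach above is 
still the real fix for speed):

# one logical per found line, instead of growing foundY/foundN by rbind
hit <- logical(nrow(found))
for (iChr in unique(as.character(found[[1]]))) {
  fi <- which(found[[1]] == iChr)
  known.iChr <- known[known[[1]] == iChr, ]
  if (nrow(known.iChr) > 0) {
    for (i in fi) {
      # TRUE if this position lies inside any known interval
      hit[i] <- any(known.iChr[[2]] <= found[i, 2] &
                    known.iChr[[3]] >= found[i, 2])
    }
  }
}
# a single subset at the end replaces all the rbind() calls
foundY <- found[hit, ]
foundN <- found[!hit, ]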
>
> Thank you very much in advance,
>
> Bests,
>
> D.


-- 
Dr. Martin Morgan, PhD
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109


