[R] Is there an efficient way to find the overlapping, upstream and downstream ranges for a bunch of ranges

何尧 heyao at pku.edu.cn
Tue Apr 5 19:27:48 CEST 2016


I have a set of genes (about 50,000) from the whole genome, which I have read in as genomic ranges (a GRanges object).

Each range (gene) can be seen as an observation with three columns: chromosome, start and end, like this:

       seqnames start end width strand
gene1     chr1     1   5     5      +
gene2     chr1    10  15     6      +
gene3     chr1    12  17     6      +
gene4     chr1    20  25     6      +
gene5     chr1    30  40    11      +
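
For reproducibility, the five example ranges above can be built roughly as follows. This is only a minimal sketch of the toy data: the object name all_genes_gr is the one used in the rest of this post, and the real object holds ~50,000 genes rather than five.

```r
library(GenomicRanges)

# Toy version of all_genes_gr: the five example genes on chr1, + strand
all_genes_gr <- GRanges(
  seqnames = "chr1",
  ranges   = IRanges(start = c(1, 10, 12, 20, 30),
                     end   = c(5, 15, 17, 25, 40)),
  strand   = "+"
)
names(all_genes_gr) <- paste0("gene", 1:5)

# Shows seqnames, start, end, width and strand, one row per gene
as.data.frame(all_genes_gr)
```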

I am wondering whether there is an efficient way to find the overlapping, upstream and downstream genes for each gene in the GRanges object.

For example, assuming all_genes_gr is a GRanges object of ~50,000 genes, the result I want looks like this:

gene_name  upstream_gene  downstream_gene  overlapped_gene
gene1      NA             gene2            NA
gene2      gene1          gene4            gene3
gene3      gene1          gene4            gene2
gene4      gene3          gene5            NA

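Currently, the strategy I use is like this:

library(GenomicRanges)

# For the gene at position idx, compare it against all other genes
find_overlapped_gene <- function(idx, all_genes_gr) {
  #cat(idx, "\n")
  curr_gene <- all_genes_gr[idx]
  other_genes <- all_genes_gr[-idx]
  # number of other genes that overlap the current gene
  n <- countOverlaps(curr_gene, other_genes)
  # note: this returns curr_gene itself if it overlaps any other gene
  gene <- subsetByOverlaps(curr_gene, other_genes)
  return(list(n, gene))
}

# time the first 100 genes only
system.time(lapply(1:100, function(idx) find_overlapped_gene(idx, all_genes_gr)))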
However, for 100 genes this takes nearly 8 s according to system.time(). That means that for 50,000 genes it would take roughly one hour just to find the overlapping genes.

I am wondering whether there is any algorithm or strategy to do this efficiently, perhaps 50,000 genes in ~10 min or even less.
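
One direction that might avoid the per-gene loop is to call the vectorised GenomicRanges functions once on the whole object: findOverlaps() for the overlapping genes, follow() for the nearest upstream gene and precede() for the nearest downstream gene. The following is only a sketch on the toy data, not a benchmarked solution. The helper name neighbour_table is made up for illustration, it assumes the ranges carry names, and "upstream"/"downstream" are taken relative to the strand, as follow() and precede() do by default.

```r
library(GenomicRanges)

# Toy data: the five example genes from this post
gr <- GRanges("chr1",
              IRanges(start = c(1, 10, 12, 20, 30),
                      end   = c(5, 15, 17, 25, 40)),
              strand = "+")
names(gr) <- paste0("gene", 1:5)

# Hypothetical helper: one row per gene with its neighbours
neighbour_table <- function(gr) {
  # All pairwise overlaps in one call, then drop each gene's hit on itself
  hits <- findOverlaps(gr, gr)
  hits <- hits[queryHits(hits) != subjectHits(hits)]

  # Collapse the names of overlapping genes per query gene
  overlapped <- rep(NA_character_, length(gr))
  if (length(hits) > 0) {
    per_gene <- tapply(names(gr)[subjectHits(hits)], queryHits(hits),
                       paste, collapse = ",")
    overlapped[as.integer(names(per_gene))] <- as.vector(per_gene)
  }

  data.frame(
    gene_name       = names(gr),
    # follow(): nearest non-overlapping gene that comes before each gene
    upstream_gene   = names(gr)[follow(gr, gr)],
    # precede(): nearest non-overlapping gene that comes after each gene
    downstream_gene = names(gr)[precede(gr, gr)],
    overlapped_gene = overlapped,
    stringsAsFactors = FALSE
  )
}

res <- neighbour_table(gr)
print(res)
```

On the toy data this reproduces the four rows of the wanted table above. Whether it reaches the ~10 min target on 50,000 genes would still need to be checked with system.time().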



