[BioC] bug in GenomicRanges gaps?

Cook, Malcolm MEC at stowers.org
Fri Sep 6 02:53:58 CEST 2013


Marco,

genes.noStrand is unset when you first reference it, so it is hard to know what you are trying to do.

Does your definition of intergenic want to ignore strand (i.e. can the same locus be genic on one strand and integenic on another?)

I'm guessing the problem you are witnessing is that the strand _is_ not '*' when you thought you had set it to '*'.  Check this!

Also, does your definition of intergenic want to include non-genic segments that are the very beginning/end of a chromosome (and therefore only have a gene on one-side and thus arguably are not truly inter-genic).  If so, your approach is not going to include these chromosome-terminal regions.  In which case, you might try:

genic.gr<-unlist(transcriptsBy(tr.db,'gene'),use.names=FALSE)                                                                                                                       
strand(genic.gr)<-'*'                                                                                                                                                               
genic.gr<-reduce(genic.gr)                                                                                                                                                          
chrom.gr<-as(seqinfo(tr.db),'GRanges')                                                                                                                                              
intergenic.gr<-setdiff(chrom.gr,genic.gr)   

stopifnot(0==length(findOverlaps(intergenic.gr,ignoreSelf=TRUE)))                                                                                                                   

~ malcolm_cook at stowers.org

________________________________________
From: bioconductor-bounces at r-project.org [bioconductor-bounces at r-project.org] on behalf of Blanchette, Marco
Sent: Thursday, September 05, 2013 7:14 PM
To: BioConductor
Subject: [BioC] bug in GenomicRanges gaps?

Hi,

Is this a bug of normal behavior... (I suspect it's a bug). I am trying to compute the intergenic regions using GenomicFeatures txdb and GenomicRanges gaps() and the GRanges return contains entry spanning all chromosomes on both strand... On a mock GRanges, I don't get this weird behavior (See below). Any suggestions?

thanks

> library(GenomicFeatures)
### Downloading the txdb from UCSC
> txdb <- makeTranscriptDbFromUCSC(genome="dm3",
                                 tablename="ensGene")

### Getting the transcripts for each genes
> allTx <- transcriptsBy(txdb, by='gene')

### Geting the GRanges for each genes
> genes <- unlist(range(allTx))

### Removing the strand to perform the reduce operation
> strand(genes) <- '*'

### Now, let's find the gaps
> genic <- reduce(genes.noStrand)

### Because of teh reduce, there should be no overlaping genic regions
> length(findOverlaps(genic,ignoreSelf=TRUE))
[1] 0

### Finding the intergenic regions
> intergenic <- gaps(genic)

### The difference of non-overlaping ranges should return non-overlaping gaps, but...
> length(findOverlaps(intergenic,ignoreSelf=TRUE))
[1] 46300

### What's funky is that I am getting gaps spanning the full lenght of each chromosomes
### Why???
> chrs.width <- seqlengths(seqinfo(intergenic))
> intergenic[width(intergenic) == gene2chrsLength]
GRanges with 30 ranges and 0 metadata columns:
        seqnames        ranges strand
           <Rle>     <IRanges>  <Rle>
   [1]     chr2L [1, 23011544]      +
   [2]     chr2L [1, 23011544]      -
   [3]     chr2R [1, 21146708]      +
   [4]     chr2R [1, 21146708]      -
   [5]     chr3L [1, 24543557]      +
   ...       ...           ...    ...
  [26]   chrXHet [1,   204112]      -
  [27]   chrYHet [1,   347038]      +
  [28]   chrYHet [1,   347038]      -
  [29] chrUextra [1, 29004656]      +
  [30] chrUextra [1, 29004656]      -
  ---
  seqlengths:
       chr2L     chr2R     chr3L     chr3R ...   chrXHet   chrYHet chrUextra
    23011544  21146708  24543557  27905053 ...    204112    347038  29004656


### On Fake data
> genes <- GRanges(seqnames=c(rep("2L",3),rep("3L",3)),
                 ranges=IRanges(start=rep(c(20,200,350),2),end=rep(c(80,275,450),2)))
> length(findOverlaps(intergenic,ignoreSelf=TRUE))
[1] 0
> intergenic
GRanges with 6 ranges and 0 metadata columns:
      seqnames     ranges strand
         <Rle>  <IRanges>  <Rle>
  [1]       2L [  1,  19]      *
  [2]       2L [ 81, 199]      *
  [3]       2L [276, 349]      *
  [4]       3L [  1,  19]      *
  [5]       3L [ 81, 199]      *
  [6]       3L [276, 349]      *
  ---
  seqlengths:
   2L 3L
   NA NA

--
Marco Blanchette, Ph.D.
Assistant Investigator
Stowers Institute for Medical Research
1000 East 50th St.

Kansas City, MO 64110

Tel: 816-926-4071
Cell: 816-726-8419
Fax: 816-926-2018

_______________________________________________
Bioconductor mailing list
Bioconductor at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list