[BioC] ties & strandedness in distanceToNearest GRanges

Valerie Obenchain vobencha at fhcrc.org
Wed Feb 27 18:52:54 CET 2013


Hi Tom,

Thanks for reporting the bug. A fix was checked into GenomicRanges 
1.11.32 (devel) and 1.10.7 (release). Both versions will be available 
through biocLite() on Thursday. The problem was in nearest(). The 
distance computation for potential nearest ranges was correct for "+" 
but not for "-".

Using the first negative range in your example below as the query,

     query <- GRanges("10", IRanges(96723746, width=2), strand="-")

and a small collection of transcripts in 'rat' that surround this range 
as the subject.

     subject <- GRanges("10",
                     IRanges(c(95919265, 97203491),
                         c(96311060, 97204143)),
                     strand=c("-", "-"))


Previously nearest was giving the correct answer for "+" but not "-".
 > nearest(query, subject, ignore.strand=TRUE)
[1] 1
 > nearest(query, subject, ignore.strand=FALSE)
[1] 2

Now strand is handled correctly.
 > nearest(query, subject, ignore.strand=TRUE)
[1] 1
 > nearest(query, subject, ignore.strand=FALSE)
[1] 1


Valerie


On 02/23/13 11:11, Tom Oates wrote:
> Hi
> I am using distanceToNearest on a datset of CpG dinucleotides and the rat
> transcripts from the latest ensembl build.
> Datasets as below:
>
> CpGs
> GRanges with 6 ranges and 4 metadata columns:
>        seqnames                 ranges strand |
>           <Rle>              <IRanges>  <Rle> |
>    [1]       10 [ 96723746,  96723747]      - |
>    [2]        7 [ 13641170,  13641171]      + |
>    [3]       16 [ 17772801,  17772802]      - |
>    [4]        3 [ 88173502,  88173503]      - |
>    [5]       13 [106979682, 106979683]      + |
>    [6]        9 [104393139, 104393140]      + |
>
> rat <- makeTranscriptDbFromBiomart(
>    biomart="ENSEMBL_MART_ENSEMBL",
>    dataset='rnorvegicus_gene_ensembl',
>    host="ensembl.org")
> rat_tx<-transcripts(rat)
>
> distances<-distanceToNearest(diff.cpgs.gr, rat.transcripts, ignore.strand=F)
>
> distances
> DataFrame with 1133 rows and 3 columns
>       queryHits subjectHits  distance
>       <integer>   <integer> <integer>
> 1            1        5962    479744
> 2            2       23710     65549
> 3            3       11077    199011
> 4            4       18109    101821
> 5            5        8159    664239
> 6            6       27327    457961
> 7            7       25795         0
> 8            8       25108     26868
> 9            9       14471    202908
>
> When I manually look through the object "distances" I have found that some
> negative strand CpGs have been assigned nearest transcripts which aren't
> the nearest.
> For example,
>
> ===========B==============B==CG========A=======A===
>
> The object distances contains a subjectHit reference to transcript A even
> though the CG is nearer to transcript B (and the transcript is on the
> negative strand so it would make more sense anyway to go to transcript B).
> The problem is not solved by:
> distanceToNearest(diff.cpgs.gr, rat.transcripts, ignore.strand=F,
> select=all)
>
> Any help would be appreciated
> Thanks
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list