[BioC] findOverlaps question from a newbie...

Valerie Obenchain vobencha at fhcrc.org
Tue May 29 18:03:59 CEST 2012


Hi Gowthaman,

Here is one approach,

(1) create the GRanges or GRangesList from gff

Using a gff file in GenomicFeatures as an example,

 >  gffFile <- system.file("extdata","a.gff3",package="GenomicFeatures")
 >      txdb <- makeTranscriptDbFromGFF(file=gffFile,
    +               format="gff3",
    +               dataSource="partial gtf file for Tomatoes for testing",
    +               species="Solanum lycopersicum",
    +               circ_seqs=DEFAULT_CIRC_SEQS,
    +               miRBaseBuild=NULL)

If you are interested in genes either exons by genes or transcripts by 
genes might be useful,

   exbygene <- exonsBy(txdb, "gene")
   txbygene <- transcriptsBy(txdb, "gene")

The gene names are associated with each outer list element of the 
GRangesList,

 > exbygene
   GRangesList of length 488:
   $gene:Solyc00g005000.2
   GRanges with 2 ranges and 2 elementMetadata cols:
           seqnames         ranges strand |   exon_id                 
exon_name
<Rle> <IRanges> <Rle> | <integer> <character>
     [1] SL2.40ch00 [16437, 17275]      + |         1 
exon:Solyc00g005000.2.1.1
     [2] SL2.40ch00 [17336, 18189]      + |         2 
exon:Solyc00g005000.2.1.2

   $gene:Solyc00g005020.1
   GRanges with 3 ranges and 2 elementMetadata cols:
           seqnames         ranges strand | exon_id                 
exon_name
     [1] SL2.40ch00 [68062, 68211]      + |       3 
exon:Solyc00g005020.1.1.1
     [2] SL2.40ch00 [68344, 68568]      + |       4 
exon:Solyc00g005020.1.1.2
     [3] SL2.40ch00 [68654, 68764]      + |       5 
exon:Solyc00g005020.1.1.3


(2) findOverlaps()

For the sake of example, I'll take a subset of one of the GRangesLists 
and overlap it with the full list,

   subset <- exbygene[c(2,4,6)]
   olap <- findOverlaps(subset, exbygene)

 > olap
Hits of length 3
queryLength: 3
subjectLength: 488
   queryHits subjectHits
<integer> <integer>
  1         1           2
  2         2           4
  3         3           6

The information in the 'Hits' object can be used to map back to the 
query and subject used in findOverlaps().
See ?queryHits and ?subjectHIts.

Next identify which subjects were hit by the queries and extract the 
gene names,

   names(exbygene[subjectHits(olap)])


(3) reconsiling gene sets

Take a look at the 'type' argument to findOverlaps(). To find exact 
matches you'll want to find ranges with type='equal', to find any 
overlaps use type='any' (default).

You may also want to look at summarizeOverlaps(). It is an extension of 
findOverlaps() that counts a read a maximum of once (i.e., no double 
counting). If a read hits multiple features different rule sets are 
applied to assign the read to a single feature.

Valerie



On 05/24/2012 10:05 AM, gowtham wrote:
> Hi Everyone,
> This is my first attempt in R/bioconductor outside of plot()/hist()
> function. Thanks to Martin Morgan, I was impressed with what
> bioconductor can do for me ( after attending bioc workshop last week).
>
> When I try to findOverlap on two GFF files I get information on which
> feature is overlapping with another feature. But, instead of using "feature
> names" it just indicates 1,2, 3...etc. Is there a way, i can make it output
> gene names. (my GRanges objects are read from gff files using Rtracklayer&
> later have names (geneids) assigned to them using names() function).
>
>> findOverlaps(au, am)
> Hits of length 1740
> queryLength: 790
> subjectLength: 1054
>       queryHits subjectHits
>        <integer>    <integer>
>   1            1           1
>   2            1         528
>   3            2          13
>   4            2         540
>   5            3         136
>
>
>
> And here is the background of what i am trying to do. In case, if any of
> you have a better suggestion:
> I run 3 de novo gene prediction algorithm on our newly assembled genome.
> And I would like to reconcile the gene sets from all of them to genes (a)
> predicted with exact same boundaries in all three (b) boundaries not same
> but overlap (c) predicted by only one or two of them.
>
> And use reduce in case of (b) to obtain outer most boundaries of
> overlapping prediction.
>
> As a novice users its bit overwhelming for me.
>
> Thanks very much,
> Gowthaman
>



More information about the Bioconductor mailing list