[BioC] Extract Exon Features from GenomicFeatures Leads to Inconsistent Pairing of Gene and Exon IDs

Tue Feb 19 20:49:38 CET 2013

Hi Fong,

On 01/13/2013 10:36 AM, Fong Chun Chan wrote:
> Thanks Steve. That clarifies what is going on to me. I suppose the only way
> to solve this to take the first gene id from an index that contains > 1
> gene ids.

Problem with this solution is that if exon1 belongs to gene1 and gene2,
then gene2 will have a missing exon in you final data.frame. A better
solution is to create 2 rows for exon1 in the data.frame:

      gene1    exon1
      gene2    exon1

If 'annotDf' is your DataFrame with the gene_id and exon_id cols (of
types *List and integer, respectively), you can obtain the data.frame
with:

   gene_id <- annotDf[, 'gene_id']
   exon_id <- annotDf[, 'exon_id']

   df <- data.frame(geneID=unlist(gene_id),
                    exonID=rep(exon_id, elementLengths(gene_id)))

Note that there is no loss of information with this approach i.e. it's
possible (and very easy) to reconstruct 'annotDf' from 'df'.

Hope this helps,
H.

>
> Fong
>
>
> On Sun, Jan 13, 2013 at 2:36 AM, Steve Lianoglou <
> mailinglist.honeypot at gmail.com> wrote:
>
>> Hi Fong,
>>
>> On Sun, Jan 13, 2013 at 3:09 AM, Fong Chun Chan <fongchun at alumni.ubc.ca>
>> wrote:
>>> Hi,
>>>
>>> I was hoping someone could explain this weird situation to me when I am
>>> trying to extract exon information using the GenomicFeatures Bioconductor
>>> package.  Here is what I've done:
>>>
>>> txdb <- loadFeatures(txdbFile)
>>> allExons <- exons(txdb, columns = c('gene_id', 'exon_id'))
>>> annotDf <- values(allExons)
>>> dim(annotDf)
>>>
>>> [1] 541825 2
>>>
>>> According to this code, I've extracted a total of 541825 exons which are
>>> associated with x number of genes. Now when I do the following:
>>>
>>> length(unlist(annotDf[, 'gene_id']))
>>>
>>> [1] 546664
>>>
>>> How is it possible that there are suddenly an additional 4839 genes added
>>> to the 'gene_id' column?
>>
>> The first clue is that the gene_id column is stored as a *List type,
>> which you correctly realized because you are `unlisting` it it to get
>> its length. The reason that it's a list is so that one could store
>> more than one element per index, otherwise they would have just used a
>> character (or factor) vector.
>>
>> So, what you can do is to see what's going on here -- I am using the
>> latest version of R and the "TxDb.Hsapiens.UCSC.hg19.knownGene"
>> package, so things might not be exactly the same for you, but I said
>> up my vars to mimc yours, so:
>>
>> R> gene.ids <- annotDf$gene_id
>> R> counts <- elementLengths(gene.ids) ## number of elements per bin
>> R> table(counts)
>> counts
>>       1      2      3      4      5      6      9     15     22
>> 282939   3788    101      7      6      1      4      3      3
>>
>> So we see that several gene ids can be assigned to one exon. You might
>> as why, and I reckon the answer is that the exons are defined as
>> "unique" simply by their chr,strand,start,end combo and several genes
>> (or transcripts) might share the same exon (in "genomic space"). Let's
>> see:
>>
>> R> allExons[elementLengths(gene.ids) > 1]
>> GRanges with 3913 ranges and 2 metadata columns:
>>           seqnames               ranges strand   |
>> gene_id   exon_id
>>              <Rle>            <IRanges>  <Rle>   |
>> <CompressedCharacterList> <integer>
>>       [1]     chr1 [  324288,   324345]      +   |
>> 100132287,100133331        12
>>       [2]     chr1 [10490159, 10490625]      +   |
>> 100526739,378708       897
>>       [3]     chr1 [10490804, 10491458]      +   |
>> 100526739,378708       898
>>       [4]     chr1 [10493899, 10494022]      +   |
>> 100526739,378708       899
>>       [5]     chr1 [10494714, 10494747]      +   |
>> 100526739,378708       900
>>       [6]     chr1 [10500404, 10500470]      +   |
>> 100526739,378708       901
>>       [7]     chr1 [10511434, 10512060]      +   |
>> 100526739,1325       904
>>       [8]     chr1 [16355621, 16355794]      +   |
>> 1187,1188      1504
>>       [9]     chr1 [19923471, 19923603]      +   |
>> 100532736,440574      1788
>>       ...      ...                  ...    ... ...
>>   ...       ...
>>    [3905]     chrY [26944570, 26944641]      -   |
>> 57054,57135    274737
>>    [3906]     chrY [26946960, 26947031]      -   |
>> 57054,57135    274738
>>    [3907]     chrY [26949403, 26949474]      -   |
>> 57054,57135    274739
>>    [3908]     chrY [26950875, 26951014]      -   |
>> 57054,57135    274740
>>    [3909]     chrY [26951104, 26951167]      -   |
>> 57054,57135    274741
>>    [3910]     chrY [26951604, 26951655]      -   |
>> 57054,57135    274742
>>    [3911]     chrY [26952216, 26952307]      -   |
>> 57054,57135    274743
>>    [3912]     chrY [26952582, 26952728]      -   |
>> 57054,57135    274744
>>    [3913]     chrY [26959330, 26959639]      -   |
>> 57054,57135    274745
>>
>> If you hop to any of these regions in your favorite genome browser,
>> you should find overlapping gene annotations there that share exact
>> exon boundaries across transcripts.
>>
>>> The reason this is a problem is that I would like
>>> to create a 3 column output where I use the command:
>>>
>>> data.frame( geneID = unlist(annotDf[, 'gene_id']), exonID = annotDf[,
>>> 'exon_id'], exonCounts = exonCounts)
>>>
>>> Where exonCounts is generated with the countOverlaps() function using
>>> allExons and reads from a bam file.  This command is failing because it
>>> seems that the number of gene_ids is not pairing up with the exon_ids
>>> properly. Is there something I am missing here?
>>
>> Hopefully this clarification helps?
>>
>> And also:
>>
>>> I am using 1.6.9 of GenomicFeatures, R is 2.14.2 and my txdb was built
>>> using the command:
>>
>> You should upgrade to the latest version of R and bioconductor
>> packages to get the best functionality (and help).
>>
>> Hope that helps,
>> -steve
>>
>> --
>> Steve Lianoglou
>> Graduate Student: Computational Systems Biology
>>   | Memorial Sloan-Kettering Cancer Center
>>   | Weill Medical College of Cornell University
>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319