[BioC] Extract Exon Features from GenomicFeatures Leads to Inconsistent Pairing of Gene and Exon IDs

Sun Jan 13 11:36:44 CET 2013

Hi Fong,

On Sun, Jan 13, 2013 at 3:09 AM, Fong Chun Chan <fongchun at alumni.ubc.ca> wrote:
> Hi,
>
> I was hoping someone could explain this weird situation to me when I am
> trying to extract exon information using the GenomicFeatures Bioconductor
> package.  Here is what I've done:
>
> txdb <- loadFeatures(txdbFile)
> allExons <- exons(txdb, columns = c('gene_id', 'exon_id'))
> annotDf <- values(allExons)
> dim(annotDf)
>
> [1] 541825 2
>
> According to this code, I've extracted a total of 541825 exons which are
> associated with x number of genes. Now when I do the following:
>
> length(unlist(annotDf[, 'gene_id']))
>
> [1] 546664
>
> How is it possible that there are suddenly an additional 4839 genes added
> to the 'gene_id' column?

The first clue is that the gene_id column is stored as a *List type,
which you correctly realized because you are `unlisting` it it to get
its length. The reason that it's a list is so that one could store
more than one element per index, otherwise they would have just used a
character (or factor) vector.

So, what you can do is to see what's going on here -- I am using the
latest version of R and the "TxDb.Hsapiens.UCSC.hg19.knownGene"
package, so things might not be exactly the same for you, but I said
up my vars to mimc yours, so:

R> gene.ids <- annotDf$gene_id
R> counts <- elementLengths(gene.ids) ## number of elements per bin
R> table(counts)
counts
     1      2      3      4      5      6      9     15     22
282939   3788    101      7      6      1      4      3      3

So we see that several gene ids can be assigned to one exon. You might
as why, and I reckon the answer is that the exons are defined as
"unique" simply by their chr,strand,start,end combo and several genes
(or transcripts) might share the same exon (in "genomic space"). Let's
see:

R> allExons[elementLengths(gene.ids) > 1]
GRanges with 3913 ranges and 2 metadata columns:
         seqnames               ranges strand   |
gene_id   exon_id
            <Rle>            <IRanges>  <Rle>   |
<CompressedCharacterList> <integer>
     [1]     chr1 [  324288,   324345]      +   |
100132287,100133331        12
     [2]     chr1 [10490159, 10490625]      +   |
100526739,378708       897
     [3]     chr1 [10490804, 10491458]      +   |
100526739,378708       898
     [4]     chr1 [10493899, 10494022]      +   |
100526739,378708       899
     [5]     chr1 [10494714, 10494747]      +   |
100526739,378708       900
     [6]     chr1 [10500404, 10500470]      +   |
100526739,378708       901
     [7]     chr1 [10511434, 10512060]      +   |
100526739,1325       904
     [8]     chr1 [16355621, 16355794]      +   |
1187,1188      1504
     [9]     chr1 [19923471, 19923603]      +   |
100532736,440574      1788
     ...      ...                  ...    ... ...
 ...       ...
  [3905]     chrY [26944570, 26944641]      -   |
57054,57135    274737
  [3906]     chrY [26946960, 26947031]      -   |
57054,57135    274738
  [3907]     chrY [26949403, 26949474]      -   |
57054,57135    274739
  [3908]     chrY [26950875, 26951014]      -   |
57054,57135    274740
  [3909]     chrY [26951104, 26951167]      -   |
57054,57135    274741
  [3910]     chrY [26951604, 26951655]      -   |
57054,57135    274742
  [3911]     chrY [26952216, 26952307]      -   |
57054,57135    274743
  [3912]     chrY [26952582, 26952728]      -   |
57054,57135    274744
  [3913]     chrY [26959330, 26959639]      -   |
57054,57135    274745

If you hop to any of these regions in your favorite genome browser,
you should find overlapping gene annotations there that share exact
exon boundaries across transcripts.

> The reason this is a problem is that I would like
> to create a 3 column output where I use the command:
>
> data.frame( geneID = unlist(annotDf[, 'gene_id']), exonID = annotDf[,
> 'exon_id'], exonCounts = exonCounts)
>
> Where exonCounts is generated with the countOverlaps() function using
> allExons and reads from a bam file.  This command is failing because it
> seems that the number of gene_ids is not pairing up with the exon_ids
> properly. Is there something I am missing here?

Hopefully this clarification helps?

And also:

> I am using 1.6.9 of GenomicFeatures, R is 2.14.2 and my txdb was built
> using the command:

You should upgrade to the latest version of R and bioconductor
packages to get the best functionality (and help).

Hope that helps,
-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact