[BioC] Extract Exon Features from GenomicFeatures Leads to Inconsistent Pairing of Gene and Exon IDs
Hervé Pagès
hpages at fhcrc.org
Tue Feb 19 20:49:38 CET 2013
Hi Fong,
On 01/13/2013 10:36 AM, Fong Chun Chan wrote:
> Thanks Steve. That clarifies what is going on to me. I suppose the only way
> to solve this to take the first gene id from an index that contains > 1
> gene ids.
Problem with this solution is that if exon1 belongs to gene1 and gene2,
then gene2 will have a missing exon in you final data.frame. A better
solution is to create 2 rows for exon1 in the data.frame:
gene1 exon1
gene2 exon1
If 'annotDf' is your DataFrame with the gene_id and exon_id cols (of
types *List and integer, respectively), you can obtain the data.frame
with:
gene_id <- annotDf[, 'gene_id']
exon_id <- annotDf[, 'exon_id']
df <- data.frame(geneID=unlist(gene_id),
exonID=rep(exon_id, elementLengths(gene_id)))
Note that there is no loss of information with this approach i.e. it's
possible (and very easy) to reconstruct 'annotDf' from 'df'.
Hope this helps,
H.
>
> Fong
>
>
> On Sun, Jan 13, 2013 at 2:36 AM, Steve Lianoglou <
> mailinglist.honeypot at gmail.com> wrote:
>
>> Hi Fong,
>>
>> On Sun, Jan 13, 2013 at 3:09 AM, Fong Chun Chan <fongchun at alumni.ubc.ca>
>> wrote:
>>> Hi,
>>>
>>> I was hoping someone could explain this weird situation to me when I am
>>> trying to extract exon information using the GenomicFeatures Bioconductor
>>> package. Here is what I've done:
>>>
>>> txdb <- loadFeatures(txdbFile)
>>> allExons <- exons(txdb, columns = c('gene_id', 'exon_id'))
>>> annotDf <- values(allExons)
>>> dim(annotDf)
>>>
>>> [1] 541825 2
>>>
>>> According to this code, I've extracted a total of 541825 exons which are
>>> associated with x number of genes. Now when I do the following:
>>>
>>> length(unlist(annotDf[, 'gene_id']))
>>>
>>> [1] 546664
>>>
>>> How is it possible that there are suddenly an additional 4839 genes added
>>> to the 'gene_id' column?
>>
>> The first clue is that the gene_id column is stored as a *List type,
>> which you correctly realized because you are `unlisting` it it to get
>> its length. The reason that it's a list is so that one could store
>> more than one element per index, otherwise they would have just used a
>> character (or factor) vector.
>>
>> So, what you can do is to see what's going on here -- I am using the
>> latest version of R and the "TxDb.Hsapiens.UCSC.hg19.knownGene"
>> package, so things might not be exactly the same for you, but I said
>> up my vars to mimc yours, so:
>>
>> R> gene.ids <- annotDf$gene_id
>> R> counts <- elementLengths(gene.ids) ## number of elements per bin
>> R> table(counts)
>> counts
>> 1 2 3 4 5 6 9 15 22
>> 282939 3788 101 7 6 1 4 3 3
>>
>> So we see that several gene ids can be assigned to one exon. You might
>> as why, and I reckon the answer is that the exons are defined as
>> "unique" simply by their chr,strand,start,end combo and several genes
>> (or transcripts) might share the same exon (in "genomic space"). Let's
>> see:
>>
>> R> allExons[elementLengths(gene.ids) > 1]
>> GRanges with 3913 ranges and 2 metadata columns:
>> seqnames ranges strand |
>> gene_id exon_id
>> <Rle> <IRanges> <Rle> |
>> <CompressedCharacterList> <integer>
>> [1] chr1 [ 324288, 324345] + |
>> 100132287,100133331 12
>> [2] chr1 [10490159, 10490625] + |
>> 100526739,378708 897
>> [3] chr1 [10490804, 10491458] + |
>> 100526739,378708 898
>> [4] chr1 [10493899, 10494022] + |
>> 100526739,378708 899
>> [5] chr1 [10494714, 10494747] + |
>> 100526739,378708 900
>> [6] chr1 [10500404, 10500470] + |
>> 100526739,378708 901
>> [7] chr1 [10511434, 10512060] + |
>> 100526739,1325 904
>> [8] chr1 [16355621, 16355794] + |
>> 1187,1188 1504
>> [9] chr1 [19923471, 19923603] + |
>> 100532736,440574 1788
>> ... ... ... ... ...
>> ... ...
>> [3905] chrY [26944570, 26944641] - |
>> 57054,57135 274737
>> [3906] chrY [26946960, 26947031] - |
>> 57054,57135 274738
>> [3907] chrY [26949403, 26949474] - |
>> 57054,57135 274739
>> [3908] chrY [26950875, 26951014] - |
>> 57054,57135 274740
>> [3909] chrY [26951104, 26951167] - |
>> 57054,57135 274741
>> [3910] chrY [26951604, 26951655] - |
>> 57054,57135 274742
>> [3911] chrY [26952216, 26952307] - |
>> 57054,57135 274743
>> [3912] chrY [26952582, 26952728] - |
>> 57054,57135 274744
>> [3913] chrY [26959330, 26959639] - |
>> 57054,57135 274745
>>
>> If you hop to any of these regions in your favorite genome browser,
>> you should find overlapping gene annotations there that share exact
>> exon boundaries across transcripts.
>>
>>> The reason this is a problem is that I would like
>>> to create a 3 column output where I use the command:
>>>
>>> data.frame( geneID = unlist(annotDf[, 'gene_id']), exonID = annotDf[,
>>> 'exon_id'], exonCounts = exonCounts)
>>>
>>> Where exonCounts is generated with the countOverlaps() function using
>>> allExons and reads from a bam file. This command is failing because it
>>> seems that the number of gene_ids is not pairing up with the exon_ids
>>> properly. Is there something I am missing here?
>>
>> Hopefully this clarification helps?
>>
>> And also:
>>
>>> I am using 1.6.9 of GenomicFeatures, R is 2.14.2 and my txdb was built
>>> using the command:
>>
>> You should upgrade to the latest version of R and bioconductor
>> packages to get the best functionality (and help).
>>
>> Hope that helps,
>> -steve
>>
>> --
>> Steve Lianoglou
>> Graduate Student: Computational Systems Biology
>> | Memorial Sloan-Kettering Cancer Center
>> | Weill Medical College of Cornell University
>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list