[BioC] exon genomic coordinates

Tue Nov 26 00:30:35 CET 2013

Hi Michael, John,

I was hoping that making a TranscriptDb object from the CCDS track
would help here:

   library(GenomicFeatures)
   txdb <- makeTranscriptDbFromUCSC("hg19", "ccdsGene")

but unfortunately it's not easy to link the exons in 'txdb' to
Ensembl transcript or gene ids because 'txdb' lacks this information:

   > ex <- exons(txdb, columns=c("exon_id", "tx_name", "gene_id"))
   > head(ex)
   GRanges with 6 ranges and 3 metadata columns:
       seqnames           ranges strand |   exon_id         tx_name
          <Rle>        <IRanges>  <Rle> | <integer> <CharacterList>
   [1]     chr1 [ 69091,  70008]      + |         1     CCDS30547.1
   [2]     chr1 [367659, 368597]      + |         2     CCDS41220.1
   [3]     chr1 [861322, 861393]      + |         3         CCDS2.2
   [4]     chr1 [865535, 865716]      + |         4         CCDS2.2
   [5]     chr1 [866419, 866469]      + |         5         CCDS2.2
   [6]     chr1 [871152, 871276]      + |         6         CCDS2.2
               gene_id
       <CharacterList>
   [1]              NA
   [2]              NA
   [3]              NA
   [4]              NA
   [5]              NA
   [6]              NA
   ---
   seqlengths:
                    chr1                 chr2 ...       chrUn_gl000249
               249250621            243199373 ...                38502

Querying Ensembl directly to make a TranscriptDb object:

   library(GenomicFeatures)
   txdb <- makeTranscriptDbFromBiomart()  # takes a while! (40 min. for me)
                                          # this used to be much faster,
                                          # don't know what's going on
   saveDb(txdb, file="hsapiens_gene_ensembl_txdb.sqlite")  # save for later re-use

   KIT_exons <- exons(txdb, vals=list(gene_id="ENSG00000157404"),
                      columns=c("exon_name", "tx_name", "gene_id"))

   tx_names <- unique(unlist(mcols(KIT_exons)$tx_name))
   # tx_names  # 4 transcripts
   # [1] "ENST00000288135" "ENST00000412167" "ENST00000514582" "ENST00000512959"

   ex_by_tx <- exonsBy(txdb, by="tx", use.names=TRUE)
   KIT_ex_by_tx <- ex_by_tx[tx_names]

Transcript lengths:

   > sum(width(KIT_ex_by_tx))
   ENST00000288135 ENST00000412167 ENST00000514582 ENST00000512959
              5186            3470             538             746

Pick the longest:

   > KIT_ex_by_tx[["ENST00000288135"]]
   GRanges with 21 ranges and 3 metadata columns:
        seqnames               ranges strand   |   exon_id       exon_name
           <Rle>            <IRanges>  <Rle>   | <integer>     <character>
    [1]        4 [55524085, 55524248]      +   |    156828 ENSE00001905199
    [2]        4 [55561678, 55561947]      +   |    156830 ENSE00001032350
    [3]        4 [55564450, 55564731]      +   |    156832 ENSE00001074448
    [4]        4 [55565796, 55565932]      +   |    156833 ENSE00001121859
    [5]        4 [55569890, 55570058]      +   |    156834 ENSE00001074426
    ...      ...                  ...    ... ...       ...             ...
   [17]        4 [55599236, 55599358]      +   |    156850 ENSE00001074435
   [18]        4 [55602664, 55602775]      +   |    156852 ENSE00001074442
   [19]        4 [55602887, 55602986]      +   |    156853 ENSE00001224349
   [20]        4 [55603341, 55603446]      +   |    156854 ENSE00001074415
   [21]        4 [55604595, 55606881]      +   |    156856 ENSE00001898693
        exon_rank
        <integer>
    [1]         1
    [2]         2
    [3]         3
    [4]         4
    [5]         5
    ...       ...
   [17]        17
   [18]        18
   [19]        19
   [20]        20
   [21]        21
   ---
   seqlengths:
                    1                 2 ...            LRG_98            LRG_99
            249250621         243199373 ...             18750             13294
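
If you only have the flat table that getBM() returns, the same "sum the exon widths, then take the longest" logic can be sketched in base R. This is only an illustration: the data frame below is a hand-typed subset of the getBM() output quoted further down (the two short KIT transcripts), and the names exons_df, tx_len and longest_tx are made up for the example.

```r
# Toy subset of the getBM() output from this thread: the exons of two of
# the four KIT transcripts. The layout of this data frame is an assumption
# made for illustration only.
exons_df <- data.frame(
  ensembl_transcript_id = c(rep("ENST00000514582", 2),
                            rep("ENST00000512959", 4)),
  exon_chrom_start = c(55524106, 55561678,
                       55595458, 55597494, 55598037, 55599236),
  exon_chrom_end   = c(55524248, 55562072,
                       55595651, 55597585, 55598164, 55599567),
  stringsAsFactors = FALSE
)

# Exon coordinates are 1-based and inclusive, so width = end - start + 1.
exons_df$width <- exons_df$exon_chrom_end - exons_df$exon_chrom_start + 1

# Transcript length = sum of the widths of its exons.
tx_len <- tapply(exons_df$width, exons_df$ensembl_transcript_id, sum)

# Name of the longest transcript within this subset.
longest_tx <- names(tx_len)[which.max(tx_len)]
```

On this subset the lengths come out as 538 and 746, which agrees with the sum(width(KIT_ex_by_tx)) output above. With the real objects, names(which.max(sum(width(KIT_ex_by_tx)))) should do the equivalent over all 4 transcripts.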

Cheers,
H.

On 11/25/2013 02:19 PM, Michael Lawrence wrote:
> You could use rtracklayer to grab the CCDS track from UCSC. Might be some
> way with Biomart from Ensembl.
>
>
> On Mon, Nov 25, 2013 at 2:06 PM, array chip <arrayprofile at yahoo.com> wrote:
>
>> Thanks Michael. How do I restrict to consensus CDS?
>>
>> John
>>
>>    ------------------------------
>>   *From:* Michael Lawrence <lawrence.michael at gene.com>
>> *To:* array chip <arrayprofile at yahoo.com>
>> *Cc:* "bioconductor at r-project.org" <bioconductor at r-project.org>
>> *Sent:* Monday, November 25, 2013 1:55 PM
>>
>> *Subject:* Re: [BioC] exon genomic coordinates
>>
>> Well, one approach is to take the longest one. That's what UCSC uses to
>> call its "canonical transcripts". And restrict to the consensus CDS (CCDS).
>>
>>
>> On Mon, Nov 25, 2013 at 1:12 PM, array chip <arrayprofile at yahoo.com> wrote:
>>
>> Hi all, I have another question about exon genomic coordinates:
>>
>> library(biomaRt)
>> ensembl = useMart("ensembl")
>>
>> ensembl = useDataset("hsapiens_gene_ensembl",mart=ensembl)
>>
>> getBM(attributes = c("external_gene_id", "chromosome_name",
>>                      "exon_chrom_start", "exon_chrom_end",
>>                      "ensembl_transcript_id", "rank"),
>>       filters = 'hgnc_symbol', values = c("KIT"), mart = ensembl)
>>
>>    chromosome_name exon_chrom_start exon_chrom_end ensembl_transcript_id rank
>> 1                4         55524085       55524248       ENST00000412167    1
>> 2                4         55561678       55561947       ENST00000412167    2
>> 3                4         55564450       55564731       ENST00000412167    3
>> 4                4         55565796       55565932       ENST00000412167    4
>> 5                4         55569890       55570058       ENST00000412167    5
>> 6                4         55573264       55573453       ENST00000412167    6
>> 7                4         55575590       55575705       ENST00000412167    7
>> 8                4         55589750       55589864       ENST00000412167    8
>> 9                4         55592023       55592204       ENST00000412167    9
>> 10               4         55593384       55593490       ENST00000412167   10
>> 11               4         55593582       55593708       ENST00000412167   11
>> 12               4         55593989       55594093       ENST00000412167   12
>> 13               4         55594177       55594287       ENST00000412167   13
>> 14               4         55595501       55595651       ENST00000412167   14
>> 15               4         55597494       55597585       ENST00000412167   15
>> 16               4         55598037       55598164       ENST00000412167   16
>> 17               4         55599236       55599358       ENST00000412167   17
>> 18               4         55602664       55602775       ENST00000412167   18
>> 19               4         55602887       55602986       ENST00000412167   19
>> 20               4         55603341       55603446       ENST00000412167   20
>> 21               4         55604595       55605177       ENST00000412167   21
>> 22               4         55524085       55524248       ENST00000288135    1
>> 23               4         55561678       55561947       ENST00000288135    2
>> 24               4         55564450       55564731       ENST00000288135    3
>> 25               4         55565796       55565932       ENST00000288135    4
>> 26               4         55569890       55570058       ENST00000288135    5
>> 27               4         55573264       55573453       ENST00000288135    6
>> 28               4         55575590       55575705       ENST00000288135    7
>> 29               4         55589750       55589864       ENST00000288135    8
>> 30               4         55592023       55592216       ENST00000288135    9
>> 31               4         55593384       55593490       ENST00000288135   10
>> 32               4         55593582       55593708       ENST00000288135   11
>> 33               4         55593989       55594093       ENST00000288135   12
>> 34               4         55594177       55594287       ENST00000288135   13
>> 35               4         55595501       55595651       ENST00000288135   14
>> 36               4         55597494       55597585       ENST00000288135   15
>> 37               4         55598037       55598164       ENST00000288135   16
>> 38               4         55599236       55599358       ENST00000288135   17
>> 39               4         55602664       55602775       ENST00000288135   18
>> 40               4         55602887       55602986       ENST00000288135   19
>> 41               4         55603341       55603446       ENST00000288135   20
>> 42               4         55604595       55606881       ENST00000288135   21
>> 43               4         55524106       55524248       ENST00000514582    1
>> 44               4         55561678       55562072       ENST00000514582    2
>> 45               4         55595458       55595651       ENST00000512959    1
>> 46               4         55597494       55597585       ENST00000512959    2
>> 47               4         55598037       55598164       ENST00000512959    3
>> 48               4         55599236       55599567       ENST00000512959    4
>>
>> This will give many versions of genomic coordinates. For example, KIT has
>> 3 sets of exons. I think these different versions may refer to different
>> splicing variants/isoforms. Is there a "default"/"standard" set of exons
>> for each gene? And how do I know which one it is?
>>
>> Thanks
>>
>> John
>>
>>
>>
>>
>> ________________________________
>>
>> To: Hans-Rudolf Hotz <hrh at fmi.ch>; "bioconductor at r-project.org" <
>> bioconductor at r-project.org>
>> Sent: Monday, November 25, 2013 12:41 PM
>> Subject: Re: [BioC] exon genomic coordinates
>>
>>
>> Hi,
>>
>> I am trying to use biomaRt to retrieve the exon coordinates using the
>> example provided below:
>>
>> library(biomaRt)
>> ensembl = useMart("ensembl")
>>
>> ensembl = useDataset("hsapiens_gene_ensembl",mart=ensembl)
>>
>> getBM(attributes =
>> c("chromosome_name","exon_chrom_start","exon_chrom_end","rank"),
>> filters = 'hgnc_symbol', values=c("KIT"),mart=ensembl)
>>
>> The above works fine. However, when I tried to add "hgnc_symbol" to the
>> attributes list, it gave me an error:
>>
>> getBM(attributes =
>>
>> c("hgnc_symbol","chromosome_name","exon_chrom_start","exon_chrom_end","rank"),
>> filters = 'hgnc_symbol', values=c("KIT"),mart=ensembl)
>>
>>
>> Error in getBM(attributes = c("hgnc_symbol", "chromosome_name",
>> "exon_chrom_start",  :
>>    Query ERROR: caught BioMart::Exception::Usage: Attributes from multiple
>> attribute pages are not allowed
>>
>> But if I keep "hgnc_symbol" in the attributes list and
>> remove "exon_chrom_start" and "exon_chrom_end", then it works again:
>> getBM(attributes =
>> c("hgnc_symbol","chromosome_name","ensembl_transcript_id","rank"),
>> filters = 'hgnc_symbol', values=c("KIT"),mart=ensembl)
>>
>> Can anyone tell me why that is?
>>
>> Thanks
>>
>> John
>>
>>
>> ________________________________
>> From: Hans-Rudolf Hotz <hrh at fmi.ch>
>> Sent: Thursday, November 21, 2013 5:38 AM
>> Subject: Re: [BioC] exon genomic coordinates
>>
>>
>> Hi John
>>
>> You can use the BioMart database, which you can access with the biomaRt
>> package, to get all exons for all transcripts of a given gene, e.g.:
>>
>> library(biomaRt)
>> ensembl = useMart("ensembl")
>> #assuming you are interested in mouse
>> mouse.ensembl = useDataset("mmusculus_gene_ensembl",mart=ensembl)
>>
>> getBM(attributes =
>>
>> c("chromosome_name","exon_chrom_start","exon_chrom_end","ensembl_exon_id","ensembl_transcript_id","ensembl_gene_id"),
>> filters = 'mgi_symbol', values=c("KIT"),mart=mouse.ensembl)
>>
>>
>> Hope this helps
>>
>> Hans-Rudolf
>>
>>
>> On 11/21/2013 09:14 AM, array chip wrote:
>>> Hi,
>>>
>>>
>>> Can anyone suggest how to retrieve the genomic coordinates for all exons
>>> for a given gene by, say, gene symbol? For example, how to retrieve the
>>> coordinates for all 21 exons for gene KIT?
>>>
>>> Thanks
>>>
>>> John
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319