[BioC] getting a list of genomic sequences from a list of gene names

Sat Jul 20 01:52:01 CEST 2013

Hi,

On 07/16/2013 06:40 PM, Michael Lawrence wrote:
> This is an interesting problem. I'm always trying to find simpler ways to
> do these things.  See below.
>
>
>
>
>
> On Tue, Jul 16, 2013 at 10:48 AM, Paul Shannon <pshannon at fhcrc.org> wrote:
>
>> To summarize our off-list discussion, and present our final suggested
>> answer, using a sample entrez geneID to demonstrate, we offer the code
>> shown below.
>> A somewhat lengthy exposition of the 9 steps involved will be found below,
>> following  the 18 lines of excutable code.
>>
>> Thanks for posing this question, Eric, and for your patience while I
>> worked up a reply.
>>
>>   - Paul
>>
>>
>>
>> library(Biostrings)
>> library(org.Hs.eg.db)
>> library (BSgenome.Hsapiens.UCSC.hg19)
>> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
>>
>> trak2 <- "66008"    # sample gene
>>
>> transcripts.grL <- transcriptsBy(txdb, by="gene")[trak2]
>> transcript.names <- mcols(unlist(transcripts.grL))$tx_name  # "uc002uyb.4"
>> "uc002uyc.2"
>>
>>
> It looks like we needed to get the transcripts out just to retrieve the
> transcript names for the gene. I think we can do this more directly with
> OrganismDb.
>
> transcript.names <- select(Homo.sapiens, trak2, "TXNAME", "ENTREZID")$TXNAME
>
> That syntax is very unintuitive to me, but it works.
> Now the fundamental gene structure is represented by the return value of
> exonsBy(),
> exonsByTx.grL <- exonsBy(Homo.sapiens)[transcript.names]
>
> Once we have that, we could go two routes:
> a) Derive the transcript bounds, retrieve the transcript sequence, and
> extract the sequence for the exons and introns.
> b) Derive the introns from the exons and query the sequence for both
> separately.

One difficulty with route (a) is that the coordinates in 'exonsByTx.grL'
are relative to the reference genome so you need to transform
them first to make them relative to the transcript sequences.
For transcripts on the plus strand it's just a shift(), but for
transcripts on the minus strand (which are going to be reverse-
complemented by getSeq()), it's going to be a little bit more
complicated. FWIW, and because it's also an opportunity for me to
introduce a tool I added recently to Biostrings), I'm going to try
to follow route (a) anyway. See below.

>
> For (a), we would get the transcript bounds:
> transcripts.gr <- unlist(range(exonsByTx.grL), use.names=FALSE)

One caveat when using range() on a GRangesList is that it doesn't 
necessarily generate 1 range per list element (e.g. if the transcript
contains exons from both strands, which is uncommon but it happens).
A quick sanity check before you unlist() is to do:

   tmp <- range(exonsByTx.grL)
   stopifnot(all(elementLengths(tmp) == 1))
   transcripts.gr <- unlist(tmp, use.names=FALSE)

> transcripts.seq <- getSeq(Hsapiens, transcripts.gr)
>
> But then we would need some way to extract sequence from a DNAStringSet by
> a GRangesList, returning a DNAStringSetList. This could be an enhancement
> of subseq,XVectorList, or maybe we need XStringViewsList.

Let's try this. Recently I added the extractAt() / replaceAt() combo
to Biostrings that can be used to do multiple substring extractions
from a DNAStringSet object. (Maybe those tools should be called
extractSeqAt() and replaceSeqAt().) Here is how it can be used to
extract the exon sequences from the transcript sequences. As I
mentioned above, the coordinates in 'exonsByTx.grL' are relative to
the reference genome so they first need to be transformed to be
relative to the transcript sequences in 'transcripts.seq' (which
have already been reverse-complemented for transcripts that are on
the minus strand). It's actually going to be easier if we extract
the transcript sequences without reverse-complementing them. The
trick is to set the strand of 'transcripts.gr' to "+" before passing
it to getSeq():

   strand(transcripts.gr) <- "+"
   transcripts.seq <- getSeq(Hsapiens, transcripts.gr)

Then the exon ranges need to be shifted so they are relative to
the transcript sequences in 'transcripts.seq':

   exon_rel_ranges_by_tx <- shift(ranges(exonsByTx.grL), 
-start(transcripts.gr) + 1)

I used ranges() on 'exonsByTx.grL' because I want an IRangesList,
not a GRangesList.

Then we can use extractAt():

   library(Biostrings)
   exon.seqs.by.transcript <- extractAt(transcripts.seq, 
at=exon_rel_ranges_by_tx)

Finally we reverse-complement the individual exon sequences if
they're coming from the minus strand:

   exons.seqs <- unlist(exon.seqs.by.transcript)
   on_minus_idx <- which(unlist(strand(exonsByTx.grL), use.names=FALSE) 
== "-")
   exons.seqs[on_minus_idx] <- reverseComplement(exons.seqs[on_minus_idx])

Of course all those efforts can be avoided by simply doing Paul's
'getSeq(Hsapiens, unlist(exonsByTx.grL))' which immediately gives
us what we want.

In an off-line conversation, Eric asked me if the exon ranges and
their sequences can be bundled together. Of course they can! :-)
Two options: either set the exon ranges as a metadata column of
the 'exons.seqs' object, or do it the otherway around i.e. set
the exons sequences as a metadata column of the GRanges object
containing the exon ranges. Here is how to do the latter:

   exons.gr <- unlist(exonsByTx.grL)
   mcols(exons.gr)$seq <- exons.seqs

Displaying 'exons.gr' is not very pretty but all the information is
here:

 > exons.gr[1:3]
GRanges with 3 ranges and 4 metadata columns:
              seqnames                 ranges strand |   exon_id   exon_name
                 <Rle>              <IRanges>  <Rle> | <integer> <character>
   uc002uyb.4     chr2 [202316073, 202316319]      - |     45437        <NA>
   uc002uyb.4     chr2 [202285140, 202285429]      - |     45436        <NA>
   uc002uyb.4     chr2 [202272126, 202272320]      - |     45435        <NA>
              exon_rank
              <integer>
   uc002uyb.4         1
   uc002uyb.4         2
   uc002uyb.4         3

                 seq

      <DNAStringSet>
   uc002uyb.4 
GCTGGGAGAGTGGCTCTCCTTTGGCTTCCCCAATTGTGTGGGGGCTGCCATTGACCCGGTGTCGCCGCAGAACCGAGGTCGCCGAGTGATGATGTTGTGAAGTCGCCCGCCTGTCCCTGCCACGCCCGGGCGGTTGCTGGCAGTGGGAGCAGCGGCAGCAGCTTCGGCTGCTGCTTTCAGGCTGCCGCTGCATTAGGGGCTTCCTGAGGAAGCGCGGGCGGACGACAGAGGATGCCGAACCACTCCA
   uc002uyb.4 
GTCATGACTGTCCAAAGTATGATAATCACATGAGAGTGCTCGTTGCTACGGATGTCATTTGACTCATCAGAGAAAATCTGTCTAAAAGAAAATATCCATGTGACCAAATCCATTTCATTATTGAATGGCTTGATGGATTTCCTTTACTCTGATTCATACCAAAGCTGTCCTTCTCAACCAAAGCAAGAAAGGATCCTGCATGAGTCAATCCCAGAATGCAATTTTTACATCACCAACAGGTGAAGAAAACCTCATGAATAGCAATCACAGAGACTCGGAGAGCATCACTG
   uc002uyb.4 

ATGTCTGCTCCAATGAGGATCTCCCTGAAGTTGAGCTGGTGAGTCTGCTAGAAGAACAACTACCACAGTATAGGCTAAAAGTAGACACTCTCTTTCTATATGAAAATCAAGACTGGACTCAGTCTCCACACCAGCGGCAGCATGCATCTGATGCTCTCTCTCCAGTCCTTGCTGAAGAGACTTTCCGTTACATGA
   ---
   seqlengths:
                     chr1                  chr2 ...        chrUn_gl000249
                249250621             243199373 ...                 38502

I'm aware that there is potential confusion between the many available
and somewhat redundant tools for extracting subsequences (getSeq,
subseq, and now extractAt). I have a long-term plan to unify those
things, or at least to unify getSeq() and extractAt() (or
extractSeqAt()). One interesting goal (and I know Thomas Girke
has been wanting this for a while -- thanks Thomas for the reminder
yesterday at the BioC conf!) is be to be able to use replaceSeqAt()
on a BSgenome object and a "diff" object to generate a modified
genome. The "diff" object could maybe be a GRanges object, a Vcf
object, or a VRanges object, etc...

Cheers,
H.

>
> For (b), we would get the introns:
> intronsByTx.grL <- psetdiff(transcripts.gr, exonsByTx.grL)
>
> And then if we had support for GRangesList in getSeq (we already have the
> code below!) the rest would be easy. But it would be less efficient, I
> think, than (a), depending on the performance of BSgenome.
>
> Anyway, it looks like some of this stuff should be considered for
> enhancements. In particular, enhancing getSeq with GRangesList support
> would be useful.
>
> Michael
>
>
>> intronsByTx.grL <- intronsByTranscript(txdb,
>> use.names=TRUE)[transcript.names]
>> print(elementLengths(intronsByTx.grL))
>>
>> intron.seqs <- getSeq(Hsapiens, unlist(intronsByTx.grL))
>> intron.seqs.by.transcript <- relist(intron.seqs, skeleton=intronsByTx.grL)
>> print(elementLengths(intron.seqs.by.transcript))
>>      # uc002uyb.4 uc002uyc.2
>>      #         15          7
>>
>> exonsByTx.grL <- exonsBy(txdb, by="tx", use.names=TRUE)[transcript.names]
>> print(elementLengths(exonsByTx.grL))
>>
>> exons.seq <- getSeq(Hsapiens, unlist(exonsByTx.grL))
>> exon.seqs.by.transcript <- relist(exons.seq, skeleton=exonsByTx.grL)
>> print(elementLengths(exon.seqs.by.transcript))
>>      #  uc002uyb.4 uc002uyc.2
>>      #          16          8
>>
>>
>> ---- The strategy
>>
>> 1) DNA sequence for the introns and exons of a gene, is best understood as
>>
>>     "sequence for the introns and exons for each known transcript of a gene"
>>
>> 2) So the first task is to identify all the transcripts associated with
>> the gene of interest
>> This will be represented as a GRangesList, where each element in the list
>> is
>>    a) named by the geneID
>>    b) contains a GRanges of the coordinates, name and id of each associated
>> transcript
>> All of this information comes from a Bioc "TxDB" object, which in this
>> case is our representation
>> of the UCSC "knownGene" table.
>>
>> 3) Once you have such a GRangesList, for one or more geneIDs, you want to
>> ignore (for now)
>> that gene information, and just extract the names of all of the
>> transcripts.  To do
>> this extraction, you must flatted (or "unlist") the GRangesList.
>>
>> 4) Now you have a simple character vector containing the names of all the
>> transcripts associated with your gene of interest.
>>
>> 5) Use those names to obtain, as a GRangesList structured by transcript
>> name, the
>> coordinates of all introns, then of all exons.   Don't be confused or put
>> off
>> that we provide similar methods with slighly dissimilar names:
>>
>>     intronsByTranscript()
>>     exonsBy()
>>
>> 6) You are almost ready to call getSeq () with the coordinates returned at
>> step 5.
>> But getSeq will not accept a GRangesList.  It does accept a GRanges.  So
>> you
>> must unlist the intron and exon coordinates, shedding the "by transcript"
>> structure.
>>
>> 7) getSeq returns a DNAStringSet corresponding to the GRanges (actually,
>> the unlisted GRangesList)
>> you supplied.
>>
>> 8) This DNAstringSet now needs to be relisted, so that the by-transcript
>> organization is
>> imposed, producing a DNAStringSetList whose elements each have a
>> transcript name.
>>
>> 9) All of these intron and exon sequences can associated with the geneID/s
>> you started with
>> by referring to the "transcripts.grL" you created at the very start.
>>
>>
>> On Jun 19, 2013, at 5:33 PM, Eric Foss [guest] wrote:
>>
>>>
>>> Is there a way for me to use Bioconductor to take a list of gene names
>> and give me back a list of genomic sequences, preferably with the exons and
>> introns easily differentiable (e.g. exons in upper case and introns in
>> lower case)?
>>>
>>> Thank you.
>>>
>>> Eric
>>>
>>>
>>> -- output of sessionInfo():
>>>
>>> not relevant
>>>
>>> --
>>> Sent via the guest posting facility at bioconductor.org.
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319