[BioC] TranscriptDb and genomicranges questions

Thu Aug 12 19:54:49 CEST 2010

Hi Vincent,

On 08/12/2010 05:17 AM, Vincent Detours wrote:
> Indeed I was missing these ends. Thanks for the update!
> Now I tryed this to avoids introducing arbitrary strand information:
> ----
>  > strand(tr) <- "*"
>  > gaps(tr)[1:5]
> GRanges with 5 ranges and 0 elementMetadata values
> seqnames ranges strand |
> <Rle> <IRanges> <Rle> |
> [1] chr1 [ 1, 247249719] + |
> [2] chr1 [ 1, 247249719] - |
> [3] chr1 [ 1, 4224] * |
> [4] chr1 [19234, 24474] * |
> [5] chr1 [25945, 58953] * |
>
> seqlengths
> chr1 chr1_random chr10 ... chrX_random chrY
> 247249719 1663265 135374737 ... 1719168 57772954
>  >
> ----
> I don't understand why 'gaps' returns ranges 1 and 2.

Yes, setting the strand to "*" for all transcripts works too.
The "gaps" method for GRanges will compute the gaps for the 3
strand levels (+, -, *) independently of whether there are
transcripts on the level. In your case there are no transcripts
on the + and - levels so the gap you see spans the entire chromosome.

Hope this helps.

Cheers,
H.

>
> Vincent
>
>
>
> On Aug 11, 2010, at 10:27 PM, Hervé Pagès wrote:
>
>> Hi Vincent,
>>
>> Because of the semantic of the "gaps" method for IRanges/IRangesList
>> object, you won't get the gaps that are at the 5' and 3' ends of the
>> chromosomes when using the IRanges/IRangesList trick.
>>
>> An easier way to extract the gaps between transcripts without
>> considering their strand is to artificially "project" all the
>> transcripts on the + strand:
>>
>> txdb <- makeTranscriptDbFromUCSC(genome="hg18", tablename="ensGene")
>> tr <- transcripts(txdb)
>> strand(tr) <- "+"
>>
>> Then:
>>
>> > gaps(tr)[1:5]
>> GRanges with 5 ranges and 0 elementMetadata values
>> seqnames ranges strand |
>> <Rle> <IRanges> <Rle> |
>> [1] chr1 [ 1, 1872] + |
>> [2] chr1 [ 3534, 4273] + |
>> [3] chr1 [19670, 20228] + |
>> [4] chr1 [20367, 24416] + |
>> [5] chr1 [25945, 42911] + |
>>
>> seqlengths
>> chr1 chr1_random chr10 ... chrX_random chrY
>> 247249719 1663265 135374737 ... 1719168 57772954
>>
>> Note the gap at the 5' end of chr1.
>>
>> Of course now your 'tr' object is sort of "broken" so you need to
>> remake it if you want to use it for downstream analyses where the
>> strand actually matters.
>>
>> Cheers,
>> H.
>>
>>
>> On 07/16/2010 03:47 PM, Patrick Aboyoun wrote:
>>> Vincent,
>>> I am cc'ing the Bioconductor mailing list with the response to your
>>> e-mail so others can access it.
>>>
>>> The gaps,GRanges-method is strand specific, so when you pass a GRanges
>>> object containing the transcript bounds into gaps, you will get the gaps
>>> on the positive strand, the negative strand, and the both strand
>>> category. This is why you found a gap between transcripts on the
>>> positive strand, while there was a transcript at some of those same
>>> positions on the negative strand. If you want a strandless
>>> representation of the transcripts, I recommend using the IRanges and
>>> IRangesList classes. Here is code that achieves what I think you are
>>> looking for:
>>>
>>>
>>> > library(GenomicFeatures)
>>> > txdb<- makeTranscriptDbFromUCSC(genome = "hg18", tablename = "ensGene")
>>> > tx<- transcripts(txdb)
>>> > unstrndTx<- split(ranges(tx), seqnames(tx))
>>> > unstrndGaps<- unlist(gaps(unstrndTx))
>>> > intAlt<- GRanges(seqnames = factor(names(unstrndGaps),
>>> names(seqlengths(tx))),
>>> + ranges = unname(unstrndGaps),
>>> + seqlengths = seqlengths(tx))
>>>
>>> > intAlt[1:3]
>>> GRanges with 3 ranges and 0 elementMetadata values
>>> seqnames ranges strand |
>>> <Rle> <IRanges> <Rle> |
>>> [1] chr1 [ 3534, 4273] * |
>>> [2] chr1 [19670, 20228] * |
>>> [3] chr1 [20367, 24416] * |
>>>
>>> seqlengths
>>> chr1 chr1_random chr10 ... chrX_random chrY
>>> 247249719 1663265 135374737 ... 1719168 57772954
>>>
>>> > sessionInfo()
>>> R version 2.12.0 Under development (unstable) (2010-07-01 r52425)
>>> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>>>
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] GenomicFeatures_1.1.6 GenomicRanges_1.1.16 IRanges_1.7.11
>>>
>>> loaded via a namespace (and not attached):
>>> [1] Biobase_2.9.0 biomaRt_2.5.1 Biostrings_2.17.24
>>> BSgenome_1.17.5
>>> [5] DBI_0.2-5 RCurl_1.4-2 RSQLite_0.9-1
>>> rtracklayer_1.9.3
>>> [9] tools_2.12.0 XML_3.1-0
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 7/16/10 6:23 AM, Vincent Detours wrote:
>>>> Dear Patrick,
>>>>
>>>> I am learning your packages genomicranges and transcriptdb, which
>>>> I find impressively efficients. Here are two questions and perhaps a
>>>> request.
>>>>
>>>> In transcriptdb, I get intergenic regions with:
>>>> -----
>>>>> txdb<- makeTranscriptDbFromUCSC(genome = "hg18", tablename = "ensGene")
>>>>> tr<- transcripts(txdb)
>>>>> int<- gaps(tr)
>>>>> int[1:3]
>>>> GRanges with 3 ranges and 0 elementMetadata values
>>>> seqnames ranges strand |
>>>> <Rle> <IRanges> <Rle> |
>>>> [1] chr1 [ 1, 1872] + |
>>>> [2] chr1 [ 3534, 20228] + |
>>>> [3] chr1 [20367, 42911] + |
>>>> ...
>>>> -----
>>>> Now I type chr1:3,534-20,228 in the UCSC browser, hg18, and I see
>>>> that there is a transcript within this interval:
>>>> http://genome.ucsc.edu/cgi-bin/hgc?hgsid=165451654&o=4273&t=19669&g=ensGene&i=ENST00000326632
>>>> <http://genome.ucsc.edu/cgi-bin/hgc?hgsid=165451654&o=4273&t=19669&g=ensGene&i=ENST00000326632>
>>>> <http://genome.ucsc.edu/cgi-bin/hgc?hgsid=165451654&o=4273&t=19669&g=ensGene&i=ENST00000326632
>>>> <http://genome.ucsc.edu/cgi-bin/hgc?hgsid=165451654&o=4273&t=19669&g=ensGene&i=ENST00000326632>>
>>>> This transcript is indeed in the tr object
>>>> -------
>>>>> which(elementMetadata(tr)[,"tx_name"]=="ENST00000326632")
>>>> [1] 3880
>>>>> tr[3880,]
>>>> GRanges with 1 range and 2 elementMetadata values
>>>> seqnames ranges strand | tx_id tx_name
>>>> <Rle> <IRanges> <Rle> |<integer> <character>
>>>> [1] chr1 [4274, 19669] - | 1731 ENST00000326632
>>>>>
>>>> -------
>>>> Is there a bug in 'gaps', or am I missing something about how it works?
>>>>
>>>> I am using the development version of the packages.
>>>>
>>>> Thank you for your help and great software.
>>>>
>>>> Vincent Detours, Ph. D.
>>>> http://homepages.ulb.ac.be/~vdetours/
>>>> <http://homepages.ulb.ac.be/%7Evdetours/>
>>>>
>>>> IRIBHM - Université Libre de Bruxelles
>>>> Bldg C, room C.4.116
>>>> ULB, Campus Erasme, CP602
>>>> 808 route de Lennik
>>>> B-1070 Brussels
>>>> Belgium
>>>>
>>>> Phone: +32-2-555 4220
>>>> Fax: +32-2-555 4655
>>>> Skype: vdetours
>>>> E-mail: vdetours at ulb.ac.be<http://ulb.ac.be>
>>>>
>>>>
>>>>
>>>
>>>
>>> [[alternative HTML version deleted]]
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch <mailto:Bioconductor at stat.math.ethz.ch>
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M2-B876
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>> Phone: (206) 667-5791
>> Fax: (206) 667-1319
>
>
>
> Vincent Detours, Ph. D.
> http://homepages.ulb.ac.be/~vdetours/
>
> IRIBHM - Université Libre de Bruxelles
> Bldg C, room C.4.116
> ULB, Campus Erasme, CP602
> 808 route de Lennik
> B-1070 Brussels
> Belgium
>
> Phone: +32-2-555 4220
> Fax: +32-2-555 4655
> Skype: vdetours
> E-mail: vdetours at ulb.ac.be <http://ulb.ac.be>
>
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319