[BioC] iranges

Mon Aug 11 18:21:19 CEST 2014

Hi,

On 08/09/14 04:37, carol white wrote:
> 1- suppose, I have seq with start and end that for some seq start > end
> and for some other start < end. can I create a GRanges with strand being
> defined as + if start > end and - otherwise and take start and end as
> they (on reverse strand, start > end and on forward, start < end) or
> should I swap start and end?

The convention for storing negative ranges in a GRanges object is the 
same a positive; smallest ranges first. For example, the second and 
third elements of 'tx':

library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
tx <- transcriptsBy(TxDb.Dmelanogaster.UCSC.dm3.ensGene, "gene")
 > tx[2:3]
GRangesList of length 2:
$FBgn0000008
GRanges with 3 ranges and 2 metadata columns:
       seqnames               ranges strand |     tx_id     tx_name
          <Rle>            <IRanges>  <Rle> | <integer> <character>
   [1]    chr2R [18024494, 18060339]      + |      7681 FBtr0100521
   [2]    chr2R [18024496, 18060346]      + |      7682 FBtr0071763
   [3]    chr2R [18024938, 18060346]      + |      7683 FBtr0071764

$FBgn0000014
GRanges with 4 ranges and 2 metadata columns:
       seqnames               ranges strand | tx_id     tx_name
   [1]    chr3R [12632936, 12655767]      - | 21863 FBtr0306337
   [2]    chr3R [12633349, 12653845]      - | 21864 FBtr0083388
   [3]    chr3R [12633349, 12655300]      - | 21865 FBtr0083387
   [4]    chr3R [12633349, 12655474]      - | 21866 FBtr0300485

IRanges / GRanges don't hold negative width ranges so you can't create a 
range with start > width.

 > GRanges("chr1", IRanges(10, 5))
Error in .Call2("solve_user_SEW0", start, end, width, PACKAGE = 
"IRanges") :
   solving row 1: negative widths are not allowed

The one exception is a zero-width range but that's not applicable here:
 > width(GRanges("chr1", IRanges(4,3)))
[1] 0

>
> 2- if I use resize with your example, why the start of the output of
> resize is the same gene's start although I take 10bp from the end?

resize() is strand-aware. In the GRanges the ranges are ordered left to 
right (smallest to largest). Because transcription is from 3' to 5' for 
neg ranges, these end values

 > end(gene)
[1] 12655767 12653845 12655300 12655474

are considered the start values. You can see the difference when we 
change the strand.

strand(gene) <- "+"
 > resize(gene, 10, fix="end")
GRanges with 4 ranges and 2 metadata columns:
       seqnames               ranges strand |     tx_id     tx_name
          <Rle>            <IRanges>  <Rle> | <integer> <character>
   [1]    chr3R [12655758, 12655767]      + |     21863 FBtr0306337
   [2]    chr3R [12653836, 12653845]      + |     21864 FBtr0083388
   [3]    chr3R [12655291, 12655300]      + |     21865 FBtr0083387
   [4]    chr3R [12655465, 12655474]      + |     21866 FBtr0300485

>
> resize(gene, 10, fix="end")
> GRanges with 4 ranges and 2 metadata columns:
>        seqnames               ranges strand |     tx_id     tx_name
>           <Rle>            <IRanges>  <Rle> | <integer> <character>
>    [1]    chr3R [12632936, 12632945]      - |     21863 FBtr0306337
>    [2]    chr3R [12633349, 12633358]      - |     21864 FBtr0083388
>    [3]    chr3R [12633349, 12633358]      - |     21865 FBtr0083387
>    [4]    chr3R [12633349, 12633358]      - |     21866 FBtr0300485
>    ---
> seqlengths:
>         chr2L     chr2R     chr3L     chr3R ...   chrXHet   chrYHet
> chrUextra
>      23011544  21146708  24543557  27905053 ...    204112    347038
> 29004656
>> gene
> GRanges with 4 ranges and 2 metadata columns:
>        seqnames               ranges strand |     tx_id     tx_name
>           <Rle>            <IRanges>  <Rle> | <integer> <character>
> [1]    chr3R [12632936, 12655767]      - |     21863 FBtr0306337
>    [2]    chr3R [12633349, 12653845]      - |     21864 FBtr0083388
>    [3]    chr3R [12633349, 12655300]      - |     21865 FBtr0083387
>    [4]    chr3R [12633349, 12655474]      - |     21866 FBtr0300485
>    ---
>    seqlengths:
>         chr2L     chr2R     chr3L     chr3R ...   chrXHet   chrYHet
> chrUextra
>      23011544  21146708  24543557  27905053 ...    204112    347038
> 29004656
> ####################################################
> 3- why do I get err ms in using narrow
>
>   narrow(gene, width=10, end= end(ranges(gene)))
> Error in .Call2("solve_user_SEW", refwidths, start, end, width,
> translate.negative.coord,  :
>    solving row 1: 'allow.nonnarrowing' is FALSE and the supplied end
> (12655767) is > refwidth

'start' and 'end' are integers specifying the distance from the current 
start and end. 'end = -3' defines the end as 3 less than the current 
end; 'start = 5' defines the start as 5 greater than the current start.

It's tricky to get ranges of length 10 anchored at the end with narrow():

newStart <- (end(gene) - 9) - (start(gene) - 1)
narrowGene <- narrow(gene, start = newStart)
width(narrowGene)
 > width(narrowGene)
[1] 10 10 10 10

Several functions on the inter-range-methods man page are similar and 
you can often get the same result with different functions. In this 
case, it may be more straightforward to use restrict():

 > restrictGene <- restrict(gene, start=end(gene) - 9, end=end(gene))
 > width(restrictGene)
[1] 10 10 10 10
 > restrictGene
GRanges with 4 ranges and 2 metadata columns:
       seqnames               ranges strand |     tx_id     tx_name
          <Rle>            <IRanges>  <Rle> | <integer> <character>
   [1]    chr3R [12655758, 12655767]      - |     21863 FBtr0306337
   [2]    chr3R [12653836, 12653845]      - |     21864 FBtr0083388
   [3]    chr3R [12655291, 12655300]      - |     21865 FBtr0083387
   [4]    chr3R [12655465, 12655474]      - |     21866 FBtr0300485

Valerie

>
> Regards,
>
>
> On Friday, August 8, 2014 6:11 PM, Valerie Obenchain
> <vobencha at fhcrc.org> wrote:
>
>
> Hi,
>
> Please 'reply all' when responding so communication stays on the list.
>
> If you are working with stranded ranges you should use the GRanges
> container. IRanges is not strand-aware and does not have a strand
> argument. You can see the function signature on the man page by typing
>
> ?IRanges
>
>  > Usage:
>  >
>  >      ## IRanges constructor:
>  >      IRanges(start=NULL, end=NULL, width=NULL, names=NULL)
>  >
>
> Load a Transcript Db object and extract transcripts by gene:
> library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
> tx <- transcriptsBy(TxDb.Dmelanogaster.UCSC.dm3.ensGene, "gene")
>
> Select a gene with transcripts on the negative strand:
> gene <- tx[[3]]
>  >> gene
>  > GRanges with 4 ranges and 2 metadata columns:
>  >      seqnames              ranges strand |    tx_id    tx_name
>  >          <Rle>            <IRanges>  <Rle> | <integer> <character>
>  >  [1]    chr3R [12632936, 12655767]      - |    21863 FBtr0306337
>  >  [2]    chr3R [12633349, 12653845]      - | 21864 FBtr0083388
>  >  [3]    chr3R [12633349, 12655300]      - |    21865 FBtr0083387
>  >  [4]    chr3R [12633349, 12655474]      - |    21866 FBtr0300485
>
> GRanges can be manipulated with resize(), trim(), shift(), flank(),
> narrow() and several other methods. To see them type (with the quotes)
>
> ?`intra-range-methods`
>
> and select the page for GRanges. It sounds like resize() is what you're
> looking for.
>
> resize(gene, width = 10)
>  >> resize(gene, width = 10)
>  > GRanges with 4 ranges and 2 metadata columns:
>  >      seqnames              ranges strand |    tx_id    tx_name
>  >          <Rle>            <IRanges>  <Rle> | <integer> <character>
>  >  [1]    chr3R [12655758, 12655767]      - |    21863 FBtr0306337
>  >  [2]    chr3R [12653836, 12653845]      - |    21864 FBtr0083388
>  >  [3]    chr3R [12655291, 12655300]      - |    21865 FBtr0083387
>  >  [4]    chr3R [12655465, 12655474]      - |    21866 FBtr0300485
>
> If you have sequence data instead of range data, the XStringSet family
> is more appropriate. For examples of manipulating sequences see Section
> E on the XStringSet man page. The functions you want are narrow() or
> subseq().
>
> library(Biostrings)
> ?XStringSet
>
>
> Valerie
>
>
> On 08/08/2014 08:38 AM, carol white wrote:
>  > I have the problem when i want to take the width from the end of a
>  > sequence on a reverse strand.
>  > if I take the nucleotide seq of a gene that is on the reverse strand on
>  > the ncbi web site and extract for ex 10 or 20 bp from the end, i don't
>  > get the same as I do with iranges. As I have already given the strand as
>  > the parameter to the iranges function, I assume that it has already
>  > reverse-complemented by iranges. I don't have this problem with the
>  > genes that are on the forward strand nor when I take the sub sequence
>  > from the beginning of the sequence.
>  >
>  > Regards,
>  > On Friday, August 8, 2014 5:28 PM, Valerie Obenchain
>  > <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>> wrote:
>  >
>  >
>  > Did you provide 'start', 'end' and 'width' and get a confusing answer?
>  > If yes, please show your example.
>  >
>  > Thanks.
>  > Valerie
>  >
>  >
>  >
>  > On 08/08/2014 08:23 AM, Valerie Obenchain wrote:
>  >  > Hi Carol,
>  >  >
>  >  > The 'end' is the end of the range. When you specify ranges with 'end'
>  >  > and 'width' the range will always end at the 'end' value.
>  >  >
>  >  >  > IRanges(end = 10, width = c(5, 10))
>  >  > IRanges of length 2
>  >  >      start end width
>  >  > [1]    6  10    5
>  >  > [2]    1  10    10
>  >  >
>  >  >
>  >  > Similar reasoning for 'start' and 'width':
>  >  >
>  >  >  > IRanges(start = 10, width = c(5, 10))
>  >  > IRanges of length 2
>  >  >      start end width
>  >  > [1]    10  14    5
>  >  > [2]    10  19    10
>  >  >
>  >  >
>  >  > Valerie
>  >  >
>  >  >
>  >  >
>  >  > On 08/08/2014 01:29 AM, carol white wrote:
>  >  >> Hi,
>  >  >> How does width with start and end in IRanges work? I thought that
> if I
>  >  >> use the end with a width, then the sequence from the end with the
>  >  >> length of width is taken. However, in my case when I use width for ex
>  >  >> 20 and 10, the corresponding sequences with the length 20 and 10 are
>  >  >> not the same from the end but from the beginning. Did I misunderstood
>  >  >> some thing?
>  >  >>
>  >  >> Regards,
>  >  >>
>  >  >> Carol
>  >  >>    [[alternative HTML version deleted]]
>  >  >>
>  >  >> _______________________________________________
>  >  >> Bioconductor mailing list
>  >  >> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
> <mailto:Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>>
>  >  >> https://stat.ethz.ch/mailman/listinfo/bioconductor
>  >  >> Search the archives:
>  >  >> http://news.gmane.org/gmane.science.biology.informatics.conductor
>  >  >>
>  >  >
>  >  >
>  >
>  >
>  > --
>  > Valerie Obenchain
>  > Program in Computational Biology
>  > Fred Hutchinson Cancer Research Center
>  > 1100 Fairview Ave. N, Seattle, WA 98109
>  >
>  > Email: vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
> <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>
>
>  > Phone: (206) 667-3158
>  >
>  >
>
>
>