[BioC] GRanges - reduce() function

Hervé Pagès hpages at fhcrc.org
Fri Nov 18 21:39:23 CET 2011


Hi Malcolm,

On 11-11-18 12:13 PM, Cook, Malcolm wrote:
> Herve,
>
> re: does anyone object
>
> To be honest, I've been skimming.
>
> Are you proposing that the names of the unlisted GRanges don't get suffixed so as to be unique.

Correct. No more mangling, so yes, they would be repeated. It's not
ideal because if you subset by the gene name then you only get the
first occurence. But the current situation is even worse: you don't
get anything!

>
> If so, then, no objections here.

Good.

H.

>
> Thanks!
>
> ~Malcolm
>
>
>> -----Original Message-----
>> From: bioconductor-bounces at r-project.org [mailto:bioconductor-
>> bounces at r-project.org] On Behalf Of Hervé Pagès
>> Sent: Friday, November 18, 2011 1:38 PM
>> To: Fahim Mohammad
>> Cc: Vincent Carey; Bioconductor Newsgroup
>> Subject: Re: [BioC] GRanges - reduce() function
>>
>> Hi Fahim and others,
>>
>> On 11-11-18 07:45 AM, Fahim Mohammad wrote:
>>> Thanks a lot Vincent,
>>> Your suggestion helped me a lot, but the problem of mangled names and
>>> unsorted ranges remained. Also the 'unlist' function add sufiix in the
>>> name.
>>>
>>>> unlist(reduce(split(gr, elementMetadata(gr)$Gene)))GRanges with 4
>> ranges and 0 elementMetadata values:
>>>           seqnames    ranges strand
>>>              <Rle>   <IRanges>    <Rle>
>>>     www.1     chr2  [10, 19]      *
>>>     xxx.2     chr1  [ 1,  9]      -
>>>     yyy.3     chr1  [ 4, 12]      +
>>>     zzz.4     chr2  [ 1,  9]      -
>>>     ---
>>>     seqlengths:
>>>      chr1 chr2
>>>        NA   NA
>>
>> Yes this mangling is definitely not helping. We have control over the
>> "unlist" method for GRangesList objects so we could change that if
>> nobody objects. I think it was originally implemented that way for 2
>> reasons:
>>
>>     (1) to mimic what base::unlist() does (even though most people
>>         don't like it);
>>
>>     (2) also until recently the names of a GRanges object had to be
>>         unique.
>>
>> Now that the names of a GRanges object don't have to be unique anymore
>> (since BioC 2.9), maybe we could change the behaviour of unlist().
>> Does anybody object?
>>
>>>
>>> So I ended up calling GRanges function twice. I know that this approach
>> may
>>> not be the best and there may be other efficient workaround.
>>> Thanks again for your help.
>>> Here is what I did.
>>>
>>>
>>>>         gr<- GRanges(+                     seqnames = Rle(c("chr1", "chr2", "chr2"),
>> c(6, 2,2)), +                     ranges = IRanges(c(1:6, 1:2, 10:11) , end = c(7:12, 5,9,
>> 18:19 )), +                     strand = Rle(strand(c("-", "+", "-", "*")),c(3, 3, 2,2)), +
>> transcript  = head(letters, 10),+                     Gene = c(rep("xxx",3),
>> rep("yyy",3), rep("zzz",2), rep("www",2)) +                     )>   grGRanges with 10
>> ranges and 2 elementMetadata values:
>>>          seqnames    ranges strand |  transcript        Gene
>>>             <Rle>   <IRanges>    <Rle>   |<character>   <character>
>>>      [1]     chr1  [ 1,  7]      - |           a         xxx
>>>      [2]     chr1  [ 2,  8]      - |           b         xxx
>>>      [3]     chr1  [ 3,  9]      - |           c         xxx
>>>      [4]     chr1  [ 4, 10]      + |           d         yyy
>>>      [5]     chr1  [ 5, 11]      + |           e         yyy
>>>      [6]     chr1  [ 6, 12]      + |           f         yyy
>>>      [7]     chr2  [ 1,  5]      - |           g         zzz
>>>      [8]     chr2  [ 2,  9]      - |           h         zzz
>>>      [9]     chr2  [10, 18]      * |           i         www
>>>     [10]     chr2  [11, 19]      * |           j         www
>>>     ---
>>>     seqlengths:
>>>      chr1 chr2
>>>        NA   NA>         reducedRanges = reduce(split(gr,
>>> elementMetadata(gr)$Gene)) #reduce the ranges according to 'Gene'
>>>
>>> # Convert the rr (reduced ranges) to data frame
>>>
>>>
>>>>         tmp = as.data.frame(reducedRanges)>   tmp  element seqnames start
>> end width strand
>>> 1     www     chr2    10  19    10      *
>>> 2     xxx     chr1     1   9     9      -
>>> 3     yyy     chr1     4  12     9      +
>>> 4     zzz     chr2     1   9     9      ->   rr =
>>> tmp[order(tmp[,'seqnames'], tmp[,'start']),];   #sort the data frame
>>> first on 'seqname' and 'start' location>   rr  element seqnames start
>>> end width strand
>>> 2     xxx     chr1     1   9     9      -
>>> 3     yyy     chr1     4  12     9      +
>>> 4     zzz     chr2     1   9     9      -
>>> 1     www     chr2    10  19    10      *
>>>
>>> # And now convert the data frame again into GRanges object.
>>>
>>>
>>>>         grFinal<- GRanges(seqnames = Rle(rr[,'seqnames']),+
>> ranges = IRanges (start =rr[,'start'], end = rr[,'end'] ),+                          strand =
>> Rle(rr[,'strand']),+                          Gene  = rr[,'element']+                          )>
>> grFinalGRanges with 4 ranges and 1 elementMetadata value:
>>>         seqnames    ranges strand |        Gene
>>>            <Rle>   <IRanges>    <Rle>   |<character>
>>>     [1]     chr1  [ 1,  9]      - |         xxx
>>>     [2]     chr1  [ 4, 12]      + |         yyy
>>>     [3]     chr2  [ 1,  9]      - |         zzz
>>>     [4]     chr2  [10, 19]      * |         www
>>>     ---
>>>     seqlengths:
>>>      chr1 chr2
>>>        NA   NA
>>
>> Alternatively you could do:
>>
>>     grl<- reduce(split(gr, elementMetadata(gr)$Gene))
>>     reducedRanges<- unlist(grl, use.names=FALSE)
>>     elementMetadata(reducedRanges)$Gene<- rep(names(grl),
>> elementLengths(grl))
>>
>>   >  reducedRanges
>> GRanges with 4 ranges and 1 elementMetadata value:
>>         seqnames    ranges strand |        Gene
>>            <Rle>  <IRanges>   <Rle>  |<character>
>>     [1]     chr2  [10, 19]      * |         www
>>     [2]     chr1  [ 1,  9]      - |         xxx
>>     [3]     chr1  [ 4, 12]      + |         yyy
>>     [4]     chr2  [ 1,  9]      - |         zzz
>>     ---
>>     seqlengths:
>>      chr1 chr2
>>        NA   NA
>>
>> Then if you want to restore the original order:
>>
>>     oo<- order(match(elementMetadata(reducedRanges)$Gene,
>>                 unique(elementMetadata(gr)$Gene)))
>>     grFinal<- reducedRanges[oo]
>>
>>   >  grFinal
>> GRanges with 4 ranges and 1 elementMetadata value:
>>         seqnames    ranges strand |        Gene
>>            <Rle>  <IRanges>   <Rle>  |<character>
>>     [1]     chr1  [ 1,  9]      - |         xxx
>>     [2]     chr1  [ 4, 12]      + |         yyy
>>     [3]     chr2  [ 1,  9]      - |         zzz
>>     [4]     chr2  [10, 19]      * |         www
>>     ---
>>     seqlengths:
>>      chr1 chr2
>>        NA   NA
>>
>> Or if you want to order by seqnames and start:
>>
>>     oo2<- order(as.vector(seqnames(reducedRanges)), start(reducedRanges))
>>
>>   >  reducedRanges[oo2]
>> GRanges with 4 ranges and 1 elementMetadata value:
>>         seqnames    ranges strand |        Gene
>>            <Rle>  <IRanges>   <Rle>  |<character>
>>     [1]     chr1  [ 1,  9]      - |         xxx
>>     [2]     chr1  [ 4, 12]      + |         yyy
>>     [3]     chr2  [ 1,  9]      - |         zzz
>>     [4]     chr2  [10, 19]      * |         www
>>     ---
>>     seqlengths:
>>      chr1 chr2
>>        NA   NA
>>
>> which in your case happens to give the same result.
>>
>> Cheers,
>> H.
>>
>>>
>>>
>>>
>>> Regards.
>>> Fahim
>>>
>>> On Thu, Nov 17, 2011 at 9:13 PM, Vincent Carey
>>> <stvjc at channing.harvard.edu>wrote:
>>>
>>>>
>>>>
>>>> On Thu, Nov 17, 2011 at 6:28 PM, Fahim
>> Mohammad<fahim.md at gmail.com>wrote:
>>>>
>>>>> Hi
>>>>> In the following example, I am trying to use 'reduce()' function to reduce
>>>>> the genomic intervals and find intervals corresponding to  'Gene'. It is
>>>>> behaving as it is desired. However the output of the reduce just give the
>>>>> intervals (I am losing the 'Gene' metadata).
>>>>> Is there a way to retain 'Gene' metadata.
>>>>>
>>>>>
>>>>>>         gr<- GRanges(+                     seqnames = Rle(c("chr1",
>>>>> "chr2", "chr2"), c(6, 2,2)), +                     ranges = IRanges(c(1:6,
>>>>> 1:2, 10:11) , end = c(7:12, 5,9, 18:19 )), +                     strand =
>>>>> Rle(strand(c("-", "+", "-", "*")),c(3, 3, 2,2)), +
>>>>> transcript  = head(letters, 10),+                     Gene =
>>>>> c(rep("xxx",3), rep("yyy",3), rep("zzz",2), rep("www",2)) +
>>>>>       )>         grGRanges with 10 ranges and 2 elementMetadata values:
>>>>>
>>>>>         seqnames    ranges strand |  transcript        Gene
>>>>>            <Rle>   <IRanges>    <Rle>   |<character>   <character>
>>>>>     [1]     chr1  [ 1,  7]      - |           a         xxx
>>>>>     [2]     chr1  [ 2,  8]      - |           b         xxx
>>>>>     [3]     chr1  [ 3,  9]      - |           c         xxx
>>>>>     [4]     chr1  [ 4, 10]      + |           d         yyy
>>>>>     [5]     chr1  [ 5, 11]      + |           e         yyy
>>>>>     [6]     chr1  [ 6, 12]      + |           f         yyy
>>>>>     [7]     chr2  [ 1,  5]      - |           g         zzz
>>>>>     [8]     chr2  [ 2,  9]      - |           h         zzz
>>>>>     [9]     chr2  [10, 18]      * |           i         www
>>>>>    [10]     chr2  [11, 19]      * |           j         www
>>>>>    ---
>>>>>    seqlengths:
>>>>>     chr1 chr2
>>>>>       NA   NA>         x = reduce(gr)>         xGRanges with 4 ranges and 0
>>>>>
>>>>> elementMetadata values:
>>>>>        seqnames    ranges strand
>>>>>           <Rle>   <IRanges>    <Rle>
>>>>>    [1]     chr1  [ 4, 12]      +
>>>>>    [2]     chr1  [ 1,  9]      -
>>>>>    [3]     chr2  [ 1,  9]      -
>>>>>    [4]     chr2  [10, 19]      *
>>>>>    ---
>>>>>    seqlengths:
>>>>>     chr1 chr2
>>>>>       NA   NA
>>>>>
>>>>
>>>> This doesn't do exactly what you seek, but it gives a mangled names
>>>> attributed that you can unmangle
>>>> and reattach as metadata if you like
>>>>
>>>> unlist(reduce(split(gr, elementMetadata(gr)$Gene)))
>>>>
>>>>
>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>> Fahim
>>>>>
>>>>> --
>>>>> -------------------------------
>>>>> Fahim Mohammad
>>>>> Bioinforformatics Lab
>>>>> University of Louisville
>>>>> Louisville, KY, USA
>>>>>
>>>>>          [[alternative HTML version deleted]]
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list