[BioC] [devteam-bioc] GenomicRanges seqlengths problem

Hervé Pagès hpages at fhcrc.org
Thu May 1 19:21:26 CEST 2014


Hi Sonia, Val,

On 05/01/2014 09:52 AM, Maintainer wrote:
> Hi,
>
> The seqnames are in a different orders in the 'chrs' and 'sl' objects.
> Looking at the first 3 names in each,
>
>   > head(seqlengths(chrs), 3)
>    lm_SuperContig_0_v2  lm_SuperContig_1_v2 lm_SuperContig_10_v2
>                     NA                   NA                   NA
>
>   > head(sl, 3)
> lm_SuperContig_0_v2 lm_SuperContig_1_v2 lm_SuperContig_2_v2
>               4258568             3378610             2939989
>
> When a GRanges is created, seqnames are sorted in ascii order.

Just to clarify, the seqlevels are sorted, not the seqnames:

   > gr <- GRanges(c("b", "a", "b"), IRanges(1:3, 10))
   > gr
   GRanges with 3 ranges and 0 metadata columns:
         seqnames    ranges strand
            <Rle> <IRanges>  <Rle>
     [1]        b   [1, 10]      *
     [2]        a   [2, 10]      *
     [3]        b   [3, 10]      *
     ---
     seqlengths:
       a  b
      NA NA

Not sorted (i.e. user-supplied order is preserved):

   > seqnames(gr)
   factor-Rle of length 3 with 3 runs
     Lengths: 1 1 1
     Values : b a b
   Levels(2): a b

Sorted:

   > seqlevels(gr)
   [1] "a" "b"

This sorting of the seqlevels is not good anyway (people with different
LOCALE will get different results). I just fixed this so now the
GRanges() constructor will preserve the seqlevels in the order supplied
by the user:

   > gr <- GRanges(c("b", "a", "b"), IRanges(1:3, 10))
   > seqlevels(gr)
   [1] "b" "a"

More precisely, the seqlevels are obtained by doing unique() on the
seqnames.

The fix will propagate to the public repos and become available thru
biocLite() in the next 24 hours.

Then your original code should just work Sonia.

Cheers,
H.


> You can
> see the full list with seqinfo():
>
>   > seqinfo(chrs)
> Seqinfo of length 34
> seqnames             seqlengths isCircular genome
> lm_SuperContig_0_v2        <NA>       <NA>   <NA>
> lm_SuperContig_1_v2        <NA>       <NA>   <NA>
> lm_SuperContig_10_v2       <NA>       <NA>   <NA>
> lm_SuperContig_11_v2       <NA>       <NA>   <NA>
> lm_SuperContig_12_v2       <NA>       <NA>   <NA>
> ...                         ...        ...    ...
>
> The seqnames in the replacement vector must match the order in the
> 'Seqinfo' object.
>
> To re-order:
>
> sl_new <- sl[match(levels(factor(names(sl))), names(sl))]
>
> Then add the new lengths:
>
>>> seqlengths(chrs) <- sl_new
>>> chrs
>> GRanges with 34 ranges and 0 metadata columns:
>>                     seqnames       ranges strand
>>                        <Rle>    <IRanges>  <Rle>
>>     [1]  lm_SuperContig_0_v2 [1, 4258568]      *
>>     [2]  lm_SuperContig_1_v2 [1, 3378610]      *
>>     [3]  lm_SuperContig_2_v2 [1, 2939989]      *
>>     [4]  lm_SuperContig_3_v2 [1, 2348246]      *
>>     [5]  lm_SuperContig_4_v2 [1, 1918205]      *
>>     ...                  ...          ...    ...
>>    [30] lm_SuperContig_29_v2  [1, 200940]      *
>>    [31] lm_SuperContig_30_v2  [1, 154863]      *
>>    [32] lm_SuperContig_31_v2  [1, 143268]      *
>>    [33] lm_SuperContig_32_v2  [1,  87679]      *
>>    [34] lm_SuperContig_34_v2  [1,  58596]      *
>>    ---
>>    seqlengths:
>>      lm_SuperContig_0_v2  lm_SuperContig_1_v2 ...  lm_SuperContig_9_v2
>>                  4258568              3378610 ...              1772623
>
>
> Valerie
>
>
> On 05/01/2014 07:20 AM, Maintainer wrote:
>>
>> Hello,
>>
>> I have a problem with supplying GRanges object with seqlengths.
>> I have a files.
>> chrs.txt - contains information about chromosomes
>>
>> chrs.txt
>> chr,start,end,len
>> lm_SuperContig_0_v2,1,4258568,4258568
>> lm_SuperContig_1_v2,1,3378610,3378610
>> lm_SuperContig_2_v2,1,2939989,2939989
>> lm_SuperContig_3_v2,1,2348246,2348246
>> lm_SuperContig_4_v2,1,1918205,1918205
>> lm_SuperContig_6_v2,1,1888674,1888674
>> lm_SuperContig_5_v2,1,1869450,1869450
>> lm_SuperContig_8_v2,1,1809296,1809296
>> lm_SuperContig_9_v2,1,1772623,1772623
>> lm_SuperContig_7_v2,1,1769547,1769547
>> lm_SuperContig_10_v2,1,1758670,1758670
>> lm_SuperContig_13_v2,1,1634580,1634580
>> lm_SuperContig_12_v2,1,1631710,1631710
>> lm_SuperContig_11_v2,1,1590160,1590160
>> lm_SuperContig_15_v2,1,1560629,1560629
>> lm_SuperContig_14_v2,1,1533332,1533332
>> lm_SuperContig_17_v2,1,1445693,1445693
>> lm_SuperContig_16_v2,1,1397653,1397653
>> lm_SuperContig_18_v2,1,1351976,1351976
>> lm_SuperContig_19_v2,1,1186800,1186800
>> lm_SuperContig_20_v2,1,1087932,1087932
>> lm_SuperContig_21_v2,1,1020521,1020521
>> lm_SuperContig_22_v2,1,731443,731443
>> lm_SuperContig_23_v2,1,521426,521426
>> lm_SuperContig_24_v2,1,475869,475869
>> lm_SuperContig_25_v2,1,318058,318058
>> lm_SuperContig_26_v2,1,261540,261540
>> lm_SuperContig_27_v2,1,250629,250629
>> lm_SuperContig_28_v2,1,236098,236098
>> lm_SuperContig_29_v2,1,200940,200940
>> lm_SuperContig_30_v2,1,154863,154863
>> lm_SuperContig_31_v2,1,143268,143268
>> lm_SuperContig_32_v2,1,87679,87679
>> lm_SuperContig_34_v2,1,58596,58596
>>
>> I try to do the following:
>> # creating GRanges object for chromosomes
>> dataChr <- read.table("chrs.txt",header=T,sep=",")
>> chrs <- with(dataChr, GRanges(chr, IRanges(start, end)))
>> sl <- setNames(dataChr$len, as.character(dataChr$chr))
>> seqlengths(chrs) <- sl
>>
>> And I get the following error:
>>
>> Error in .normargSeqlengths(value, seqnames(x)) :
>>     when the supplied 'seqlengths' vector is named, the names must match the seqnames
>>
>> Any chance you could help me with what is going on?
>>
>> Best wishes,
>> Agnieszka
>>
>>
>>    -- output of sessionInfo():
>>
>> R version 3.1.0 (2014-04-10)
>> Platform: i386-w64-mingw32/i386 (32-bit)
>>
>> locale:
>> [1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252
>> [4] LC_NUMERIC=C                            LC_TIME=English_United Kingdom.1252
>>
>> attached base packages:
>> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] XVector_0.4.0        ggbio_1.12.3         ggplot2_0.9.3.1      GenomicRanges_1.16.2 GenomeInfoDb_1.0.2   IRanges_1.22.4       BiocGenerics_0.10.0
>> [8] BiocInstaller_1.14.2
>>
>> loaded via a namespace (and not attached):
>>    [1] AnnotationDbi_1.26.0     BatchJobs_1.2            BBmisc_1.6               Biobase_2.24.0           BiocParallel_0.6.0       biomaRt_2.20.0
>>    [7] Biostrings_2.32.0        biovizBase_1.12.1        bitops_1.0-6             brew_1.0-6               BSgenome_1.32.0          cluster_1.15.2
>> [13] codetools_0.2-8          colorspace_1.2-4         DBI_0.2-7                dichromat_2.0-0          digest_0.6.4             fail_1.2
>> [19] foreach_1.4.2            Formula_1.1-1            GenomicAlignments_1.0.0  GenomicFeatures_1.16.0   grid_3.1.0               gridExtra_0.9.1
>> [25] gtable_0.1.2             Hmisc_3.14-4             iterators_1.0.7          labeling_0.2             lattice_0.20-29          latticeExtra_0.6-26
>> [31] MASS_7.3-31              munsell_0.4.2            plyr_1.8.1               proto_0.3-10             RColorBrewer_1.0-5       Rcpp_0.11.1
>> [37] RCurl_1.95-4.1           reshape2_1.4             Rsamtools_1.16.0         RSQLite_0.11.4           rtracklayer_1.24.0       scales_0.2.4
>> [43] sendmailR_1.1-2          splines_3.1.0            stats4_3.1.0             stringr_0.6.2            survival_2.37-7          tools_3.1.0
>> [49] VariantAnnotation_1.10.0 XML_3.98-1.1             zlibbioc_1.10.0
>>
>>
>> --
>> Sent via the guest posting facility at bioconductor.org.
>>
>> ________________________________________________________________________
>> devteam-bioc mailing list
>> To unsubscribe from this mailing list send a blank email to
>> devteam-bioc-leave at lists.fhcrc.org
>> You can also unsubscribe or change your personal options at
>> https://lists.fhcrc.org/mailman/listinfo/devteam-bioc
>>
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list