[BioC] [devteam-bioc] GenomicRanges seqlengths problem

Hervé Pagès hpages at fhcrc.org
Thu May 1 19:27:44 CEST 2014



On 05/01/2014 10:21 AM, Maintainer wrote:
> Hi Sonia, Val,
>
> On 05/01/2014 09:52 AM, Maintainer wrote:
>> Hi,
>>
>> The seqnames are in a different orders in the 'chrs' and 'sl' objects.
>> Looking at the first 3 names in each,
>>
>>    > head(seqlengths(chrs), 3)
>>     lm_SuperContig_0_v2  lm_SuperContig_1_v2 lm_SuperContig_10_v2
>>                      NA                   NA                   NA
>>
>>    > head(sl, 3)
>> lm_SuperContig_0_v2 lm_SuperContig_1_v2 lm_SuperContig_2_v2
>>                4258568             3378610             2939989
>>
>> When a GRanges is created, seqnames are sorted in ascii order.
>
> Just to clarify, the seqlevels are sorted, not the seqnames:
>
>     > gr <- GRanges(c("b", "a", "b"), IRanges(1:3, 10))
>     > gr
>     GRanges with 3 ranges and 0 metadata columns:
>           seqnames    ranges strand
>              <Rle> <IRanges>  <Rle>
>       [1]        b   [1, 10]      *
>       [2]        a   [2, 10]      *
>       [3]        b   [3, 10]      *
>       ---
>       seqlengths:
>         a  b
>        NA NA
>
> Not sorted (i.e. user-supplied order is preserved):
>
>     > seqnames(gr)
>     factor-Rle of length 3 with 3 runs
>       Lengths: 1 1 1
>       Values : b a b
>     Levels(2): a b
>
> Sorted:
>
>     > seqlevels(gr)
>     [1] "a" "b"
>
> This sorting of the seqlevels is not good anyway (people with different
> LOCALE will get different results). I just fixed this so now the
> GRanges() constructor will preserve the seqlevels in the order supplied
> by the user:
>
>     > gr <- GRanges(c("b", "a", "b"), IRanges(1:3, 10))
>     > seqlevels(gr)
>     [1] "b" "a"
>
> More precisely, the seqlevels are obtained by doing unique() on the
> seqnames.
>
> The fix will propagate to the public repos and become available thru
> biocLite() in the next 24 hours.
>
> Then your original code should just work Sonia.

Or maybe not :) You might still need to use 'stringsAsFactors=FALSE'
when calling read.table(). Please let us know if that still doesn't
make it.

Thanks,
H.

>
> Cheers,
> H.
>
>
>> You can
>> see the full list with seqinfo():
>>
>>    > seqinfo(chrs)
>> Seqinfo of length 34
>> seqnames             seqlengths isCircular genome
>> lm_SuperContig_0_v2        <NA>       <NA>   <NA>
>> lm_SuperContig_1_v2        <NA>       <NA>   <NA>
>> lm_SuperContig_10_v2       <NA>       <NA>   <NA>
>> lm_SuperContig_11_v2       <NA>       <NA>   <NA>
>> lm_SuperContig_12_v2       <NA>       <NA>   <NA>
>> ...                         ...        ...    ...
>>
>> The seqnames in the replacement vector must match the order in the
>> 'Seqinfo' object.
>>
>> To re-order:
>>
>> sl_new <- sl[match(levels(factor(names(sl))), names(sl))]
>>
>> Then add the new lengths:
>>
>>>> seqlengths(chrs) <- sl_new
>>>> chrs
>>> GRanges with 34 ranges and 0 metadata columns:
>>>                      seqnames       ranges strand
>>>                         <Rle>    <IRanges>  <Rle>
>>>      [1]  lm_SuperContig_0_v2 [1, 4258568]      *
>>>      [2]  lm_SuperContig_1_v2 [1, 3378610]      *
>>>      [3]  lm_SuperContig_2_v2 [1, 2939989]      *
>>>      [4]  lm_SuperContig_3_v2 [1, 2348246]      *
>>>      [5]  lm_SuperContig_4_v2 [1, 1918205]      *
>>>      ...                  ...          ...    ...
>>>     [30] lm_SuperContig_29_v2  [1, 200940]      *
>>>     [31] lm_SuperContig_30_v2  [1, 154863]      *
>>>     [32] lm_SuperContig_31_v2  [1, 143268]      *
>>>     [33] lm_SuperContig_32_v2  [1,  87679]      *
>>>     [34] lm_SuperContig_34_v2  [1,  58596]      *
>>>     ---
>>>     seqlengths:
>>>       lm_SuperContig_0_v2  lm_SuperContig_1_v2 ...  lm_SuperContig_9_v2
>>>                   4258568              3378610 ...              1772623
>>
>>
>> Valerie
>>
>>
>> On 05/01/2014 07:20 AM, Maintainer wrote:
>>>
>>> Hello,
>>>
>>> I have a problem with supplying GRanges object with seqlengths.
>>> I have a files.
>>> chrs.txt - contains information about chromosomes
>>>
>>> chrs.txt
>>> chr,start,end,len
>>> lm_SuperContig_0_v2,1,4258568,4258568
>>> lm_SuperContig_1_v2,1,3378610,3378610
>>> lm_SuperContig_2_v2,1,2939989,2939989
>>> lm_SuperContig_3_v2,1,2348246,2348246
>>> lm_SuperContig_4_v2,1,1918205,1918205
>>> lm_SuperContig_6_v2,1,1888674,1888674
>>> lm_SuperContig_5_v2,1,1869450,1869450
>>> lm_SuperContig_8_v2,1,1809296,1809296
>>> lm_SuperContig_9_v2,1,1772623,1772623
>>> lm_SuperContig_7_v2,1,1769547,1769547
>>> lm_SuperContig_10_v2,1,1758670,1758670
>>> lm_SuperContig_13_v2,1,1634580,1634580
>>> lm_SuperContig_12_v2,1,1631710,1631710
>>> lm_SuperContig_11_v2,1,1590160,1590160
>>> lm_SuperContig_15_v2,1,1560629,1560629
>>> lm_SuperContig_14_v2,1,1533332,1533332
>>> lm_SuperContig_17_v2,1,1445693,1445693
>>> lm_SuperContig_16_v2,1,1397653,1397653
>>> lm_SuperContig_18_v2,1,1351976,1351976
>>> lm_SuperContig_19_v2,1,1186800,1186800
>>> lm_SuperContig_20_v2,1,1087932,1087932
>>> lm_SuperContig_21_v2,1,1020521,1020521
>>> lm_SuperContig_22_v2,1,731443,731443
>>> lm_SuperContig_23_v2,1,521426,521426
>>> lm_SuperContig_24_v2,1,475869,475869
>>> lm_SuperContig_25_v2,1,318058,318058
>>> lm_SuperContig_26_v2,1,261540,261540
>>> lm_SuperContig_27_v2,1,250629,250629
>>> lm_SuperContig_28_v2,1,236098,236098
>>> lm_SuperContig_29_v2,1,200940,200940
>>> lm_SuperContig_30_v2,1,154863,154863
>>> lm_SuperContig_31_v2,1,143268,143268
>>> lm_SuperContig_32_v2,1,87679,87679
>>> lm_SuperContig_34_v2,1,58596,58596
>>>
>>> I try to do the following:
>>> # creating GRanges object for chromosomes
>>> dataChr <- read.table("chrs.txt",header=T,sep=",")
>>> chrs <- with(dataChr, GRanges(chr, IRanges(start, end)))
>>> sl <- setNames(dataChr$len, as.character(dataChr$chr))
>>> seqlengths(chrs) <- sl
>>>
>>> And I get the following error:
>>>
>>> Error in .normargSeqlengths(value, seqnames(x)) :
>>>      when the supplied 'seqlengths' vector is named, the names must match the seqnames
>>>
>>> Any chance you could help me with what is going on?
>>>
>>> Best wishes,
>>> Agnieszka
>>>
>>>
>>>     -- output of sessionInfo():
>>>
>>> R version 3.1.0 (2014-04-10)
>>> Platform: i386-w64-mingw32/i386 (32-bit)
>>>
>>> locale:
>>> [1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252
>>> [4] LC_NUMERIC=C                            LC_TIME=English_United Kingdom.1252
>>>
>>> attached base packages:
>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>> [1] XVector_0.4.0        ggbio_1.12.3         ggplot2_0.9.3.1      GenomicRanges_1.16.2 GenomeInfoDb_1.0.2   IRanges_1.22.4       BiocGenerics_0.10.0
>>> [8] BiocInstaller_1.14.2
>>>
>>> loaded via a namespace (and not attached):
>>>     [1] AnnotationDbi_1.26.0     BatchJobs_1.2            BBmisc_1.6               Biobase_2.24.0           BiocParallel_0.6.0       biomaRt_2.20.0
>>>     [7] Biostrings_2.32.0        biovizBase_1.12.1        bitops_1.0-6             brew_1.0-6               BSgenome_1.32.0          cluster_1.15.2
>>> [13] codetools_0.2-8          colorspace_1.2-4         DBI_0.2-7                dichromat_2.0-0          digest_0.6.4             fail_1.2
>>> [19] foreach_1.4.2            Formula_1.1-1            GenomicAlignments_1.0.0  GenomicFeatures_1.16.0   grid_3.1.0               gridExtra_0.9.1
>>> [25] gtable_0.1.2             Hmisc_3.14-4             iterators_1.0.7          labeling_0.2             lattice_0.20-29          latticeExtra_0.6-26
>>> [31] MASS_7.3-31              munsell_0.4.2            plyr_1.8.1               proto_0.3-10             RColorBrewer_1.0-5       Rcpp_0.11.1
>>> [37] RCurl_1.95-4.1           reshape2_1.4             Rsamtools_1.16.0         RSQLite_0.11.4           rtracklayer_1.24.0       scales_0.2.4
>>> [43] sendmailR_1.1-2          splines_3.1.0            stats4_3.1.0             stringr_0.6.2            survival_2.37-7          tools_3.1.0
>>> [49] VariantAnnotation_1.10.0 XML_3.98-1.1             zlibbioc_1.10.0
>>>
>>>
>>> --
>>> Sent via the guest posting facility at bioconductor.org.
>>>
>>> ________________________________________________________________________
>>> devteam-bioc mailing list
>>> To unsubscribe from this mailing list send a blank email to
>>> devteam-bioc-leave at lists.fhcrc.org
>>> You can also unsubscribe or change your personal options at
>>> https://lists.fhcrc.org/mailman/listinfo/devteam-bioc
>>>
>>
>>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list