[BioC] BUG in Genomic(Features|Ranges): names(unlist(transcriptsBy(txdb, 'gene'))) is UNRELIABLE!!!

Martin Morgan mtmorgan at fhcrc.org
Sun Sep 2 00:11:00 CEST 2012


On 09/01/2012 12:24 PM, Tim Triche, Jr. wrote:
> Hmm, I was about to say "that's not the way it works in devel!!" but
> there you go.  More generally, I wonder if this couldn't be fixed once
> and for all:
>
> Unlist can be maddening -- I would like to add a version (perhaps to
> BiocGenerics) that uses a .[1:length(x)] extension instead of the
> current default of pasting c('', 1:(length(x)-1)) to the name.
> Personally it seems like this would actually be better overall as a
> default, even in base R.  Perhaps I ought to bring up this notion?

BiocGenerics tries not to mess with function signatures; it's used 
widely and so wants to play as nicely as possible with other packages.
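
To make the naming difference concrete, here is a minimal base-R sketch of
the suffix scheme proposed above. It is only an illustration: unlist_indexed
is a hypothetical helper name, not a function in base R or BiocGenerics, and
it assumes a fully named list of atomic vectors.

```r
# Hypothetical helper (not in base R or BiocGenerics): unlist a named list,
# labelling each element "<list name>.<position within its group>".
unlist_indexed <- function(x) {
  n <- vapply(x, length, integer(1))        # size of each group
  out <- unlist(x, use.names = FALSE)       # values only, no mangled names
  names(out) <- paste(rep(names(x), n), sequence(n), sep = ".")
  out
}

x <- list(a = 1:2, b = 3L)
names(unlist(x))          # base R default: "a1" "a2" "b"
names(unlist_indexed(x))  # proposed scheme: "a.1" "a.2" "b.1"
```

Keeping this in a separately named helper, rather than changing unlist
itself, leaves the existing function signature untouched.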

Martin

> Any reason not to risk the ire of Professor Ripley again?   Worst case,
> he points out why this is an idiotic idea and I learn something in the
> process.
>
> thanks,
>
> --t
>
>
>
> On Sat, Sep 1, 2012 at 6:35 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>
>     On 08/31/2012 10:07 PM, Cook, Malcolm wrote:
>
>         Careful fellow travelers,
>
>         I find that unlisting the GenomicRanges returned from a call to
>         `transcriptsBy` returns a list with names that are gene names...
>         only they are incorrect!
>
>         Look:
>
>             txdb <- makeTranscriptDbFromBiomart(biomart="ensembl",
>             dataset="dmelanogaster_gene_ensembl")
>
>         ...
>
>             transcriptsBy(txdb,'gene')[2]
>
>         GRangesList of length 1:
>         $FBgn0000008
>         GRanges with 3 ranges and 2 elementMetadata cols:
>                 seqnames               ranges strand |     tx_id     tx_name
>                    <Rle>            <IRanges>  <Rle> | <integer> <character>
>             [1]       2R [18024494, 18060339]      + |      8616 FBtr0100521
>             [2]       2R [18024496, 18060346]      + |      8615 FBtr0071763
>             [3]       2R [18024938, 18060346]      + |      8617 FBtr0071764
>         ...
>
>             unlist(transcriptsBy(txdb,'gene')[2])
>
>         GRanges with 3 ranges and 2 elementMetadata cols:
>                          seqnames               ranges strand |     tx_id     tx_name
>                             <Rle>            <IRanges>  <Rle> | <integer> <character>
>              FBgn0000008       2R [18024494, 18060339]      + |      8616 FBtr0100521
>             FBgn00000081       2R [18024496, 18060346]      + |      8615 FBtr0071763
>             FBgn00000082       2R [18024938, 18060346]      + |      8617 FBtr0071764
>         ...
>
>
>         Arguably, those names on the GRanges should either all be
>         the same, namely FBgn0000008, or they should not be returned.
>
>
>     This is the way unlist works in base R
>
>      > unlist(list(a=1:2))
>     a1 a2
>       1  2
>
>     but the behavior has been changed in devel (to be released in early
>     October)
>
>      > unlist(GRangesList(A=GRanges("a", IRanges(1:2, 10))))
>     GRanges with 2 ranges and 0 metadata columns:
>          seqnames    ranges strand
>             <Rle> <IRanges>  <Rle>
>        A        a   [1, 10]      *
>        A        a   [2, 10]      *
>        ---
>        seqlengths:
>          a
>         NA
>
>     the work-around, as in base R, is to add use.names=FALSE to unlist
>     (perhaps adding a metadata column of rep(names(grl),
>     elementLengths(grl)), where grl is the GRangesList returned by
>     transcriptsBy).
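
A minimal sketch of that work-around on a plain base-R list, for
illustration only. The list by_gene and its contents are made up, and
vapply(x, length, integer(1)) stands in here for elementLengths on a
GRangesList; for GRanges the group labels would go into a metadata column
rather than a data.frame.

```r
# Made-up example data: values grouped by a (hypothetical) gene name.
by_gene <- list(geneA = c(10, 20, 30), geneB = 40)

# Drop the auto-generated names entirely ...
values <- unlist(by_gene, use.names = FALSE)

# ... and record group membership explicitly, one label per element.
group <- rep(names(by_gene), vapply(by_gene, length, integer(1)))

flat <- data.frame(group = group, value = values)
flat
```

Every row then carries the unmodified group name, so nothing depends on
how unlist decides to decorate names.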
>
>
>         This 'bug' has been around for some time.  I meant to report
>         it, and just tripped over it again.
>
>         Can fix?
>
>         Thanks!
>
>         Malcolm
>
>             sessionInfo()
>
>         R version 2.15.0 (2012-03-30)
>         Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
>         locale:
>         [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
>         attached base packages:
>         [1] tools     splines   parallel  stats     graphics  grDevices utils     datasets  methods   base
>
>         other attached packages:
>          [1] igraph_0.6-2          log4r_0.1-4           vwr_0.1
>          [4] RecordLinkage_0.4-1   ffbase_0.5            ff_2.2-7
>          [7] bit_1.1-8             evd_2.2-7             ipred_0.8-13
>         [10] prodlim_1.3.1         KernSmooth_2.23-8     nnet_7.3-4
>         [13] survival_2.36-14      mlbench_2.1-1         MASS_7.3-20
>         [16] ada_2.0-3             rpart_3.1-54          e1071_1.6
>         [19] class_7.3-4           XLConnect_0.2-0       XLConnectJars_0.2-0
>         [22] rJava_0.9-3           latticeExtra_0.6-19   RColorBrewer_1.0-5
>         [25] lattice_0.20-6        doMC_1.2.5            multicore_0.1-7
>         [28] SRAdb_1.10.0          RCurl_1.91-1          bitops_1.0-4.1
>         [31] graph_1.34.0          BSgenome_1.24.0       rtracklayer_1.16.3
>         [34] Rsamtools_1.8.6       Biostrings_2.24.1     GenomicFeatures_1.8.2
>         [37] AnnotationDbi_1.19.31 GenomicRanges_1.8.12  R.utils_1.16.0
>         [40] R.oo_1.9.8            R.methodsS3_1.4.2     IRanges_1.14.4
>         [43] Biobase_2.17.7        BiocGenerics_0.3.1    data.table_1.8.2
>         [46] compare_0.2-3         svUnit_0.7-10         doParallel_1.0.1
>         [49] iterators_1.0.6       foreach_1.4.0         ggplot2_0.9.1
>         [52] sqldf_0.4-6.4         RSQLite.extfuns_0.0.1 RSQLite_0.11.1
>         [55] chron_2.3-42          gsubfn_0.6-4          proto_0.3-9.2
>         [58] DBI_0.2-5             functional_0.1        reshape_0.8.4
>         [61] plyr_1.7.1            stringr_0.6.1         gtools_2.7.0
>
>         loaded via a namespace (and not attached):
>          [1] biomaRt_2.12.0   codetools_0.2-8  colorspace_1.1-1
>          [4] compiler_2.15.0  dichromat_1.2-4  digest_0.5.2
>          [7] GEOquery_2.23.5  grid_2.15.0      labeling_0.1
>         [10] memoise_0.1      munsell_0.3      reshape2_1.2.1
>         [13] scales_0.2.1     stats4_2.15.0    tcltk_2.15.0
>         [16] XML_3.9-4        zlibbioc_1.2.0
>
>
>
>
>     --
>     Computational Biology / Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N.
>     PO Box 19024 Seattle, WA 98109
>
>     Location: Arnold Building M1 B861
>     Phone: (206) 667-2793
>
> --
> /A model is a lie that helps you see the truth./
> Howard Skipper
> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
>


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


