[BioC] autoplot transcriptDb error with some regions
Cook, Malcolm
MEC at stowers.org
Mon Nov 4 17:36:52 CET 2013
Tengfei & Herve,
I too am afflicted with this error and am hoping that the following reproducible example will hasten a patch.
I am unsure, but I speculate that this error is raised for the same transcripts for which makeTranscriptDbFromUCSC issues this warning:
In .extractCdsLocsFromUCSCTxTable(ucsc_txtable, exon_locs) :
UCSC data anomaly in 434 transcript(s): the cds cumulative length is not a multiple of 3 for transcripts 'HRA1' 'tP(UGG)A' 'snR18' 'tA(UGC)A' 'tL(CAA)A' 'tS(AGA)A' 'YAR061W' 'YAR062W' 'tP(UGG)Q' '15S_rRNA' 'tW(UCA)Q' 'tE(UUC)Q' 'tS(UGA)Q2' '21S_rRNA' 'tT(UGU)Q1' 'tC(GCA)Q' 'tH(GUG)Q' 'tL(UAA)Q' 'tQ(UUG)Q' 'tK(UUU)Q' 'tR(UCU)Q1' 'tG(UCC)Q' 'tD(GUC)Q' 'tS(GCU)Q1' 'tR(ACG)Q2' 'tA(UGC)Q' 'tI(GAU)Q' 'tY(GUA)Q' 'tN(GUU)Q' 'tM(CAU)Q1' 'tF(GAA)Q' 'tT(XXX)Q2' 'tV(UAC)Q' 'tM(CAU)Q2' 'RPM1' 'snR80' 'snR67' 'snR53' 'tG(GCC)E' 'tS(AGA)E'
'tM(CAU)E' 'RPR1' 'tQ(UUG)E2' 'tK(CUU)E1' 'tR(UCU)E' 'snR14' 'tE(UUC)E1' 'tH(GUG)E1' 'tQ(UUG)E1' 'tS(UGA)E' 'tA(UGC)E' 'SRG1' 'tE(UUC)E2' 'snR4' 'snR52' 'tH(GUG)E2' 'tK(CUU)E2' 'tV(AAC)E1' 'SCR1' 'tI(AAU)E1' 'tV(AAC)E2' ' [... truncated]
Regarding which, the following threads may be of interest:
https://stat.ethz.ch/pipermail/bioconductor/2010-July/034568.html
https://stat.ethz.ch/pipermail/bioconductor/2012-March/044214.html
http://permalink.gmane.org/gmane.science.biology.informatics.conductor/30105
In the last thread, Herve, you wonder:
> Should we allow
> the user to filter CDSs based on this status? Or should we import only
> complete CDSs? Or we import all the CDSs but we store in the metadata
> table of the TranscriptDb object (and then display this in the show
> method) the fact that not all the CDSs are complete?
In my case, a great workaround would be an option to drop (with a warning) the incomplete ones. Alternatively, it would help to have some way to interrogate the tr.db for the transcripts that have this problem, so that I may drop them myself.
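A minimal sketch of what such a check might look like, assuming the per-transcript CDS widths can be obtained with cdsBy() and width() from GenomicFeatures. The helper name incomplete_cds and the toy widths are hypothetical, and this has not been run against the TranscriptDb in this message:

```r
# Hypothetical helper (not part of GenomicFeatures): given a named list of
# CDS widths per transcript, return the names of the transcripts whose
# cumulative CDS length is not a multiple of 3.
incomplete_cds <- function(cds_widths_by_tx) {
  total <- vapply(cds_widths_by_tx, sum, numeric(1))
  as.character(names(total)[total %% 3 != 0])
}

# Toy input standing in for real data (made-up widths).
toy <- list(txA = c(30L, 60L), txB = c(10L, 25L), txC = 7L)
flagged <- incomplete_cds(toy)
print(flagged)

# With a real TranscriptDb the input could presumably be built like this
# (assumption, untested here):
#   cds.by.tx <- cdsBy(tr.db, by = 'tx', use.names = TRUE)
#   incomplete_cds(as.list(width(cds.by.tx)))
```

Transcripts with no CDS at all would not appear in the cdsBy() result, so this would only catch those that have a CDS of anomalous length.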
Tengfei, it would be great if any fix that works in the development version could be ported to the release branch as well.
Cheers,
~Thanks,
Malcolm
library(ggbio)
library(GenomicFeatures)
# Build a TranscriptDb for yeast (sacCer3) from the UCSC 'ensGene' table.
tr.db <- makeTranscriptDbFromUCSC(genome = 'sacCer3', tablename = 'ensGene')
# Group transcripts by gene, then collapse each gene to its overall range.
tr.by.gn.grl <- transcriptsBy(tr.db, 'gene')
gn.gr <- unlist(range(tr.by.gn.grl), use.names = TRUE)
a2 <- geom_alignment(tr.db, which = gn.gr[2])   # this works!
geom_alignment(tr.db, which = gn.gr['HRA1'])    # this breaks
gn.gr[1]
sessionInfo()
## whose output is:
In .extractCdsLocsFromUCSCTxTable(ucsc_txtable, exon_locs) :
UCSC data anomaly in 434 transcript(s): the cds cumulative length is not a multiple of 3 for transcripts 'HRA1' 'tP(UGG)A' 'snR18' 'tA(UGC)A' 'tL(CAA)A' 'tS(AGA)A' 'YAR061W' 'YAR062W' 'tP(UGG)Q' '15S_rRNA' 'tW(UCA)Q' 'tE(UUC)Q' 'tS(UGA)Q2' '21S_rRNA' 'tT(UGU)Q1' 'tC(GCA)Q' 'tH(GUG)Q' 'tL(UAA)Q' 'tQ(UUG)Q' 'tK(UUU)Q' 'tR(UCU)Q1' 'tG(UCC)Q' 'tD(GUC)Q' 'tS(GCU)Q1' 'tR(ACG)Q2' 'tA(UGC)Q' 'tI(GAU)Q' 'tY(GUA)Q' 'tN(GUU)Q' 'tM(CAU)Q1' 'tF(GAA)Q' 'tT(XXX)Q2' 'tV(UAC)Q' 'tM(CAU)Q2' 'RPM1' 'snR80' 'snR67' 'snR53' 'tG(GCC)E' 'tS(AGA)E'
'tM(CAU)E' 'RPR1' 'tQ(UUG)E2' 'tK(CUU)E1' 'tR(UCU)E' 'snR14' 'tE(UUC)E1' 'tH(GUG)E1' 'tQ(UUG)E1' 'tS(UGA)E' 'tA(UGC)E' 'SRG1' 'tE(UUC)E2' 'snR4' 'snR52' 'tH(GUG)E2' 'tK(CUU)E2' 'tV(AAC)E1' 'SCR1' 'tI(AAU)E1' 'tV(AAC)E2' ' [... truncated]
Aggregating TranscriptDb...
Parsing exons...
Parsing cds...
Parsing transcripts...
Parsing utrs and aggregating...
Done
Constructing graphics...
> >
Aggregating TranscriptDb...
Parsing exons...
Parsing cds...
Parsing transcripts...
Parsing utrs and aggregating...
Error in data.frame(tx_id = .nms, tx_name = .tx.nms, gene_id = .gid.nms, :
arguments imply differing number of rows: 0, 1
> > >
GRanges with 1 range and 0 metadata columns:
           seqnames       ranges strand
              <Rle>    <IRanges>  <Rle>
  15S_rRNA     chrM [6546, 8194]      +
  ---
  seqlengths:
     chrI  chrII chrIII   chrIV   chrV  chrVI  chrVII chrVIII  chrIX   chrX  chrXI  chrXII chrXIII chrXIV   chrXV chrXVI  chrM
   230218 813184 316620 1531933 576874 270161 1090940  562643 439888 745751 666816 1078177  924431 784333 1091291 948066 85779
>
R version 3.0.2 (2013-09-25)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices datasets utils methods base
other attached packages:
[1] GenomicFeatures_1.14.0 AnnotationDbi_1.24.0 Biobase_2.22.0 GenomicRanges_1.14.3 XVector_0.2.0 IRanges_1.20.4 BiocGenerics_0.8.0 ggbio_1.10.0 ggplot2_0.9.3.1
loaded via a namespace (and not attached):
[1] biomaRt_2.18.0 Biostrings_2.30.0 biovizBase_1.10.0 bitops_1.0-6 BSgenome_1.30.0 cluster_1.14.4 colorspace_1.2-4 compiler_3.0.2 DBI_0.2-7 dichromat_2.0-0 digest_0.6.3 grid_3.0.2 gridExtra_0.9.1 gtable_0.1.2 Hmisc_3.12-2 labeling_0.2 lattice_0.20-24 MASS_7.3-29 munsell_0.4.2 plyr_1.8 proto_0.3-10 RColorBrewer_1.0-5 RCurl_1.95-4.1 reshape2_1.2.2
[25] rpart_4.1-3 Rsamtools_1.14.1 RSQLite_0.11.4 rtracklayer_1.22.0 scales_0.2.3 stats4_3.0.2 stringr_0.6.2 tools_3.0.2 VariantAnnotation_1.8.2 XML_3.98-1.1 zlibbioc_1.8.0
>
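An aside on the error message itself: "arguments imply differing number of rows: 0, 1" is what base R's data.frame() reports when one argument has length zero and another has length one. That would be consistent with some lookup coming back empty for HRA1, though that is only a guess. A minimal base R sketch that reproduces the message (the column values are made up, and it says nothing about where in ggbio or biovizBase the empty vector arises):

```r
# Base R only: a zero-length column next to a length-one column.
msg <- tryCatch(
  data.frame(tx_id = character(0), gene_id = "HRA1"),
  error = function(e) conditionMessage(e)
)
print(msg)
```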
>-----Original Message-----
>From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Tengfei Yin
>Sent: Friday, October 18, 2013 12:05 PM
>To: Alejandro Reyes
>Cc: bioconductor at r-project.org
>Subject: Re: [BioC] autoplot transcriptDb error with some regions
>
>Hi Alejandro,
>
>Thanks for reporting, I believe that's a bug caused by my recent
>modification in biovizBase package, I am working on that now, will keep you
>updated.
>
>Best
>
>Tengfei
>
>
>On Fri, Oct 18, 2013 at 12:43 PM, Alejandro Reyes
><alejandro.reyes at embl.de> wrote:
>
>> Dear Tengfei Yin,
>>
>> Firstly, thanks for developing ggbio, it has been very useful for me!
>>
>> I am getting an error when using autoplot with some specific genomic
>> regions in transcriptDb objects, here is an example:
>>
>> > suppressMessages( library(ggbio) )
>> > suppressMessages(library(GenomicFeatures))
>> > tx <- makeTranscriptDbFromBiomart()
>>
>> Aggregating TranscriptDb...
>> Parsing exons...
>> Parsing cds...
>> Parsing transcripts...
>> Parsing utrs and aggregating...
>> Done
>> Constructing graphics...
>>
>> prueba <- GRanges( 16, IRanges( start=69598997, 69718569 ) )
>> autoplot( tx, prueba, group.selfish=TRUE, names.expr="")
>>
>> Aggregating TranscriptDb...
>> Parsing exons...
>> Parsing cds...
>> Parsing transcripts...
>> Parsing utrs and aggregating...
>> Done
>> Constructing graphics...
>>
>> So far, excellent, however, when I look into a smaller region I get an
>> error message:
>>
>> > prueba <- GRanges( "16", IRanges(start=69718724, end=69720078 ))
>> > autoplot( tx, prueba, group.selfish=TRUE, names.expr="")
>> Aggregating TranscriptDb...
>> Parsing exons...
>> Parsing cds...
>> Parsing transcripts...
>> Parsing utrs and aggregating...
>> Error in DataFrame(...) : different row counts implied by arguments
>>
>> I believe it has to do with recent modifications of ggbio, since I do not
>> get the error message with older versions, e.g. 1.9.7.
>>
>> > sessionInfo()
>> R Under development (unstable) (2013-07-01 r63121)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] parallel stats graphics grDevices utils datasets methods
>> [8] base
>>
>> other attached packages:
>> [1] ggbio_1.11.0 ggplot2_0.9.3.1 GenomicFeatures_1.15.0
>> [4] AnnotationDbi_1.23.28 Biobase_2.21.7 GenomicRanges_1.13.56
>> [7] XVector_0.1.4 IRanges_1.19.40 BiocGenerics_0.7.8
>> [10] BiocInstaller_1.13.1
>>
>> loaded via a namespace (and not attached):
>> [1] biomaRt_2.17.3 Biostrings_2.29.19 biovizBase_1.9.4
>> [4] bitops_1.0-6 BSgenome_1.29.1 cluster_1.14.4
>> [7] colorspace_1.2-4 DBI_0.2-7 dichromat_2.0-0
>> [10] digest_0.6.3 grid_3.1.0 gridExtra_0.9.1
>> [13] gtable_0.1.2 Hmisc_3.12-2 labeling_0.2
>> [16] lattice_0.20-24 MASS_7.3-29 munsell_0.4.2
>> [19] plyr_1.8 proto_0.3-10 RColorBrewer_1.0-5
>> [22] RCurl_1.95-4.1 reshape2_1.2.2 rpart_4.1-3
>> [25] Rsamtools_1.13.53 RSQLite_0.11.4 rtracklayer_1.21.14
>> [28] scales_0.2.3 stats4_3.1.0 stringr_0.6.2
>> [31] tools_3.1.0 VariantAnnotation_1.7.57 XML_3.98-1.1
>> [34] zlibbioc_1.7.0
>>
>> Best regards,
>> Alejandro Reyes
>>
>
>
>
>--
>Tengfei Yin, PhD
>Seven Bridges Genomics
>sbgenomics.com
>625 Mt. Auburn St. Suite #208
>Cambridge, MA 02138
>(617) 866-0446