[BioC] autoplot transcriptDb error with some regions

Mon Nov 4 17:36:52 CET 2013

Tengfei & Herve,

I too am afflicted with this error, and I hope the reproducible example below will hasten a patch.

I am unsure, but I speculate that the error is raised for the same transcripts for which makeTranscriptDbFromUCSC issues this warning:
In .extractCdsLocsFromUCSCTxTable(ucsc_txtable, exon_locs) :
  UCSC data anomaly in 434 transcript(s): the cds cumulative length is not a multiple of 3 for transcripts 'HRA1' 'tP(UGG)A' 'snR18' 'tA(UGC)A' 'tL(CAA)A' 'tS(AGA)A' 'YAR061W' 'YAR062W' 'tP(UGG)Q' '15S_rRNA' 'tW(UCA)Q' 'tE(UUC)Q' 'tS(UGA)Q2' '21S_rRNA' 'tT(UGU)Q1' 'tC(GCA)Q' 'tH(GUG)Q' 'tL(UAA)Q' 'tQ(UUG)Q' 'tK(UUU)Q' 'tR(UCU)Q1' 'tG(UCC)Q' 'tD(GUC)Q' 'tS(GCU)Q1' 'tR(ACG)Q2' 'tA(UGC)Q' 'tI(GAU)Q' 'tY(GUA)Q' 'tN(GUU)Q' 'tM(CAU)Q1' 'tF(GAA)Q' 'tT(XXX)Q2' 'tV(UAC)Q' 'tM(CAU)Q2' 'RPM1' 'snR80' 'snR67' 'snR53' 'tG(GCC)E' 'tS(AGA)E'
  'tM(CAU)E' 'RPR1' 'tQ(UUG)E2' 'tK(CUU)E1' 'tR(UCU)E' 'snR14' 'tE(UUC)E1' 'tH(GUG)E1' 'tQ(UUG)E1' 'tS(UGA)E' 'tA(UGC)E' 'SRG1' 'tE(UUC)E2' 'snR4' 'snR52' 'tH(GUG)E2' 'tK(CUU)E2' 'tV(AAC)E1' 'SCR1' 'tI(AAU)E1' 'tV(AAC)E2' ' [... truncated]

Regarding that warning, the following threads may be of interest:
https://stat.ethz.ch/pipermail/bioconductor/2010-July/034568.html
https://stat.ethz.ch/pipermail/bioconductor/2012-March/044214.html
http://permalink.gmane.org/gmane.science.biology.informatics.conductor/30105

In the last thread, Herve, you wonder:
> Should we allow
> the user to filter CDSs based on this status? Or should we import only
> complete CDSs? Or we import all the CDSs but we store in the metadata
> table of the TranscriptDb object (and then display this in the show
> method) the fact that not all the CDSs are complete?

In my case, a great workaround would be an option to drop (with a warning) the incomplete ones, or some way to interrogate tr.db for the transcripts that have this problem so that I can drop them myself.
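
One way such an interrogation could look, as a rough sketch rather than a tested fix: sum the CDS widths per transcript and report the transcripts whose total is not a multiple of 3, which is the condition the warning describes. The helper name incomplete_cds and the toy vectors are hypothetical and made up for illustration. The commented lines at the end assume that the anomalous CDSs are still stored in tr.db, which is not verified here.

```r
# Hypothetical helper: names of transcripts whose summed CDS width is not a
# multiple of 3. 'tx' and 'width' are parallel vectors, one entry per CDS part.
incomplete_cds <- function(tx, width) {
    total <- tapply(width, tx, sum)   # cumulative CDS length per transcript
    names(total)[total %% 3L != 0L]   # keep those not divisible by 3
}

# Toy data (made up for illustration, not taken from sacCer3):
toy.tx    <- c("txA", "txA", "txB", "txC")
toy.width <- c(30L, 60L, 100L, 9L)
incomplete_cds(toy.tx, toy.width)

# With a real TranscriptDb the inputs might be obtained like this (an
# assumption, not run against the build in this report):
# cds.gr <- unlist(cdsBy(tr.db, by = "tx", use.names = TRUE))
# bad    <- incomplete_cds(names(cds.gr), width(cds.gr))
```

The returned transcript names could then be used to drop the affected entries before calling geom_alignment.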

Tengfei, it would be great if any fix that works in the development version could be ported to the release branch as well.

Cheers,

~Thanks,

Malcolm

library(ggbio)
library(GenomicFeatures)

tr.db<-
    makeTranscriptDbFromUCSC(
    genome='sacCer3'
    ,tablename='ensGene'
    )

tr.by.gn.grl<-transcriptsBy(tr.db,'gene')

gn.gr<-unlist(range(tr.by.gn.grl),use.names=TRUE)

a2<-geom_alignment(tr.db,which=gn.gr[2]) # this works!

geom_alignment(tr.db,which=gn.gr['HRA1']) # this breaks

gn.gr[1]
sessionInfo()

## whose output is:

In .extractCdsLocsFromUCSCTxTable(ucsc_txtable, exon_locs) :
  UCSC data anomaly in 434 transcript(s): the cds cumulative length is not a multiple of 3 for transcripts 'HRA1' 'tP(UGG)A' 'snR18' 'tA(UGC)A' 'tL(CAA)A' 'tS(AGA)A' 'YAR061W' 'YAR062W' 'tP(UGG)Q' '15S_rRNA' 'tW(UCA)Q' 'tE(UUC)Q' 'tS(UGA)Q2' '21S_rRNA' 'tT(UGU)Q1' 'tC(GCA)Q' 'tH(GUG)Q' 'tL(UAA)Q' 'tQ(UUG)Q' 'tK(UUU)Q' 'tR(UCU)Q1' 'tG(UCC)Q' 'tD(GUC)Q' 'tS(GCU)Q1' 'tR(ACG)Q2' 'tA(UGC)Q' 'tI(GAU)Q' 'tY(GUA)Q' 'tN(GUU)Q' 'tM(CAU)Q1' 'tF(GAA)Q' 'tT(XXX)Q2' 'tV(UAC)Q' 'tM(CAU)Q2' 'RPM1' 'snR80' 'snR67' 'snR53' 'tG(GCC)E' 'tS(AGA)E'
  'tM(CAU)E' 'RPR1' 'tQ(UUG)E2' 'tK(CUU)E1' 'tR(UCU)E' 'snR14' 'tE(UUC)E1' 'tH(GUG)E1' 'tQ(UUG)E1' 'tS(UGA)E' 'tA(UGC)E' 'SRG1' 'tE(UUC)E2' 'snR4' 'snR52' 'tH(GUG)E2' 'tK(CUU)E2' 'tV(AAC)E1' 'SCR1' 'tI(AAU)E1' 'tV(AAC)E2' ' [... truncated]

Aggregating TranscriptDb...
Parsing exons...
Parsing cds...
Parsing transcripts...
Parsing utrs and aggregating...
Done
Constructing graphics...
> > 
Aggregating TranscriptDb...
Parsing exons...
Parsing cds...
Parsing transcripts...
Parsing utrs and aggregating...
Error in data.frame(tx_id = .nms, tx_name = .tx.nms, gene_id = .gid.nms,  : 
  arguments imply differing number of rows: 0, 1
> > > 
GRanges with 1 range and 0 metadata columns:
           seqnames       ranges strand
              <Rle>    <IRanges>  <Rle>
  15S_rRNA     chrM [6546, 8194]      +
  ---
  seqlengths:
      chrI   chrII  chrIII   chrIV    chrV   chrVI  chrVII chrVIII   chrIX    chrX   chrXI  chrXII chrXIII  chrXIV   chrXV  chrXVI    chrM
    230218  813184  316620 1531933  576874  270161 1090940  562643  439888  745751  666816 1078177  924431  784333 1091291  948066   85779
> 
R version 3.0.2 (2013-09-25)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] GenomicFeatures_1.14.0 AnnotationDbi_1.24.0   Biobase_2.22.0         GenomicRanges_1.14.3   XVector_0.2.0          IRanges_1.20.4         BiocGenerics_0.8.0     ggbio_1.10.0           ggplot2_0.9.3.1       

loaded via a namespace (and not attached):
 [1] biomaRt_2.18.0          Biostrings_2.30.0       biovizBase_1.10.0       bitops_1.0-6            BSgenome_1.30.0         cluster_1.14.4          colorspace_1.2-4        compiler_3.0.2          DBI_0.2-7               dichromat_2.0-0         digest_0.6.3            grid_3.0.2              gridExtra_0.9.1         gtable_0.1.2            Hmisc_3.12-2            labeling_0.2            lattice_0.20-24         MASS_7.3-29             munsell_0.4.2           plyr_1.8                proto_0.3-10            RColorBrewer_1.0-5      RCurl_1.95-4.1          reshape2_1.2.2         
[25] rpart_4.1-3             Rsamtools_1.14.1        RSQLite_0.11.4          rtracklayer_1.22.0      scales_0.2.3            stats4_3.0.2            stringr_0.6.2           tools_3.0.2             VariantAnnotation_1.8.2 XML_3.98-1.1            zlibbioc_1.8.0         
>

>-----Original Message-----
 >From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Tengfei Yin
 >Sent: Friday, October 18, 2013 12:05 PM
 >To: Alejandro Reyes
 >Cc: bioconductor at r-project.org
 >Subject: Re: [BioC] autoplot transcriptDb error with some regions
 >
 >Hi Alejandro,
 >
 >Thanks for reporting. I believe that's a bug caused by my recent
 >modification to the biovizBase package. I am working on it now and will
 >keep you updated.
 >
 >Best
 >
 >Tengfei
 >
 >
 >On Fri, Oct 18, 2013 at 12:43 PM, Alejandro Reyes
 ><alejandro.reyes at embl.de> wrote:
 >
 >> Dear Tengfei Yin,
 >>
 >> Firstly, thanks for developing ggbio, it has been very useful for me!
 >>
 >> I am getting an error when using autoplot with some specific genomic
 >> regions in transcriptDb objects, here is an example:
 >>
 >> > suppressMessages( library(ggbio) )
 >> > suppressMessages(library(GenomicFeatures))
 >> > tx <- makeTranscriptDbFromBiomart()
 >>
 >> Aggregating TranscriptDb...
 >> Parsing exons...
 >> Parsing cds...
 >> Parsing transcripts...
 >> Parsing utrs and aggregating...
 >> Done
 >> Constructing graphics...
 >>
 >> prueba <- GRanges( 16, IRanges( start=69598997, 69718569 ) )
 >> autoplot( tx, prueba, group.selfish=TRUE, names.expr="")
 >>
 >> Aggregating TranscriptDb...
 >> Parsing exons...
 >> Parsing cds...
 >> Parsing transcripts...
 >> Parsing utrs and aggregating...
 >> Done
 >> Constructing graphics...
 >>
 >> So far, excellent, however, when I look into a smaller region I get an
 >> error message:
 >>
 >> > prueba <- GRanges( "16", IRanges(start=69718724, end=69720078 ))
 >> > autoplot( tx, prueba, group.selfish=TRUE, names.expr="")
 >> Aggregating TranscriptDb...
 >> Parsing exons...
 >> Parsing cds...
 >> Parsing transcripts...
 >> Parsing utrs and aggregating...
 >> Error in DataFrame(...) : different row counts implied by arguments
 >>
 >> I believe it has to do with recent modifications of ggbio, since I do not
 >> get the error message with older versions, e.g. 1.9.7.
 >>
 >> > sessionInfo()
 >> R Under development (unstable) (2013-07-01 r63121)
 >> Platform: x86_64-unknown-linux-gnu (64-bit)
 >>
 >> locale:
 >>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 >>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 >>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 >>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 >>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
 >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
 >>
 >> attached base packages:
 >> [1] parallel  stats     graphics  grDevices utils     datasets methods
 >> [8] base
 >>
 >> other attached packages:
 >>  [1] ggbio_1.11.0           ggplot2_0.9.3.1 GenomicFeatures_1.15.0
 >>  [4] AnnotationDbi_1.23.28  Biobase_2.21.7 GenomicRanges_1.13.56
 >>  [7] XVector_0.1.4          IRanges_1.19.40 BiocGenerics_0.7.8
 >> [10] BiocInstaller_1.13.1
 >>
 >> loaded via a namespace (and not attached):
 >>  [1] biomaRt_2.17.3           Biostrings_2.29.19 biovizBase_1.9.4
 >>  [4] bitops_1.0-6             BSgenome_1.29.1 cluster_1.14.4
 >>  [7] colorspace_1.2-4         DBI_0.2-7 dichromat_2.0-0
 >> [10] digest_0.6.3             grid_3.1.0 gridExtra_0.9.1
 >> [13] gtable_0.1.2             Hmisc_3.12-2 labeling_0.2
 >> [16] lattice_0.20-24          MASS_7.3-29 munsell_0.4.2
 >> [19] plyr_1.8                 proto_0.3-10 RColorBrewer_1.0-5
 >> [22] RCurl_1.95-4.1           reshape2_1.2.2 rpart_4.1-3
 >> [25] Rsamtools_1.13.53        RSQLite_0.11.4 rtracklayer_1.21.14
 >> [28] scales_0.2.3             stats4_3.1.0 stringr_0.6.2
 >> [31] tools_3.1.0              VariantAnnotation_1.7.57 XML_3.98-1.1
 >> [34] zlibbioc_1.7.0
 >>
 >> Best regards,
 >> Alejandro Reyes
 >>
 >
 >
 >
 >--
 >Tengfei Yin, PhD
 >Seven Bridges Genomics
 >sbgenomics.com
 >625 Mt. Auburn St. Suite #208
 >Cambridge, MA 02138
 >(617) 866-0446
 >