[BioC] [Hinxton #251937] RE: GenomicFeatures::makeTranscriptDbFromBiomart - BioMart data anomaly: for some transcripts, the cds cumulative length inferred from the exon and UTR info doesn't match the "cds_length" attribute from BioMart

Cook, Malcolm MEC at stowers.org
Tue Feb 7 22:35:27 CET 2012


Herve, Thanks so much for digging into this.

Rhoda, I submitted a ticket to the Ensembl helpdesk as suggested, and have added them as recipients of this message (after changing the subject to include the issue tracker number).

Ensembl helpdesk, I expect that Herve's detailed report, below, provides an example of the reported data anomaly that will help resolve the underlying issue.

Cheers,

~Malcolm


> -----Original Message-----
> From: Hervé Pagès [mailto:hpages at fhcrc.org]
> Sent: Tuesday, February 07, 2012 2:37 PM
> To: Rhoda Kinsella; bioconductor at r-project.org
> Cc: Cook, Malcolm
> Subject: Re: [BioC] GenomicFeatures::makeTranscriptDbFromBiomart -
> BioMart data anomaly: for some transcripts, the cds cumulative length
> inferred from the exon and UTR info doesn't match the "cds_length"
> attribute from BioMart
> 
> Hi Rhoda, Malcolm, and others,
> 
> So after taking a closer look at this, I can confirm that the reported
> "cds_length" looks wrong for some Fly transcripts. Take for example
> the FBtr0079414 transcript (minus strand):
> 
>  > library(biomaRt)
>  > mart1 <- useMart(biomart="ensembl",
> +                  dataset="dmelanogaster_gene_ensembl")
>  > attributes <- c("ensembl_transcript_id", "strand",
> +                 "rank", "exon_chrom_start", "exon_chrom_end",
> +                 "5_utr_start", "5_utr_end", "3_utr_start", "3_utr_end",
> +                 "cds_length")
>  > filters <- "ensembl_transcript_id"
>  > values <- "FBtr0079414"
>  > getBM(attributes=attributes, filters=filters, values=values, mart=mart1)
>   ensembl_transcript_id strand rank exon_chrom_start exon_chrom_end 5_utr_start
> 1           FBtr0079414     -1    1          7218909        7220029     7219112
> 2           FBtr0079414     -1    2          7218643        7218853          NA
>   5_utr_end 3_utr_start 3_utr_end cds_length
> 1   7220029          NA        NA        204
> 2        NA     7218643   7218853        204
> 
> 2 exons: The 3' UTR (located on exon 2) spans the entire exon so no
> CDS on this exon. The start of the 5' UTR (located on exon 1) is 203
> bases upstream of the exon start, so the inferred CDS length is 203.
> But the reported cds_length is 204. Something looks wrong.
> 
> For other transcripts, e.g. FBtr0300689 (plus strand), things look OK:
> 
>  > values <- "FBtr0300689"
>  > getBM(attributes=attributes, filters=filters, values=values, mart=mart1)
>   ensembl_transcript_id strand rank exon_chrom_start exon_chrom_end 5_utr_start
> 1           FBtr0300689      1    1             7529           8116        7529
> 2           FBtr0300689      1    2             8193           9484          NA
>   5_utr_end 3_utr_start 3_utr_end cds_length
> 1      7679          NA        NA        855
> 2        NA        8611      9484        855
> 
> 2 exons: The end of the 5' UTR (located on exon 1) is 437 bases
> upstream of the exon end. The start of the 3' UTR (located on exon 2)
> is 418 bases downstream of the exon start. So the CDS total length is
> 437 + 418 = 855, as reported.
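
The arithmetic in the two examples quoted above can be reproduced offline. What follows is a minimal sketch, assuming the coordinates are typed in by hand from the getBM() output; inferred_cds_length() is a hypothetical helper written for illustration and is not a function from GenomicFeatures or biomaRt.

```r
# Hypothetical helper (not part of GenomicFeatures or biomaRt): infer the
# cumulative CDS length of one transcript by subtracting the UTR widths
# from the exon widths. Only widths are used, so the strand does not
# matter. Assumes every UTR range lies entirely within its exon.
inferred_cds_length <- function(bm) {
  width_or_zero <- function(start, end) {
    ifelse(is.na(start) | is.na(end), 0L, end - start + 1L)
  }
  exon_width <- bm$exon_chrom_end - bm$exon_chrom_start + 1L
  utr5_width <- width_or_zero(bm$`5_utr_start`, bm$`5_utr_end`)
  utr3_width <- width_or_zero(bm$`3_utr_start`, bm$`3_utr_end`)
  sum(exon_width - utr5_width - utr3_width)
}

# Coordinates copied from the getBM() output for FBtr0079414 (minus strand).
fbtr0079414 <- data.frame(
  exon_chrom_start = c(7218909L, 7218643L),
  exon_chrom_end   = c(7220029L, 7218853L),
  `5_utr_start`    = c(7219112L, NA),
  `5_utr_end`      = c(7220029L, NA),
  `3_utr_start`    = c(NA, 7218643L),
  `3_utr_end`      = c(NA, 7218853L),
  check.names = FALSE
)

# Coordinates copied from the getBM() output for FBtr0300689 (plus strand).
fbtr0300689 <- data.frame(
  exon_chrom_start = c(7529L, 8193L),
  exon_chrom_end   = c(8116L, 9484L),
  `5_utr_start`    = c(7529L, NA),
  `5_utr_end`      = c(7679L, NA),
  `3_utr_start`    = c(NA, 8611L),
  `3_utr_end`      = c(NA, 9484L),
  check.names = FALSE
)

inferred_cds_length(fbtr0079414)  # 203, but BioMart reports cds_length 204
inferred_cds_length(fbtr0300689)  # 855, matching the reported cds_length
```

Working from widths rather than positions keeps the sketch identical for plus and minus strand transcripts.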
> 
> @Malcolm and other makeTranscriptDbFromBiomart() users: I'm about to
> commit a patch to this function so that this anomaly in the Ensembl
> data causes a warning instead of an error. Also the warning will
> display the first 6 affected transcripts. The patch will make it into
> GenomicFeatures 1.6.8 (release) and 1.7.14 (devel), which will become
> available via biocLite() in the next 24-36 hours.
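
For readers who want the same behaviour in their own scripts, here is a minimal sketch of a check that warns instead of stopping and names at most the first 6 affected transcripts. It is an assumption-laden illustration, not the actual GenomicFeatures patch: warn_on_cds_mismatch() and its arguments are hypothetical names.

```r
# Hypothetical sanity check (NOT the actual GenomicFeatures code): compare
# inferred and reported CDS lengths per transcript and issue a warning,
# rather than an error, listing at most the first 6 mismatching ids.
warn_on_cds_mismatch <- function(tx_id, inferred, reported) {
  mismatch <- !is.na(inferred) & !is.na(reported) & inferred != reported
  bad <- tx_id[mismatch]
  if (length(bad) > 0L) {
    shown <- paste(head(bad, 6L), collapse = ", ")
    more <- if (length(bad) > 6L) ", ..." else ""
    warning("BioMart data anomaly: for ", length(bad), " transcript(s), ",
            "the inferred cds length doesn't match \"cds_length\": ",
            shown, more, call. = FALSE)
  }
  # Return all affected ids invisibly so the caller can inspect them.
  invisible(bad)
}

# Values taken from the two transcripts discussed in this thread.
tx_id    <- c("FBtr0079414", "FBtr0300689")
inferred <- c(203L, 855L)
reported <- c(204L, 855L)

bad <- warn_on_cds_mismatch(tx_id, inferred, reported)  # emits a warning
print(bad)
```

Returning the full vector of affected ids, while printing only a handful in the warning, keeps the message short but still lets the caller report every errant transcript.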
> 
> Cheers,
> H.
> 
> 
> On 02/06/2012 02:18 PM, Hervé Pagès wrote:
> > Hi Rhoda and others,
> >
> > I still need to check that this error issued by internal helper
> > .extractCdsRangesFromBiomartTable() about "the cds cumulative
> > length inferred from the exon and UTR not matching the cds_length
> > attribute from BioMart" is not a FALSE positive.
> >
> > I'm planning to patch the code in charge of this sanity check
> > so it issues a warning instead of an error and it displays
> > something more useful than just "for some transcripts etc...".
> > Would be nice to know at least for which transcript.
> >
> > I'll keep you informed, thanks!
> > H.
> >
> >
> > On 02/06/2012 12:53 AM, Rhoda Kinsella wrote:
> >> Hi Malcolm and Marc,
> >> Please submit an Ensembl helpdesk ticket about this issue along with a
> >> detailed example to (helpdesk at ensembl.org) and we will look into it.
> >> Kind regards
> >> Rhoda
> >>
> >>
> >> On 3 Feb 2012, at 20:32, Cook, Malcolm wrote:
> >>
> >>> Hi Marc, and other `library(GenomicFeatures)` users working in fly,
> >>>
> >>> I just changed Subject to keep alive one of the issues I still have,
> >>> namely:
> >>>
> >>> I get the following error:
> >>>
> >>>> library(GenomicFeatures)
> >>>> txdb<-makeTranscriptDbFromBiomart(biomart="ensembl",
> >>>> dataset="dmelanogaster_gene_ensembl", circ_seqs=NULL)
> >>> Download and preprocess the 'transcripts' data frame ... OK
> >>> Download and preprocess the 'chrominfo' data frame ... OK
> >>> Download and preprocess the 'splicings' data frame ... Error
> >>> in .extractCdsRangesFromBiomartTable(bm_table) :
> >>> BioMart data anomaly: for some transcripts, the cds cumulative
> >>> length inferred from the exon and UTR info doesn't match the
> >>> "cds_length" attribute from BioMart
> >>>
> >>>
> >>> Marc, you already observed that:
> >>>
> >>>>>> the data for cds ranges and total cds length (both from biomaRt)
> >>>>>> no longer agree with each other. In other words, the data from
> >>>>>> the current drosophila ranges in biomaRt seems to disagree with
> >>>>>> itself, and so the code is refusing to make a package out of this
> >>>>>> data as a result. To get the 2nd issue fixed probably involves
> >>>>>> talking to ensembl about their CDS data for fly to see if we can
> >>>>>> resolve the discrepancy.
> >>>>> I would be happy to take this to them.
> >>>
> >>> I still wonder:
> >>>
> >>>> Can you recommend a best way to get a more diagnostic trace from the
> >>>> attempt at txdb creation so we can correctly report to ensembl team
> >>>> the errant transcript(s)?
> >>>
> >>> I would be happy to take this up with Ensembl team, but, need
> >>> details which I don't know how to produce.
> >>>
> >>>
> >>> Finally, on the side, here is a tiny suggestion:
> >>>
> >>> * change the default for circ_seqs in makeTranscriptDbFromBiomart
> >>> to be NULL, rather than an organism-specific (human) value.
> >>>
> >>> Regards,
> >>>
> >>> --Malcolm
> >>>
> >>>
> >>> R version 2.14.0 (2011-10-31)
> >>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
> >>>
> >>> locale:
> >>> [1] C
> >>>
> >>> attached base packages:
> >>> [1] stats graphics grDevices utils datasets methods base
> >>>
> >>> other attached packages:
> >>> [1] GenomicFeatures_1.6.7 AnnotationDbi_1.16.11 Biobase_2.14.0
> >>> [4] GenomicRanges_1.6.6 IRanges_1.12.5
> >>>
> >>> loaded via a namespace (and not attached):
> >>> [1] BSgenome_1.22.0 Biostrings_2.22.0 DBI_0.2-5 RCurl_1.9-5
> >>> [5] RSQLite_0.11.1 XML_3.9-4 biomaRt_2.10.0 rtracklayer_1.14.4
> >>> [9] tools_2.14.0 zlibbioc_1.0.0
> >>>>
> >>>
> >>> _______________________________________________
> >>> Bioconductor mailing list
> >>> Bioconductor at r-project.org
> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>> Search the archives:
> >>> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>
> >> Rhoda Kinsella Ph.D.
> >> Ensembl Production Project Leader,
> >> European Bioinformatics Institute (EMBL-EBI),
> >> Wellcome Trust Genome Campus,
> >> Hinxton
> >> Cambridge CB10 1SD,
> >> UK.
> >>
> >
> >
> 
> 
> --
> Hervé Pagès
> 
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
> 
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319


