[BioC] [Hinxton #251937] RE: GenomicFeatures::makeTranscriptDbFromBiomart - BioMart data anomaly: for some transcripts, the cds cumulative length inferred from the exon and UTR info doesn't match the "cds_length" attribute from BioMart

Cook, Malcolm MEC at stowers.org
Wed Mar 14 16:37:44 CET 2012


Herve,

I'm following up on this by bringing you into an exchange with the Ensembl member handling dmel.  I hope with your help they can completely address the issue.

Thanks,

~Malcolm


> -----Original Message-----
> From: Hervé Pagès [mailto:hpages at fhcrc.org]
> Sent: Tuesday, March 13, 2012 3:32 PM
> To: Cook, Malcolm
> Cc: Rhoda Kinsella; bioconductor at r-project.org
> Subject: Re: [Hinxton #251937] RE: [BioC]
> GenomicFeatures::makeTranscriptDbFromBiomart - BioMart data anomaly:
> for some transcripts, the cds cumulative length inferred from the exon and
> UTR info doesn't match the "cds_length" attribute from BioMart
> 
> Hi Malcolm, Rhoda,
> 
> Did you hear back from the Ensembl helpdesk about this issue?
> 
> AFAICT the issue is still in Ensembl release 66 (released 10 days
> ago). For example, when querying directly the Ensembl Mart, I get
> the following for transcript FBtr0079414 (dmelanogaster):
> 
>    Exon Rank in Transcript | Chromosome Name | Strand
>    1                       | 2L              | -1
>    2                       | 2L              | -1
> 
>    Exon Chr Start (bp) | Exon Chr End (bp)
>    7218909             | 7220029
>    7218643             | 7218853
> 
>    5' UTR Start | 5' UTR End | 3' UTR Start | 3' UTR End
>    7219112      | 7220029    |              |
>                 |            | 7218643      | 7218853
> 
>    CDS Start | CDS End | CDS Length
>    1         | 203     | 204
>    204       | 204     | 204
> 
> Note that querying directly the Ensembl Mart thru the web interface
> allows me to choose database Ensembl Genes 66 but querying with the
> Bioconductor biomaRt package is still accessing Ensembl Genes 65,
> I wonder why, but this is a different story...
> 
> So the "CDS Length" column (which, IIUC, is actually supposed to
> report the "Total CDS Length") is still incompatible with the
> exon/UTR starts and ends. If the exon/UTR starts and ends
> are correct then the total CDS length should be 203, not 204.
> 
> But also, it could be that the exon/UTR starts and ends are
> incorrect.
> 
> Finally note that there is no CDS region on exon 2 (the 3' UTR
> entirely spans exon 2) but the Ensembl Mart reports a CDS region
> of length 1 on this exon (CDS Start = CDS End = 204). This is
> probably why then the reported CDS Length is 204 (at least it's
> consistent with the highest "CDS End" value).
> 
> Would be nice to see this dataset fixed.
> 
> Thanks,
> H.
> 
> 
> On 02/15/2012 06:33 AM, Cook, Malcolm wrote:
> > Dear helpdesk at ensemblgenomes.org,
> >
> > I am following up on this issue, which I understand Rhoda Kinsella at EBI
> > has forwarded to you.
> >
> > I originally identified and reported the issue, first to the Bioconductor
> > email list, where Rhoda picked up on it and replied as below.
> >
> > I am trying to ensure that there is a tracked issue with ensemblgenomes.org
> > with my name on it. It does not have to be resolved with a fix; I'd just
> > like to be kept informed as you resolve it.
> >
> > If there is anything further I can provide to help describe or resolve
> > the issue, please advise.
> >
> > Of course, the issue may in fact be even further upstream, in FlyBase. I've
> > not tried to find the root cause myself.
> >
> > Thanks,
> >
> > Malcolm Cook
> >
> >
> > From: Rhoda Kinsella <rhoda at ebi.ac.uk>
> > Date: Wed, 8 Feb 2012 10:27:02 -0600
> > To: Malcolm Cook <mec at stowers.org>
> > Cc: Hervé Pagès <hpages at fhcrc.org>, "bioconductor at r-project.org"
> > <bioconductor at r-project.org>
> > Subject: Re: [Hinxton #251937] RE: [BioC]
> > GenomicFeatures::makeTranscriptDbFromBiomart - BioMart data anomaly:
> > for some transcripts, the cds cumulative length inferred from the exon and
> > UTR info doesn't match the "cds_length" attribute from BioMart
> >
> > Hi Malcolm and Hervé
> > This appears to be a data issue with the Drosophila core database which
> > was then propagated into BioMart. I have forwarded the issue to the
> > Ensembl Genomes project as they maintain this database and they will
> > respond as soon as possible.
> > Regards
> > Rhoda
> >
> >
> > On 7 Feb 2012, at 21:35, Cook, Malcolm wrote:
> >
> > Herve, Thanks so much for digging into this.
> >
> > Rhoda, I had submitted a ticket as suggested to Ensembl helpdesk, and
> > have included them as recipients to this message (after changing the
> > subject to include the issue tracker number).
> >
> > Ensembl helpdesk, I expect that Herve's detailed report, below, provides
> > an example of the reported data anomaly that will help resolve the
> > underlying issue.
> >
> > Cheers,
> >
> > ~Malcolm
> >
> >
> > -----Original Message-----
> > From: Hervé Pagès [mailto:hpages at fhcrc.org]
> > Sent: Tuesday, February 07, 2012 2:37 PM
> > To: Rhoda Kinsella; bioconductor at r-project.org
> > Cc: Cook, Malcolm
> > Subject: Re: [BioC] GenomicFeatures::makeTranscriptDbFromBiomart -
> > BioMart data anomaly: for some transcripts, the cds cumulative length
> > inferred from the exon and UTR info doesn't match the "cds_length"
> > attribute from BioMart
> >
> > Hi Rhoda, Malcolm, and others,
> >
> > So after taking a closer look at this, I can confirm that the reported
> > "cds_length" looks wrong for some Fly transcripts. Take for example
> > the FBtr0079414 transcript (minus strand):
> >
> > library(biomaRt)
> > mart1 <- useMart(biomart="ensembl",
> >                  dataset="dmelanogaster_gene_ensembl")
> > attributes <- c("ensembl_transcript_id", "strand",
> >                 "rank", "exon_chrom_start", "exon_chrom_end",
> >                 "5_utr_start", "5_utr_end", "3_utr_start", "3_utr_end",
> >                 "cds_length")
> > filters <- "ensembl_transcript_id"
> > values <- "FBtr0079414"
> > getBM(attributes=attributes, filters=filters, values=values, mart=mart1)
> >   ensembl_transcript_id strand rank exon_chrom_start exon_chrom_end 5_utr_start
> > 1           FBtr0079414     -1    1          7218909        7220029     7219112
> > 2           FBtr0079414     -1    2          7218643        7218853          NA
> >   5_utr_end 3_utr_start 3_utr_end cds_length
> > 1   7220029          NA        NA        204
> > 2        NA     7218643   7218853        204
> >
> > 2 exons: The 3' UTR (located on exon 2) spans the entire exon so no
> > CDS on this exon. The start of the 5' UTR (located on exon 1) is 203
> > bases upstream of the exon start. But the reported cds_length is 204.
> > Something looks wrong.
> >
> > For other transcripts, e.g. FBtr0300689 (plus strand), things look OK:
> >
> > values <- "FBtr0300689"
> > getBM(attributes=attributes, filters=filters, values=values, mart=mart1)
> >   ensembl_transcript_id strand rank exon_chrom_start exon_chrom_end 5_utr_start
> > 1           FBtr0300689      1    1             7529           8116        7529
> > 2           FBtr0300689      1    2             8193           9484          NA
> >   5_utr_end 3_utr_start 3_utr_end cds_length
> > 1      7679          NA        NA        855
> > 2        NA        8611      9484        855
> >
> > 2 exons: The end of the 5' UTR (located on exon 1) is 437 bases
> > upstream of the exon end. The start of the 3' UTR (located on exon 2)
> > is 418 bases downstream of the exon start. So the CDS total length is
> > 437 + 418 = 855, as reported.
> >
> > @Malcolm and other makeTranscriptDbFromBiomart() users: I'm about to
> > commit a patch to this function so that this anomaly in the Ensembl
> > data causes a warning instead of an error. Also the warning will
> > display the first 6 affected transcripts. The patch will make it into
> > GenomicFeatures 1.6.8 (release) and 1.7.14 (devel), which will become
> > available via biocLite() in the next 24-36 hours.
> >
> > Cheers,
> > H.
> >
> >
> > On 02/06/2012 02:18 PM, Hervé Pagès wrote:
> > Hi Rhoda and others,
> >
> > I still need to check that this error issued by internal helper
> > .extractCdsRangesFromBiomartTable() about "the cds cumulative
> > length inferred from the exon and UTR not matching the cds_length
> > attribute from BioMart" is not a false positive.
> >
> > I'm planning to patch the code in charge of this sanity check
> > so it issues a warning instead of an error and it displays
> > something more useful than just "for some transcripts etc...".
> > Would be nice to know at least for which transcript.
> >
> > I'll keep you informed, thanks!
> > H.
> >
> >
> > On 02/06/2012 12:53 AM, Rhoda Kinsella wrote:
> > Hi Malcolm and Marc,
> > Please submit an Ensembl helpdesk ticket about this issue along with a
> > detailed example to helpdesk at ensembl.org and we will look into it.
> > Kind regards
> > Rhoda
> >
> >
> > On 3 Feb 2012, at 20:32, Cook, Malcolm wrote:
> >
> > Hi Marc, and other `library(GenomicFeatures)` users working in fly,
> >
> > I just changed Subject to keep alive one of the issues I still have,
> > namely:
> >
> > I get the following error:
> >
> > library(GenomicFeatures)
> > txdb <- makeTranscriptDbFromBiomart(biomart="ensembl",
> >   dataset="dmelanogaster_gene_ensembl", circ_seqs=NULL)
> > Download and preprocess the 'transcripts' data frame ... OK
> > Download and preprocess the 'chrominfo' data frame ... OK
> > Download and preprocess the 'splicings' data frame ... Error
> > in .extractCdsRangesFromBiomartTable(bm_table) :
> > BioMart data anomaly: for some transcripts, the cds cumulative
> > length inferred from the exon and UTR info doesn't match the
> > "cds_length" attribute from BioMart
> >
> >
> > Marc, you already observed that:
> >
> > the data for cds ranges and total cds length (both from biomaRt) no
> > longer agree with each other. In other words, the data from the current
> > drosophila ranges in biomaRt seems to disagree with itself, and so the
> > code is refusing to make a package out of this data as a result.
> > To get the 2nd issue fixed probably involves talking to ensembl about
> > their CDS data for fly to see if we can resolve the discrepancy.
> > I would be happy to take this to them.
> >
> > I still wonder:
> >
> > Can you recommend a best way to get a more diagnostic trace from the
> > attempt at txdb creation so we can correctly report to ensembl team the
> > errant transcript(s) ?
> >
> > I would be happy to take this up with Ensembl team, but, need
> > details which I don't know how to produce.
> >
> >
> > Finally, on the side, here is a tiny suggestion:
> >
> > * change the default for circ_seqs in makeTranscriptDbFromBiomart
> > to be NULL, instead of anything organism-specific (human).
> >
> > Regards,
> >
> > --Malcolm
> >
> >
> > R version 2.14.0 (2011-10-31)
> > Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
> >
> > locale:
> > [1] C
> >
> > attached base packages:
> > [1] stats graphics grDevices utils datasets methods base
> >
> > other attached packages:
> > [1] GenomicFeatures_1.6.7 AnnotationDbi_1.16.11 Biobase_2.14.0
> > [4] GenomicRanges_1.6.6 IRanges_1.12.5
> >
> > loaded via a namespace (and not attached):
> > [1] BSgenome_1.22.0 Biostrings_2.22.0 DBI_0.2-5 RCurl_1.9-5
> > [5] RSQLite_0.11.1 XML_3.9-4 biomaRt_2.10.0 rtracklayer_1.14.4
> > [9] tools_2.14.0 zlibbioc_1.0.0
> >
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at r-project.org
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives:
> > http://news.gmane.org/gmane.science.biology.informatics.conductor
> >
> > Rhoda Kinsella Ph.D.
> > Ensembl Production Project Leader,
> > European Bioinformatics Institute (EMBL-EBI),
> > Wellcome Trust Genome Campus,
> > Hinxton
> > Cambridge CB10 1SD,
> > UK.
> >
> >
> >
> >
> >
> >
> > --
> > Hervé Pagès
> >
> > Program in Computational Biology
> > Division of Public Health Sciences
> > Fred Hutchinson Cancer Research Center
> > 1100 Fairview Ave. N, M1-B514
> > P.O. Box 19024
> > Seattle, WA 98109-1024
> >
> > E-mail: hpages at fhcrc.org
> > Phone:  (206) 667-5791
> > Fax:    (206) 667-1319
> >
> >
> 
> 
> --
> Hervé Pagès
> 
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
> 
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
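
For reference, the arithmetic discussed in the thread above can be reproduced with a minimal base-R sketch. It is not the code of .extractCdsRangesFromBiomartTable(); it only assumes that each UTR lies entirely inside its exon, so that the CDS width on an exon is the exon width minus the widths of any UTRs on it. The data frame `bm` and the helper `utr_width()` are hypothetical names, and the coordinates are copied by hand from the getBM() output quoted in the thread.

```r
# Exon/UTR coordinates copied from the getBM() output quoted in this thread.
# Column names are illustrative, not the BioMart attribute names.
bm <- data.frame(
  tx         = c("FBtr0079414", "FBtr0079414", "FBtr0300689", "FBtr0300689"),
  exon_start = c(7218909, 7218643, 7529, 8193),
  exon_end   = c(7220029, 7218853, 8116, 9484),
  utr5_start = c(7219112, NA, 7529, NA),
  utr5_end   = c(7220029, NA, 7679, NA),
  utr3_start = c(NA, 7218643, NA, 8611),
  utr3_end   = c(NA, 7218853, NA, 9484),
  cds_length = c(204, 204, 855, 855),
  stringsAsFactors = FALSE
)

# Width of a closed interval [start, end]; 0 when the UTR is absent (NA).
utr_width <- function(start, end) ifelse(is.na(start), 0, end - start + 1)

# Assumption: UTRs are fully contained in their exon, so the CDS part of an
# exon is whatever is left after removing the 5' and 3' UTR widths.
exon_width   <- bm$exon_end - bm$exon_start + 1
bm$cds_width <- exon_width -
  utr_width(bm$utr5_start, bm$utr5_end) -
  utr_width(bm$utr3_start, bm$utr3_end)

# Per-transcript totals: inferred from coordinates vs. reported by BioMart.
inferred <- tapply(bm$cds_width, bm$tx, sum)
reported <- tapply(bm$cds_length, bm$tx, max)

# Transcripts whose inferred total disagrees with the reported cds_length.
mismatch <- names(inferred)[inferred != reported]
print(cbind(inferred, reported))
print(mismatch)
```

With these hand-copied values, only FBtr0079414 is flagged (203 inferred against 204 reported), which matches the observation in the thread.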


