[BioC] [Hinxton #251937] RE: GenomicFeatures::makeTranscriptDbFromBiomart - BioMart data anomaly: for some transcripts, the cds cumulative length inferred from the exon and UTR info doesn't match the "cds_length" attribute from BioMart

Hervé Pagès hpages at fhcrc.org
Tue Mar 13 23:09:10 CET 2012


Hi Steffen,

On 03/13/2012 02:37 PM, Steffen Durinck wrote:
> Hi Herve,
>
> To answer your question:
>
> "Bioconductor biomaRt package is still accessing Ensembl Genes 65,
> I wonder why, but this is a different story..."
>
> By default biomaRt queries http://www.biomart.org , which hosts a copy
> of Ensembl.  There is a time lag between an Ensembl update and an update
> of Ensembl on biomart.org.

Thanks Steffen for the details. Yes, I knew about this lag; we see it at
each new Ensembl release. I guess the grumbling was more like "why on
earth does it take 2 weeks every time for a new Ensembl release to
propagate to http://biomart.org?". Or, "why on earth do we have to wait
2 weeks after each new Ensembl release to see our unit tests break in
the GenomicFeatures package?" ;-)

>
> An alternative is to query ensembl directly by specifying the host:
>
>  > library(biomaRt)
>  > listMarts(host="uswest.ensembl.org")
>                 biomart               version
> 1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 66
> 2     ENSEMBL_MART_SNP  Ensembl Variation 66
>  > mart = useMart("ENSEMBL_MART_ENSEMBL", dataset="hsapiens_gene_ensembl",
>                   host="uswest.ensembl.org")

Thanks for the reminder. I wish they would use the same biomart name:
why "ensembl" on http://biomart.org and "ENSEMBL_MART_ENSEMBL" on
http://uswest.ensembl.org? Now I'll stop grumbling...

>
>
> Note that the normal Ensembl host is www.ensembl.org, but for some
> reason, if you use it from the US west coast, you end up on a page that
> redirects to uswest.ensembl.org. This redirection is new, and biomaRt
> currently won't work with www.ensembl.org as host when you're based in
> the US, so use uswest.ensembl.org instead.

Thanks for the extra details.

Cheers,
H.

>
> Cheers,
> Steffen
>
>
>
>
> 2012/3/13 Hervé Pagès <hpages at fhcrc.org>
>
>     Hi Malcolm, Rhoda,
>
>     Did you hear back from the Ensembl helpdesk about this issue?
>
>     AFAICT the issue is still in Ensembl release 66 (released 10 days
>     ago). For example, when querying directly the Ensembl Mart, I get
>     the following for transcript FBtr0079414 (dmelanogaster):
>
>       Exon Rank in Transcript | Chromosome Name | Strand
>       1                       | 2L              | -1
>       2                       | 2L              | -1
>
>       Exon Chr Start (bp) | Exon Chr End (bp)
>       7218909             | 7220029
>       7218643             | 7218853
>
>       5' UTR Start | 5' UTR End | 3' UTR Start | 3' UTR End
>       7219112      | 7220029    |              |
>                    |            | 7218643      | 7218853
>
>       CDS Start | CDS End | CDS Length
>       1         | 203     | 204
>       204       | 204     | 204
>
>     Note that querying directly the Ensembl Mart thru the web interface
>     allows me to choose database Ensembl Genes 66 but querying with the
>     Bioconductor biomaRt package is still accessing Ensembl Genes 65,
>     I wonder why, but this is a different story...
>
>     So the "CDS Length" column (which, IIUC, is actually supposed to
>     report the "Total CDS Length") is still incompatible with the
>     exon/UTR starts and ends. If the exon/UTR starts and ends
>     are correct then the total CDS length should be 203, not 204.
>
>     But also, it could be that the exon/UTR starts and ends are
>     incorrect.
>
>     Finally note that there is no CDS region on exon 2 (the 3' UTR
>     entirely spans exon 2) but the Ensembl Mart reports a CDS region
>     of length 1 on this exon (CDS Start = CDS End = 204). This is
>     probably why then the reported CDS Length is 204 (at least it's
>     consistent with the highest "CDS End" value).
>
>     Would be nice to see this dataset fixed.
>
>     Thanks,
>     H.
>
>
>     On 02/15/2012 06:33 AM, Cook, Malcolm wrote:
>
>         Dear helpdesk at ensemblgenomes.org,
>
>         I am following up on this issue which I understand Rhoda
>         Kinsella at EBI to have forwarded to you.
>
>         I originally identified and reported the issue, first to the
>         bioconductor email list where Rhoda picked up on it and replied
>         as below.
>
>         I am trying to ensure that there is a tracked issue with
>         ensemblgenomes.org with my name on it. It does not have to be
>         resolved with a fix; I'd just like to be kept informed as you
>         resolve it.
>
>         If there is anything further I can provide pertaining to
>         describing or resolving the issue, please advise.
>
>         Of course the issue may be in fact even further upstream – in
>         flybase.  I've not tried to find the root cause myself.
>
>         Thanks,
>
>         Malcolm Cook
>
>
>         From: Rhoda Kinsella <rhoda at ebi.ac.uk>
>         Date: Wed, 8 Feb 2012 10:27:02 -0600
>         To: Malcolm Cook <mec at stowers.org>
>         Cc: Hervé Pagès <hpages at fhcrc.org>, "bioconductor at r-project.org"
>         <bioconductor at r-project.org>
>         Subject: Re: [Hinxton #251937] RE: [BioC]
>         GenomicFeatures::makeTranscriptDbFromBiomart - BioMart data
>         anomaly: for some transcripts, the cds cumulative length
>         inferred from the exon and UTR info doesn't match the
>         "cds_length" attribute from BioMart
>
>         Hi Malcolm and Hervé
>         This appears to be a data issue with the Drosophila core
>         database which was then propagated into BioMart. I have
>         forwarded the issue to the Ensembl Genomes project as they
>         maintain this database and they will respond as soon as possible.
>         Regards
>         Rhoda
>
>
>         On 7 Feb 2012, at 21:35, Cook, Malcolm wrote:
>
>         Herve, Thanks so much for digging into this.
>
>         Rhoda, I had submitted a ticket as suggested to Ensembl
>         helpdesk, and have included them as recipients to this message
>         (after changing the subject to include the issue tracker number).
>
>         Ensembl helpdesk, I expect that Herve's detailed report, below,
>         provides an example of the reported data anomaly that will help
>         resolve the underlying issue.
>
>         Cheers,
>
>         ~Malcolm
>
>
>         -----Original Message-----
>         From: Hervé Pagès [mailto:hpages at fhcrc.org]
>         Sent: Tuesday, February 07, 2012 2:37 PM
>         To: Rhoda Kinsella; bioconductor at r-project.org
>         Cc: Cook, Malcolm
>         Subject: Re: [BioC] GenomicFeatures::makeTranscriptDbFromBiomart -
>         BioMart data anomaly: for some transcripts, the cds cumulative length
>         inferred from the exon and UTR info doesn't match the "cds_length"
>         attribute from BioMart
>
>         Hi Rhoda, Malcolm, and others,
>
>         So after taking a closer look at this, I can confirm that the
>         reported "cds_length" looks wrong for some Fly transcripts. Take
>         for example the FBtr0079414 transcript (minus strand):
>
>         library(biomaRt)
>         mart1 <- useMart(biomart="ensembl",
>                          dataset="dmelanogaster_gene_ensembl")
>         attributes <- c("ensembl_transcript_id", "strand",
>                         "rank", "exon_chrom_start", "exon_chrom_end",
>                         "5_utr_start", "5_utr_end", "3_utr_start", "3_utr_end",
>                         "cds_length")
>         filters <- "ensembl_transcript_id"
>         values <- "FBtr0079414"
>         getBM(attributes=attributes, filters=filters, values=values,
>               mart=mart1)
>           ensembl_transcript_id strand rank exon_chrom_start exon_chrom_end 5_utr_start
>         1           FBtr0079414     -1    1          7218909        7220029     7219112
>         2           FBtr0079414     -1    2          7218643        7218853          NA
>           5_utr_end 3_utr_start 3_utr_end cds_length
>         1   7220029          NA        NA        204
>         2        NA     7218643   7218853        204
>
>         2 exons: The 3' UTR (located on exon 2) spans the entire exon so no
>         CDS on this exon. The start of the 5' UTR (located on exon 1) is 203
>         bases upstream of the exon start. But the reported cds_length is
>         204. Something looks wrong.
>
>         For other transcripts, e.g. FBtr0300689 (plus strand), things
>         look OK:
>
>         values <- "FBtr0300689"
>         getBM(attributes=attributes, filters=filters, values=values,
>               mart=mart1)
>           ensembl_transcript_id strand rank exon_chrom_start exon_chrom_end 5_utr_start
>         1           FBtr0300689      1    1             7529           8116        7529
>         2           FBtr0300689      1    2             8193           9484          NA
>           5_utr_end 3_utr_start 3_utr_end cds_length
>         1      7679          NA        NA        855
>         2        NA        8611      9484        855
>
>         2 exons: The end of the 5' UTR (located on exon 1) is 437 bases
>         upstream of the exon end. The start of the 3' UTR (located on
>         exon 2) is 418 bases downstream of the exon start. So the CDS
>         total length is 437 + 418 = 855, as reported.
>
>         @Malcolm and other makeTranscriptDbFromBiomart() users: I'm about to
>         commit a patch to this function so that this anomaly in the Ensembl
>         data causes a warning instead of an error. Also the warning will
>         display the first 6 affected transcripts. The patch will make it
>         into GenomicFeatures 1.6.8 (release) and 1.7.14 (devel), which will
>         become available via biocLite() in the next 24-36 hours.
>
>         Cheers,
>         H.
>
>
>         On 02/06/2012 02:18 PM, Hervé Pagès wrote:
>         Hi Rhoda and others,
>
>         I still need to check that this error issued by internal helper
>         .extractCdsRangesFromBiomartTable() about "the cds cumulative
>         length inferred from the exon and UTR not matching the cds_length
>         attribute from BioMart" is not a FALSE positive.
>
>         I'm planning to patch the code in charge of this sanity check
>         so it issues a warning instead of an error and it displays
>         something more useful than just "for some transcripts etc...".
>         Would be nice to know at least for which transcript.
>
>         I'll keep you informed, thanks!
>         H.
>
>
>         On 02/06/2012 12:53 AM, Rhoda Kinsella wrote:
>         Hi Malcolm and Marc,
>         Please submit an Ensembl helpdesk ticket about this issue along
>         with a detailed example to helpdesk at ensembl.org and we will
>         look into it.
>         Kind regards
>         Rhoda
>
>
>         On 3 Feb 2012, at 20:32, Cook, Malcolm wrote:
>
>         Hi Marc, and other `library(GenomicFeatures)` users working in fly,
>
>         I just changed Subject to keep alive one of the issues I still have,
>         namely:
>
>         I get the following error:
>
>         library(GenomicFeatures)
>         txdb <- makeTranscriptDbFromBiomart(biomart="ensembl",
>                     dataset="dmelanogaster_gene_ensembl", circ_seqs=NULL)
>         Download and preprocess the 'transcripts' data frame ... OK
>         Download and preprocess the 'chrominfo' data frame ... OK
>         Download and preprocess the 'splicings' data frame ... Error
>         in .extractCdsRangesFromBiomartTable(bm_table) :
>         BioMart data anomaly: for some transcripts, the cds cumulative
>         length inferred from the exon and UTR info doesn't match the
>         "cds_length" attribute from BioMart
>
>
>         Marc, you already observed that:
>
>         the data for cds ranges and total cds length (both from biomaRt) no
>         longer agree with each other. In other words, the data from the
>         current drosophila ranges in biomaRt seems to disagree with itself,
>         and so the code is refusing to make a package out of this data as a
>         result. To get the 2nd issue fixed probably involves talking to
>         ensembl about their CDS data for fly to see if we can resolve the
>         discrepancy. I would be happy to take this to them.
>
>         I still wonder:
>
>         Can you recommend a best way to get a more diagnostic trace from the
>         attempt at txdb creation so we can correctly report to ensembl team
>         the errant transcript(s) ?
>
>         I would be happy to take this up with Ensembl team, but, need
>         details which I don't know how to produce.
>
>
>         Finally, on the side, here is a tiny suggestion:
>
>         * change the default for circ_seqs in makeTranscriptDbFromBiomart
>         to be NULL, instead of any organism (human) specific.
>
>         Regards,
>
>         --Malcolm
>
>
>         R version 2.14.0 (2011-10-31)
>         Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
>         locale:
>         [1] C
>
>         attached base packages:
>         [1] stats graphics grDevices utils datasets methods base
>
>         other attached packages:
>         [1] GenomicFeatures_1.6.7 AnnotationDbi_1.16.11 Biobase_2.14.0
>         [4] GenomicRanges_1.6.6 IRanges_1.12.5
>
>         loaded via a namespace (and not attached):
>         [1] BSgenome_1.22.0    Biostrings_2.22.0  DBI_0.2-5          RCurl_1.9-5
>         [5] RSQLite_0.11.1     XML_3.9-4          biomaRt_2.10.0     rtracklayer_1.14.4
>         [9] tools_2.14.0       zlibbioc_1.0.0
>
>
>         _______________________________________________
>         Bioconductor mailing list
>         Bioconductor at r-project.org
>         https://stat.ethz.ch/mailman/listinfo/bioconductor
>         Search the archives:
>         http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>         Rhoda Kinsella Ph.D.
>         Ensembl Production Project Leader,
>         European Bioinformatics Institute (EMBL-EBI),
>         Wellcome Trust Genome Campus,
>         Hinxton
>         Cambridge CB10 1SD,
>         UK.
>
>
>
>
>
>
>         --
>         Hervé Pagès
>
>         Program in Computational Biology
>         Division of Public Health Sciences
>         Fred Hutchinson Cancer Research Center
>         1100 Fairview Ave. N, M1-B514
>         P.O. Box 19024
>         Seattle, WA 98109-1024
>
>         E-mail: hpages at fhcrc.org
>         Phone: (206) 667-5791
>         Fax: (206) 667-1319
>
>
>
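
For anyone who wants to redo the arithmetic from the quoted thread by
hand, here is a rough sketch in plain R. It is NOT the code used by
makeTranscriptDbFromBiomart(), only an illustration of the idea: on each
exon, the CDS bases are the exon width minus the width of whatever UTR
sits on that exon, and the per-transcript sum should equal the reported
"cds_length". The rows are copied from the getBM() output quoted above.
The column names utr5_start, utr5_end, utr3_start, utr3_end (renamed so
they are syntactic R names) and the helper width0() are made up for this
example.

```r
# Exon/UTR rows copied from the getBM() output quoted in this thread.
# Column names utr5_*/utr3_* are hypothetical renames of "5_utr_start" etc.
bm <- data.frame(
  ensembl_transcript_id = c("FBtr0079414", "FBtr0079414",
                            "FBtr0300689", "FBtr0300689"),
  rank             = c(1L, 2L, 1L, 2L),
  exon_chrom_start = c(7218909, 7218643, 7529, 8193),
  exon_chrom_end   = c(7220029, 7218853, 8116, 9484),
  utr5_start       = c(7219112, NA, 7529, NA),
  utr5_end         = c(7220029, NA, 7679, NA),
  utr3_start       = c(NA, 7218643, NA, 8611),
  utr3_end         = c(NA, 7218853, NA, 9484),
  cds_length       = c(204, 204, 855, 855),
  stringsAsFactors = FALSE
)

# Hypothetical helper: width of the closed interval [start, end],
# counting a missing (NA) interval as 0 bases.
width0 <- function(start, end) ifelse(is.na(start), 0, end - start + 1)

# CDS bases on each exon = exon width minus the UTR bases on that exon.
bm$exon_cds <- width0(bm$exon_chrom_start, bm$exon_chrom_end) -
  width0(bm$utr5_start, bm$utr5_end) -
  width0(bm$utr3_start, bm$utr3_end)

# Sum per transcript and compare with the reported cds_length.
inferred <- tapply(bm$exon_cds, bm$ensembl_transcript_id, sum)
reported <- tapply(bm$cds_length, bm$ensembl_transcript_id, unique)
mismatch <- names(inferred)[as.vector(inferred != reported)]

print(inferred)   # FBtr0079414: 203, FBtr0300689: 855
print(mismatch)   # "FBtr0079414"
```

This assumes at most one 5' UTR interval and one 3' UTR interval per
exon row, which holds for the two transcripts above but is an assumption
of the sketch, not something the BioMart output guarantees.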


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


