[BioC] [Engineers for ensemblgenomes.org #251937] BioMart data anomaly: for some transcripts, the cds cumulative length inferred from the exon and UTR info doesn't match the "cds_length" attribute from BioMart

Cook, Malcolm MEC at stowers.org
Tue Jun 19 16:35:18 CEST 2012


Hi,

I am chiming in as the original reporter, and cc'ing Herve Pages from the
Bioconductor project, who was instrumental in providing diagnostic feedback
and coded much of the inner workings of the R part.

When I repeat the steps I originally reported, this time against today's
BioMart (Ensembl 67), I still find transcripts with the reported anomaly.

However, for my purposes, I now find the problem greatly ameliorated in
that:
	there are only 5 such transcripts
	they are all in the same alternatively spliced gene
	the Bioconductor package now more gracefully raises a warning with a
	detailed report instead of an error.
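
The difference between a warning and an error matters in practice: a warning lets the call finish and return its result, and the report text can still be captured for later inspection. A minimal base-R sketch of that pattern follows. Note that build_with_anomaly_report is a hypothetical stand-in, not the GenomicFeatures function, so that the example runs without network access.

```r
# Hypothetical stand-in for a builder that reports data anomalies via
# warning() and still returns its result (NOT the real GenomicFeatures code).
build_with_anomaly_report <- function() {
  warning("BioMart data anomaly: 5 transcripts affected")
  "txdb placeholder"
}

captured <- character(0)

# withCallingHandlers() runs the handler when the warning is signalled and
# then resumes the call, so the return value is not lost.
result <- withCallingHandlers(
  build_with_anomaly_report(),
  warning = function(w) {
    captured <<- c(captured, conditionMessage(w))  # keep the report text
    invokeRestart("muffleWarning")                 # suppress console output
  }
)

print(result)
print(captured)
```

With an error (stop() instead of warning()) the call would abort and result would never be assigned, which is why the earlier behaviour left no object to work with.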

I believe that examining the detailed report, included in my transcript
below, will reveal the remaining root cause to you.

Thanks for following up!  I hope this helps, and I look forward to seeing
the ticket closed on this one!

~ Malcolm Cook


$ R
# use the package (assuming it and dependencies are installed)
library(GenomicFeatures)
# and try to build the TranscriptDb (expect error/warning here)
txdb <- makeTranscriptDbFromBiomart(biomart="ensembl",
    dataset="dmelanogaster_gene_ensembl", circ_seqs=NULL)
Download and preprocess the 'transcripts' data frame ... OK

Download and preprocess the 'chrominfo' data frame ... OK
Download and preprocess the 'splicings' data frame ... OK
Download and preprocess the 'genes' data frame ... OK
Prepare the 'metadata' data frame ... OK
metadata: OK
Make the TranscriptDb object ... OK
Warning message:
In .warningWithBioMartDataAnomalyReport(bm_table, idx, id_prefix,  :
  BioMart data anomaly: in the following transcripts,
  the CDS total length inferred from the exon and UTR info
  doesn't match the "cds_length" attribute from BioMart.
  1. Transcript FBtr0084080:
       strand rank exon_chrom_start exon_chrom_end  ensembl_exon_id 5_utr_start 5_utr_end 3_utr_start 3_utr_end cds_length
     1     -1    1         17203010       17203121   FBgn0002781:30    17203010  17203121          NA        NA        887
     2     -1    2         17202541       17202798   FBgn0002781:29    17202749  17202798          NA        NA        887
     3     -1    3         17202324       17202463 FBgn0002781:28-A          NA        NA          NA        NA        887
     4     -1    4         17195184       17195967   FBgn0002781:39          NA        NA    17195184  17195428        887
     5     -1    5         17200782       17201634 FBgn0002781:27-B          NA        NA          NA        NA        887
  2. Transcript FBtr0084077:
       strand rank exon_chrom_start exon_chrom_end  ensembl_exon_id 5_utr_start 5_utr_end 3_utr_start 3_utr_end cds_length
     1     -1    3         17203010       17203121   FBgn0002781:30    17203010  17203121          NA        NA       -213
     2     -1    4         17202541       17202798   FBgn0002781:29    17202755  17202798          NA        NA       -213
     3     -1    1         17202324       17202463 FBgn0002781:28-B          NA        NA          NA        NA       -213
     4     -1    2         17177331       17177608    FBgn0002781:1          NA        NA    17177331  17177387       -213
     5     -1    5         17200782       17201634 FBgn0002781:27-A          NA        NA          NA        NA       -213
  3. Transcript FBtr0084082:
       strand rank exon_chrom_start exon_chrom_end  ensembl_exon_id 5_utr_start 5_utr_end 3_utr_start 3_utr_end cds_length
     1     -1    3         17203010       17203121   FBgn0002781:30    17203010  17203121          NA        NA       -466
     2     -1    4         17202541       17202798   FBgn0002781:29    17202749  17202798          NA        NA       -466
     3     -1    1         17202324       17202463 FBgn0002781:28-B          NA        NA          NA        NA       -466
     4     -1    5         17200782       17201634 FBgn0002781:27-A          NA        NA          NA        NA       -466
     5     -1    2         17193632       17193960   FBgn0002781:37          NA        NA    17193632  17193935       -466
  4. Transcript FBtr0084079:
       strand rank exon_chrom_start exon_chrom_end  ensembl_exon_id 5_utr_start 5_utr_end 3_utr_start 3_utr_end cds_length
     1     -1    1         17203010       17203121   FBgn0002781:30    17203010  17203121          NA        NA       1572
     2     -1    2         17202541       17202798   FBgn0002781:29    17202749  17202798          NA        NA       1572
     3     -1    3         17202324       17202463 FBgn0002781:28-A          NA        NA          NA        NA       1572
     4     -1    4         17200782       17201634 FBgn0002781:27-B          NA        NA          NA        NA       1572
     5     -1    5         17186112       17186276   FBgn0002781:31          NA        NA    17186112  17186276       1572
     6     -1    6         17186350       17187009   FBgn0002781:32          NA        NA    17186350  17186803       1572
  5. Transcript FBtr0084085:
       strand rank exon_chrom_start exon_chrom_end  ensembl_exon_id 5_utr_start 5_utr_end 3_utr_start 3_utr_end cds_length
     1     -1    1         17203010       17203121   FBgn0002781:30    17203010  17203121          NA        NA       1729
     2     -1    2         17202541       17202798   FBgn0002781:29    17202749  17202798          NA        NA       1729
     3     -1    3         17202324       17202463 FBgn0002781:28-A          NA        NA          NA        NA       1729
     4     -1    4         17200782       17201634 FBgn0002781:27-B          NA        NA          NA        NA       1729
     5     -1    5         17187120       17187332   FBgn0002781:33          NA        NA    17187120  17187332       1729
     6     -1    6         17187392       17187860   FBgn0002781:34          NA        NA    17187392  17187545       1729
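
The mismatch can be reproduced by hand from the coordinates in the report above. The sketch below does this for the first transcript, FBtr0084080, assuming that the inferred CDS length is the summed exon width minus the summed UTR width. It is an illustration only, not the check GenomicFeatures itself runs, and the data frame and helper names (tx, width_or_zero) are made up for the example.

```r
# Coordinates copied from the report for FBtr0084080 (minus strand).
tx <- data.frame(
  rank             = 1:5,
  exon_chrom_start = c(17203010, 17202541, 17202324, 17195184, 17200782),
  exon_chrom_end   = c(17203121, 17202798, 17202463, 17195967, 17201634),
  utr5_start       = c(17203010, 17202749, NA, NA, NA),
  utr5_end         = c(17203121, 17202798, NA, NA, NA),
  utr3_start       = c(NA, NA, NA, 17195184, NA),
  utr3_end         = c(NA, NA, NA, 17195428, NA)
)
reported_cds_length <- 887  # the "cds_length" attribute from BioMart

# Width of a closed interval [start, end]; NA (no UTR on that exon) counts as 0.
width_or_zero <- function(start, end) ifelse(is.na(start), 0, end - start + 1)

exon_width <- tx$exon_chrom_end - tx$exon_chrom_start + 1
utr_width  <- width_or_zero(tx$utr5_start, tx$utr5_end) +
              width_or_zero(tx$utr3_start, tx$utr3_end)

# Assumed definition: CDS length = total exon width - total UTR width.
inferred_cds_length <- sum(exon_width) - sum(utr_width)

print(inferred_cds_length)                        # 1740
print(inferred_cds_length - reported_cds_length)  # 853
print(exon_width[tx$rank == 5])                   # 853
```

For this transcript the inferred length (1740) exceeds the reported cds_length (887) by 853, which is exactly the width of the exon ranked 5. That exon also lies 5' of the rank-4 exon carrying the 3' UTR, which may be a hint about the cause, though the report alone does not prove it.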

# show off the txdb's metadata
> txdb
TranscriptDb object:
| Db type: TranscriptDb
| Supporting package: GenomicFeatures
| Data source: BioMart
| Genus and Species: Drosophila melanogaster
| Resource URL: www.biomart.org:80
| BioMart database: ensembl
| BioMart database version: ENSEMBL GENES 67 (SANGER UK)
| BioMart dataset: dmelanogaster_gene_ensembl
| BioMart dataset description: Drosophila melanogaster genes (BDGP5)
| BioMart dataset version: BDGP5
| Full dataset: yes
| miRBase build ID: NA
| transcript_nrow: 25415
| exon_nrow: 74818
| cds_nrow: 62601
| Db created by: GenomicFeatures package from Bioconductor
| Creation time: 2012-06-19 09:13:33 -0500 (Tue, 19 Jun 2012)
| GenomicFeatures version at creation time: 1.8.1
| RSQLite version at creation time: 0.11.1
| DBSCHEMAVERSION: 1.0



# show off details about the version of R and libraries used.
> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] GenomicFeatures_1.8.1 AnnotationDbi_1.18.1  Biobase_2.16.0
[4] GenomicRanges_1.8.6   IRanges_1.14.3        BiocGenerics_0.2.0
[7] BiocInstaller_1.4.6

loaded via a namespace (and not attached):
 [1] BSgenome_1.24.0    Biostrings_2.24.1  DBI_0.2-5          RCurl_1.91-1
 [5] RSQLite_0.11.1     Rsamtools_1.8.5    XML_3.9-4          biomaRt_2.12.0
 [9] bitops_1.0-4.1     rtracklayer_1.16.1 stats4_2.15.0      tools_2.15.0
[13] zlibbioc_1.2.0
> 




On 6/19/12 8:35 AM, "kmegy at ebi.ac.uk via RT" <helpdesk at ensemblgenomes.org>
wrote:

>Which species was this again? Drosophila?
>
>I fixed something about STOP codons for Droso., but it's probably not
>what he is talking about.
>
>
>On 19 Jun 2012, at 14:32, Dan Staines wrote:
>
>> I believe that Karyn fixed this but Dan L & co are probably in a better
>> position to comment.
>> 
>> On 06/19/2012 01:36 PM, Bert Overduin via RT wrote:
>>> Hi Dan,
>>> 
>>> Has this been fixed in EG14?
>>> 
>>> Cheers,
>>> Bert
>>> 
>>> On Sun, Apr 15, 2012 at 5:56 PM, Dan Staines via RT
>>> <helpdesk at ensemblgenomes.org>  wrote:
>>>> Hi Malcolm,
>>>> 
>>>> I've just asked for an update on this. Fixes that we've applied
>>>> recently do not unfortunately appear to fix the issue. However, we're
>>>> continuing to investigate how to fix this and are aiming for a fix for
>>>> EG14 in May.
>>>> 
>>>> Best,
>>>> 
>>>> Dan.
>>>> 
>>>> .
>>>> 
>>>> --
>>>> Ticket Details <URL: https://rt.sanger.ac.uk/SelfService/Display.html?id=251937>
>>>> 
>>>> 
>>>> --
>>>>  The Wellcome Trust Sanger Institute is operated by Genome Research
>>>>  Limited, a charity registered in England with number 1021457 and a
>>>>  company registered in England with number 2742969, whose registered
>>>>  office is 215 Euston Road, London, NW1 2BE.
>>> 
>>> 
>>> 
>> 
>> -- 
>> Dan Staines, PhD               Ensembl Genomes Technical Coordinator
>> EMBL-EBI                       Tel: +44-(0)1223-492507
>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>
>
>


