[BioC] biomaRt- incorrect number of transcripts

Ivanek, Robert robert.ivanek at fmi.ch
Wed Nov 11 11:03:15 CET 2009


Dear Steffen,

I used only unique transcripts from the start, so I do not think
duplicates are the reason for the different number of transcripts.
As you can see in the original mail below, the biomaRt package returns
an incorrect number of transcripts only for some chromosomes, and only
when the filter is set to all mouse chromosomes.
However, if you set the filter to, for example, just 3 chromosomes, the
number of transcripts is correct. The number of transcripts is also
correct if you do not set any filter.
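
To make the pattern explicit, here is a minimal sketch that compares the
per-chromosome counts quoted in the original mail below. The two vectors
simply re-enter the counts from the filtered getBM call and from the
unfiltered query; the variable names (chr, filtered, unfiltered,
affected, shortfall) are made up for illustration.

```r
# Chromosome labels in the order printed by table() in the original mail
chr <- c("1", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19",
         "2", "3", "4", "5", "6", "7", "8", "9", "MT", "X", "Y")

# Counts from getBM() with filters = "chromosome_name", values = chroms
filtered <- setNames(c(2507, 1869, 4364, 1501, 1630, 1624, 1404, 1522, 1865,
                       985, 1245, 5232, 1080, 1454, 845, 1209, 1487, 1129,
                       1031, 41, 2072, 17), chr)

# Counts from getBM() without a filter (identical to the web export)
unfiltered <- setNames(c(2507, 1869, 4364, 1501, 1630, 1624, 1404, 1522, 1865,
                         985, 1245, 5232, 2179, 3997, 2822, 2524, 3919, 2021,
                         2163, 41, 3297, 17), chr)

# Chromosomes whose counts disagree, and how many transcripts each one lacks
affected  <- names(filtered)[filtered != unfiltered]
shortfall <- unfiltered[affected] - filtered[affected]

print(affected)
print(shortfall)
print(sum(unfiltered) - sum(filtered))
```

This singles out chromosomes 3 to 9 and X, the same set noted in the
original mail; all other chromosomes agree exactly.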

Regards

Robert

-----Original Message-----
From: Steffen at stat.Berkeley.EDU [mailto:Steffen at stat.Berkeley.EDU] 
Sent: Tuesday, November 10, 2009 8:26 PM
To: Ivanek, Robert
Cc: bioconductor at stat.math.ethz.ch
Subject: Re: [BioC] biomaRt- incorrect number of transcripts

Dear Robert,

Would it be possible to check whether there are duplicates in the result
you obtain via the web? By default biomaRt retrieves only unique
results, whereas results queried over the web are sometimes duplicated.
To remove these you need to check the "unique only" checkbox when
exporting your results to a file. Can you let me know if that explains
the difference in the number of transcripts you notice?
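
A minimal sketch of such a duplicate check in R, using a small made-up
data frame in place of the real web export. The transcript IDs are
invented, and the column name Ensembl.Transcript.ID is an assumption
about how read.delim would name the transcript ID column; adjust it to
match the actual file.

```r
# Toy stand-in for the web export; the IDs and the column name
# Ensembl.Transcript.ID are hypothetical, for illustration only.
ens.web <- data.frame(
  Ensembl.Transcript.ID = c("T1", "T2", "T2", "T3"),
  Chromosome.Name       = c("3", "3", "3", "4"),
  stringsAsFactors      = FALSE
)

# Number of fully duplicated rows (0 means the export is already unique)
n.dup <- sum(duplicated(ens.web))

# Drop duplicated rows before counting transcripts per chromosome
ens.web.unique <- unique(ens.web)
counts <- table(ens.web.unique$Chromosome.Name)

print(n.dup)
print(counts)
```

If n.dup is 0 for the real export, duplicates cannot account for the
difference in counts.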

Cheers,
Steffen

> Dear mailing list,
>
> I have recently observed a discrepancy in the genome annotation
> obtained via the R package biomaRt.
> I wanted to download all Ensembl transcripts from the entire mouse
> genome (chromosomes 1:19, X, Y and MT only).
>
> When I set the filter based on chromosome names I retrieved ~36000
> transcripts; please see the code below.
> However, by using the web service www.biomart.org I received ~48000
> transcripts for the same genome version and chromosomes.
>
> By comparing these two data frames you can see that the
> discrepancies in the number of transcripts occur only for some
> chromosomes (3:9 and X).
> If I specify only a few chromosome names (2, 3 and MT in the code
> below), then the number of downloaded transcripts is correct for each
> of them.
> If I do not set any filter in the getBM function and do the filtering
> manually in R, the number of transcripts is also correct.
>
> Session info is attached.
>
> Best Regards
> Robert
>
> --
> Robert Ivanek
> Postdoctoral Fellow Schuebeler Group
> Friedrich Miescher Institute
> Maulbeerstrasse 66
> 4058 Basel / Switzerland
> Office phone: +41 61 697 6100
>
>
> R> library("biomaRt")
> R> ensembl <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")
> R> chroms <- c(1:19, "X", "Y", "MT")
> R> table(getBM(attributes = c("ensembl_transcript_id",
> "chromosome_name", "strand", "transcript_start", "transcript_end"),
> filters = "chromosome_name", values = chroms, mart =
> ensembl)$chromosome_name)
>
>    1   10   11   12   13   14   15   16   17   18   19    2    3    4
> 2507 1869 4364 1501 1630 1624 1404 1522 1865  985 1245 5232 1080 1454
>
>    5    6    7    8    9   MT    X    Y
>  845 1209 1487 1129 1031   41 2072   17
>
> R> ens.web <- read.delim("../../../mart_export.txt",
>                          stringsAsFactors = FALSE)
> R> ens.web <- ens.web[ens.web$Chromosome.Name %in% chroms,]
> R> table(ens.web$Chromosome.Name)
>
>    1   10   11   12   13   14   15   16   17   18   19    2    3    4
> 2507 1869 4364 1501 1630 1624 1404 1522 1865  985 1245 5232 2179 3997
>
>    5    6    7    8    9   MT    X    Y
> 2822 2524 3919 2021 2163   41 3297   17
>
> R> table(getBM(attributes = c("ensembl_transcript_id",
> "chromosome_name", "strand", "transcript_start", "transcript_end"), 
> filters = "chromosome_name", values = c("2","3","MT"), mart =
> ensembl)$chromosome_name)
>
>    2    3   MT
> 5232 2179   41
>
>
> R> ens.r <- getBM(attributes = c("ensembl_transcript_id",
> "chromosome_name", "strand", "transcript_start", "transcript_end"), 
> mart = ensembl)
> R> ens.r <- ens.r[ens.r$chromosome_name %in% chroms,]
> R> table(ens.r$chromosome_name)
>
>    1   10   11   12   13   14   15   16   17   18   19    2    3    4
> 2507 1869 4364 1501 1630 1624 1404 1522 1865  985 1245 5232 2179 3997
>
>    5    6    7    8    9   MT    X    Y
> 2822 2524 3919 2021 2163   41 3297   17
>
>
>
> R> sessionInfo()
> R version 2.10.0 (2009-10-26)
> x86_64-unknown-linux-gnu
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
>
> other attached packages:
> [1] biomaRt_2.2.0
>
> loaded via a namespace (and not attached):
> [1] RCurl_1.3-0  tools_2.10.0 XML_2.6-0
>


