[BioC] biomaRt- incorrect number of transcripts

Steffen at stat.Berkeley.EDU Steffen at stat.Berkeley.EDU
Tue Nov 10 20:25:59 CET 2009


Dear Robert,

Could you check whether the result you obtained via the web contains
duplicates?  By default biomaRt retrieves only unique results, whereas a
query over the web can return duplicated rows.  To remove these you need
to check the "unique only" checkbox when exporting your results to a
file.  Can you let me know if that explains the difference in the number
of transcripts you see?
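
As a rough way to run that check in R, something along the lines of the
sketch below might help.  It uses a small made-up data frame in place of
your real mart_export.txt; the column name Ensembl.Transcript.ID is my
assumption about how read.delim would name the web export header, and the
IDs and counts are invented purely for illustration.

```r
# Toy stand-in for a web export; the rows and IDs below are made up.
# Ensembl.Transcript.ID is an assumed column name, not taken from a real file.
ens.web <- data.frame(
  Ensembl.Transcript.ID = c("T1", "T2", "T2", "T3", "T3", "T3"),
  Chromosome.Name       = c("2",  "3",  "3",  "4",  "4",  "4"),
  stringsAsFactors = FALSE
)

# Number of rows that are exact repeats of an earlier row
n.dup <- sum(duplicated(ens.web))

# Per-chromosome counts before and after dropping repeated rows
counts.raw    <- table(ens.web$Chromosome.Name)
counts.unique <- table(unique(ens.web)$Chromosome.Name)

# Number of distinct transcript IDs, regardless of the other columns
n.ids <- length(unique(ens.web$Ensembl.Transcript.ID))

print(n.dup)
print(counts.raw)
print(counts.unique)
print(n.ids)
```

If n.dup is zero on the real export, duplicated rows cannot account for
the difference and the cause lies elsewhere.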

Cheers,
Steffen

> Dear mailing list,
>
> I have recently observed a discrepancy in the genome annotation obtained
> via the R package biomaRt.
> I wanted to download all Ensembl transcripts for the entire mouse
> genome (chromosomes 1:19, X, Y and MT only).
>
> When I set the filter based on chromosome names I retrieved ~36000
> transcripts; please see the code below.
> However, by using the web service www.biomart.org I received ~48000
> transcripts for the same genome version and chromosomes.
>
> Comparing these two data frames shows that the discrepancies in the
> number of transcripts occur only for some chromosomes (3:9 and X).
> If I specify only two chromosome names (2 and 3), then the number of
> downloaded transcripts is correct for both of them.
> If I do not set any filter in the getBM function and do the filtering
> manually in R, the number of transcripts is also correct.
>
> Session info is attached.
>
> Best Regards
> Robert
>
> --
> Robert Ivanek
> Postdoctoral Fellow Schuebeler Group
> Friedrich Miescher Institute
> Maulbeerstrasse 66
> 4058 Basel / Switzerland
> Office phone: +41 61 697 6100
>
>
> R> library("biomaRt")
> R> ensembl <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")
> R> chroms <- c(1:19,"X","Y","MT")
> R> table(getBM(attributes = c("ensembl_transcript_id",
>        "chromosome_name", "strand", "transcript_start", "transcript_end"),
>      filters = "chromosome_name", values = chroms,
>      mart = ensembl)$chromosome_name)
>
>    1   10   11   12   13   14   15   16   17   18   19    2    3    4
> 2507 1869 4364 1501 1630 1624 1404 1522 1865  985 1245 5232 1080 1454
>    5    6    7    8    9   MT    X    Y
>  845 1209 1487 1129 1031   41 2072   17
>
> R> ens.web <- read.delim("../../../mart_export.txt", stringsAsFactors = FALSE)
> R> ens.web <- ens.web[ens.web$Chromosome.Name %in% chroms,]
> R> table(ens.web$Chromosome.Name)
>
>    1   10   11   12   13   14   15   16   17   18   19    2    3    4
> 2507 1869 4364 1501 1630 1624 1404 1522 1865  985 1245 5232 2179 3997
>    5    6    7    8    9   MT    X    Y
> 2822 2524 3919 2021 2163   41 3297   17
>
> R> table(getBM(attributes = c("ensembl_transcript_id",
>        "chromosome_name", "strand", "transcript_start", "transcript_end"),
>      filters = "chromosome_name", values = c("2","3","MT"),
>      mart = ensembl)$chromosome_name)
>
>    2    3   MT
> 5232 2179   41
>
>
> R> ens.r <- getBM(attributes = c("ensembl_transcript_id",
>        "chromosome_name", "strand", "transcript_start", "transcript_end"),
>      mart = ensembl)
> R> ens.r <- ens.r[ens.r$chromosome_name %in% chroms,]
> R> table(ens.r$chromosome_name)
>
>    1   10   11   12   13   14   15   16   17   18   19    2    3    4
> 2507 1869 4364 1501 1630 1624 1404 1522 1865  985 1245 5232 2179 3997
>    5    6    7    8    9   MT    X    Y
> 2822 2524 3919 2021 2163   41 3297   17
>
>
>
> R> sessionInfo()
> R version 2.10.0 (2009-10-26)
> x86_64-unknown-linux-gnu
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
>
> other attached packages:
> [1] biomaRt_2.2.0
>
> loaded via a namespace (and not attached):
> [1] RCurl_1.3-0  tools_2.10.0 XML_2.6-0
>


