[BioC] Problem with filtering genes in biomaRt

Steve Lianoglou lianoglou.steve at gene.com
Fri Feb 7 19:20:25 CET 2014


Hi,

In addition to Thomas' advice below, you might find it helpful to use
some of the offline resources Bioconductor makes available for these
purposes.

For instance, it looks like you want to retrieve all of the genes (and
their genomic coordinates) for human.

The GenomicFeatures package can compile this information for you from
different online resources (ie. ucsc or biomart) and will construct an
object that you can query offline.

There is already a "transcript database" package constructed for you
using UCSC gene models:

http://bioconductor.org/packages/release/data/annotation/html/TxDb.Hsapiens.UCSC.hg19.knownGene.html

But you could also build a custom transcript db package using ensembl
transcripts alone. Reading through the documentation (vignettes)
available here should get you started:

http://bioconductor.org/packages/release/bioc/html/GenomicFeatures.html

HTH,
-steve

On Fri, Feb 7, 2014 at 2:16 AM, Thomas Maurel <maurel at ebi.ac.uk> wrote:
> Dear Pieta,
>
> As you have noticed in your queries, there is a danger of getting truncated results with biomart when querying a big organisms such as human without any filters. To be sure that you get all the genes back I would advise you to filter on each chromosomes individually.
>
> Hope this helps,
> Thomas
> On 4 Feb 2014, at 17:45, "Pieta Schofield [guest]" <guest at bioconductor.org> wrote:
>
>>
>> Hullo
>>
>> I am getting some inconsistent results with my attempts to filter genes by chromosome when retrieving them form "hsapiens_gene_ensembl"
>>
>> if I specify a list of chromosomes to retrieve from I get a different number for some of the chromosomes than if i retrieve all the genes for all chromosomes or if I retrieve the genes from the chromosomes individually.
>>
>> this is my full code of the problem
>>
>> thanks in advance
>>
>> #!/usr/bin/Rscript --vanilla
>> require(biomaRt)
>>
>> .getGenes <- function(chrs=list())
>> { # use biomart to get ranges for feature
>>  hg19 <- useMart("ensembl",dataset="hsapiens_gene_ensembl")
>>  if(length(chrs)>0)
>>  {
>>    getBM(attributes=c("chromosome_name",
>>                              "start_position",
>>                              "end_position",
>>                              "strand",
>>                              "ensembl_gene_id",
>>                              "external_gene_id",
>>                              "gene_biotype"),
>>          filter="chromosome_name",
>>          values=chrs,
>>          mart=hg19)
>>  }else{
>>    getBM(attributes=c("chromosome_name",
>>                              "start_position",
>>                              "end_position",
>>                              "strand",
>>                              "ensembl_gene_id",
>>                              "external_gene_id",
>>                              "gene_biotype"),
>>          mart=hg19)
>>  }
>> }
>>
>> unfiltered<-.getGenes()
>>
>> chrs<-c(unlist(seq(1,22)),"X","Y","MT")
>> filtered<-.getGenes(chrs)
>>
>> chronly<-lapply(chrs,
>>            function(x){
>>              length(.getGenes(x)$ensembl_gene_id)
>>            })
>> names(chronly)<-chrs
>>
>> allgenes<-table(unfiltered$chromosome_name)
>> missinggenes<-table(unfiltered$chromosome_name[which(
>>                                  !(unfiltered$ensembl_gene_id
>>                                                   %in%
>>                                    filtered$ensembl_gene_id))])
>>
>> unlist(lapply(chrs,
>>  function(x){
>>    totalmissing<-ifelse(is.na(missinggenes[x]),0,missinggenes[x])
>>    paste("Missing genes on chr",x,totalmissing,"of",
>>          "individually",chronly[[x]],"all",allgenes[x])
>>  }))
>>
>> sessionInfo()
>>
>>
>>
>>
>>
>>
>> -- output of sessionInfo():
>>
>> Loading required package: biomaRt
>> Loading required package: methods
>> [1] "Missing genes on chr 1 0 of individually 5321 all 5321"
>> [2] "Missing genes on chr 2 0 of individually 3990 all 3990"
>> [3] "Missing genes on chr 3 0 of individually 3043 all 3043"
>> [4] "Missing genes on chr 4 0 of individually 2521 all 2521"
>> [5] "Missing genes on chr 5 0 of individually 2856 all 2856"
>> [6] "Missing genes on chr 6 0 of individually 2906 all 2906"
>> [7] "Missing genes on chr 7 0 of individually 2818 all 2818"
>> [8] "Missing genes on chr 8 607 of individually 2385 all 2385"
>> [9] "Missing genes on chr 9 1892 of individually 2303 all 2303"
>> [10] "Missing genes on chr 10 0 of individually 2216 all 2216"
>> [11] "Missing genes on chr 11 0 of individually 3190 all 3190"
>> [12] "Missing genes on chr 12 0 of individually 2819 all 2819"
>> [13] "Missing genes on chr 13 0 of individually 1217 all 1217"
>> [14] "Missing genes on chr 14 0 of individually 2237 all 2237"
>> [15] "Missing genes on chr 15 0 of individually 2076 all 2076"
>> [16] "Missing genes on chr 16 0 of individually 2360 all 2360"
>> [17] "Missing genes on chr 17 0 of individually 2901 all 2901"
>> [18] "Missing genes on chr 18 0 of individually 1113 all 1113"
>> [19] "Missing genes on chr 19 0 of individually 2917 all 2917"
>> [20] "Missing genes on chr 20 0 of individually 1322 all 1322"
>> [21] "Missing genes on chr 21 0 of individually 720 all 720"
>> [22] "Missing genes on chr 22 0 of individually 1208 all 1208"
>> [23] "Missing genes on chr X 2028 of individually 2414 all 2414"
>> [24] "Missing genes on chr Y 506 of individually 506 all 506"
>> [25] "Missing genes on chr MT 37 of individually 37 all 37"
>> R version 3.0.2 (2013-09-25)
>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>>
>> locale:
>> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
>>
>> attached base packages:
>> [1] methods   stats     graphics  grDevices utils     datasets  base
>>
>> other attached packages:
>> [1] biomaRt_2.18.0
>>
>> loaded via a namespace (and not attached):
>> [1] RCurl_1.95-4.1 XML_3.95-0.2
>>
>> --
>> Sent via the guest posting facility at bioconductor.org.
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> --
> Thomas Maurel
> Bioinformatician - Ensembl Production Team
> European Bioinformatics Institute (EMBL-EBI)
> European Molecular Biology Laboratory
> Wellcome Trust Genome Campus
> Hinxton
> Cambridge CB10 1SD
> United Kingdom
>
>
>         [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor



-- 
Steve Lianoglou
Computational Biologist
Genentech



More information about the Bioconductor mailing list