[BioC] Problem with filtering genes in biomaRt

Pieta Schofield [guest] guest at bioconductor.org
Tue Feb 4 18:45:16 CET 2014


Hullo

I am getting inconsistent results when I filter genes by chromosome while retrieving them from "hsapiens_gene_ensembl".

If I specify a list of chromosomes to retrieve from, I get a different number of genes for some of the chromosomes than if I retrieve all the genes for all chromosomes, or if I retrieve the genes from each chromosome individually.

This is the full code that reproduces the problem.

Thanks in advance

#!/usr/bin/Rscript --vanilla
require(biomaRt)

# Use biomaRt to get gene ranges, optionally restricted to the chromosomes
# in chrs. With no chromosomes given, all genes are returned.
.getGenes <- function(chrs=list())
{
  hg19 <- useMart("ensembl",dataset="hsapiens_gene_ensembl")
  attrs <- c("chromosome_name",
             "start_position",
             "end_position",
             "strand",
             "ensembl_gene_id",
             "external_gene_id",
             "gene_biotype")
  if(length(chrs)>0)
  {
    getBM(attributes=attrs,
          filters="chromosome_name",
          values=chrs,
          mart=hg19)
  }else{
    getBM(attributes=attrs,
          mart=hg19)
  }
}

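# All genes, with no chromosome filter
unfiltered<-.getGenes()

# Genes from a single query filtered on the whole list of chromosomes
chrs<-c(1:22,"X","Y","MT")
filtered<-.getGenes(chrs)

# Gene counts from one query per chromosome
chronly<-lapply(chrs,
            function(x){
              length(.getGenes(x)$ensembl_gene_id)
            })
names(chronly)<-chrs

# Per-chromosome counts of all genes, and of the genes that are in the
# unfiltered result but absent from the filtered one
allgenes<-table(unfiltered$chromosome_name)
missinggenes<-table(unfiltered$chromosome_name[
                                  !(unfiltered$ensembl_gene_id
                                                   %in%
                                    filtered$ensembl_gene_id)])

# One summary line per chromosome
unlist(lapply(chrs,
  function(x){
    totalmissing<-ifelse(is.na(missinggenes[x]),0,missinggenes[x])
    paste("Missing genes on chr",x,totalmissing,"of",
          "individually",chronly[[x]],"all",allgenes[x])
  }))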

sessionInfo()

 -- output of sessionInfo(): 

Loading required package: biomaRt
Loading required package: methods
 [1] "Missing genes on chr 1 0 of individually 5321 all 5321"   
 [2] "Missing genes on chr 2 0 of individually 3990 all 3990"   
 [3] "Missing genes on chr 3 0 of individually 3043 all 3043"   
 [4] "Missing genes on chr 4 0 of individually 2521 all 2521"   
 [5] "Missing genes on chr 5 0 of individually 2856 all 2856"   
 [6] "Missing genes on chr 6 0 of individually 2906 all 2906"   
 [7] "Missing genes on chr 7 0 of individually 2818 all 2818"   
 [8] "Missing genes on chr 8 607 of individually 2385 all 2385" 
 [9] "Missing genes on chr 9 1892 of individually 2303 all 2303"
[10] "Missing genes on chr 10 0 of individually 2216 all 2216"  
[11] "Missing genes on chr 11 0 of individually 3190 all 3190"  
[12] "Missing genes on chr 12 0 of individually 2819 all 2819"  
[13] "Missing genes on chr 13 0 of individually 1217 all 1217"  
[14] "Missing genes on chr 14 0 of individually 2237 all 2237"  
[15] "Missing genes on chr 15 0 of individually 2076 all 2076"  
[16] "Missing genes on chr 16 0 of individually 2360 all 2360"  
[17] "Missing genes on chr 17 0 of individually 2901 all 2901"  
[18] "Missing genes on chr 18 0 of individually 1113 all 1113"  
[19] "Missing genes on chr 19 0 of individually 2917 all 2917"  
[20] "Missing genes on chr 20 0 of individually 1322 all 1322"  
[21] "Missing genes on chr 21 0 of individually 720 all 720"    
[22] "Missing genes on chr 22 0 of individually 1208 all 1208"  
[23] "Missing genes on chr X 2028 of individually 2414 all 2414"
[24] "Missing genes on chr Y 506 of individually 506 all 506"   
[25] "Missing genes on chr MT 37 of individually 37 all 37"     
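
To put numbers on the lines printed above, here is a small sketch that tallies the affected chromosomes. The counts are copied by hand from that output (every other chromosome reports 0 missing), and the variable names are only for illustration.

```r
# Counts copied from the script output above for the affected chromosomes.
missing <- c("8" = 607, "9" = 1892, "X" = 2028, "Y" = 506, "MT" = 37)
total   <- c("8" = 2385, "9" = 2303, "X" = 2414, "Y" = 506, "MT" = 37)

# Genes absent from the multi-chromosome query, summed over chromosomes.
total_missing <- sum(missing)

# Fraction of each chromosome's genes that the filtered query did not return.
fraction_missing <- round(missing / total, 3)

print(total_missing)      # 5070
print(fraction_missing)
```

So 5070 genes are absent from the filtered result, including every gene on Y and MT.
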
R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] methods   stats     graphics  grDevices utils     datasets  base     

other attached packages:
[1] biomaRt_2.18.0

loaded via a namespace (and not attached):
[1] RCurl_1.95-4.1 XML_3.95-0.2  
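
Since the per-chromosome counts in the output match the unfiltered ones, one possible stopgap might be to run one query per chromosome and stack the results. The sketch below is only that: `get_genes_per_chr` and `fake_fetch` are hypothetical names, and the stand-in fetch function uses made-up gene IDs so that it runs without a BioMart connection.

```r
# Hypothetical helper: run one query per chromosome and stack the results.
# `fetch` is any function that takes a single chromosome name and returns a
# data frame, for example the .getGenes() function from the script above.
get_genes_per_chr <- function(chrs, fetch) {
  pieces <- lapply(as.character(chrs), fetch)
  do.call(rbind, pieces)
}

# Stand-in for a real query so the sketch runs offline.
# The gene IDs below are made up.
fake_fetch <- function(chr) {
  data.frame(chromosome_name = chr,
             ensembl_gene_id = paste0("GENE_", chr, "_", 1:2),
             stringsAsFactors = FALSE)
}

combined <- get_genes_per_chr(c(8, 9, "X"), fake_fetch)
print(table(combined$chromosome_name))
```

With the real function this would be called as `get_genes_per_chr(chrs, .getGenes)`. I have not verified that this gives the complete set, and it does not explain why the single filtered query comes back short.
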

--
Sent via the guest posting facility at bioconductor.org.


