[BioC] bioMart GO inconsistency, normal?

R Tagett rtt at wayne.edu
Fri Aug 9 01:30:29 CEST 2013


Hello,
I am a graduate student at Wayne State University in Detroit. I am running BioMart to collect GO terms for lists of genes, and I noticed that some genes are annotated with top-level nodes (e.g. "biological_process") while others are not. I wonder if anyone can tell me why.

Example code and my sessionInfo are below. In this example, I collect all approved human HUGO gene symbols using the HGNChelper package. I then use BioMart to get the GO terms for these genes and keep only the "biological_process" (BP) annotations.
There are 15368 unique genes that have BP annotations (uniqGenesInGO).
Then I split these genes into those whose BP annotations include "GO:0008150" (the "biological_process" term itself) and those whose annotations do not.
596 genes are annotated with "GO:0008150", and 14772 are not. This is inconsistent!

Thanks for your help,
Becky


library("biomaRt")
library(HGNChelper)

ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

data(hgnc.table)  # gives names of all approved symbols
allHgnc <- unique(hgnc.table[, 2])
allGOhgnc <- getBM(attributes = c("go_id", "go_linkage_type", "entrezgene",
                                  "hgnc_symbol", "namespace_1003"),
                   filters = "hgnc_symbol", values = allHgnc, mart = ensembl)
# load("allGOhgnc.RData")     266749

BP <- allGOhgnc[which(allGOhgnc$namespace_1003 == "biological_process"), ]  # 126789 human BP terms in GO
BP <- BP[!duplicated(BP), ]              # 108019
uniqGenesInGO <- unique(BP$hgnc_symbol)  # 15368

# "GO:0008150"  is "biological_process"

hasBPtab <- BP[which(BP$go_id == "GO:0008150"), ]
hasBP <- unique(hasBPtab$hgnc_symbol)
length(hasBP)  # 596

# logical negation rather than -which(): -which() would drop every row if hasBP were empty
noBPtab <- BP[!(BP$hgnc_symbol %in% hasBP), ]
length(unique(noBPtab$hgnc_symbol))  # 14772

# 14772 + 596 = 15368
# why are some genes annotated with the top node and others are not??
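
In case it helps anyone reproduce the split without querying BioMart, here is a minimal offline sketch of the same bookkeeping on a toy table. The gene symbols, GO assignments and evidence codes in toyBP are made up for illustration and are not BioMart output; only the column names match the query above. It also tabulates go_linkage_type for the rows carrying the top node, which might be a useful thing to look at on the real table.

```r
# Toy stand-in for the BP table; all values below are hypothetical.
toyBP <- data.frame(
  hgnc_symbol     = c("GENE1", "GENE1", "GENE2", "GENE3", "GENE3"),
  go_id           = c("GO:0008150", "GO:0006915", "GO:0006915",
                      "GO:0008150", "GO:0008150"),
  go_linkage_type = c("ND", "IEA", "IDA", "ND", "ND"),
  stringsAsFactors = FALSE
)
rootId <- "GO:0008150"  # "biological_process"

# Per gene: is the top node among its annotations, and is it the only one?
isRoot   <- toyBP$go_id == rootId
hasRoot  <- tapply(isRoot, toyBP$hgnc_symbol, any)
onlyRoot <- tapply(isRoot, toyBP$hgnc_symbol, all)

# Evidence codes attached to the top-node rows
rootEvidence <- table(toyBP$go_linkage_type[isRoot])

# The two groups should partition the set of genes
nGenes <- length(unique(toyBP$hgnc_symbol))
stopifnot(sum(hasRoot) + sum(!hasRoot) == nGenes)
```

On the real data, replacing toyBP with BP should give the same 596 / 14772 split from hasRoot, and rootEvidence would show which evidence codes accompany the top-node annotations.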


> sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] biomaRt_2.16.0

loaded via a namespace (and not attached):
[1] RCurl_1.95-4.1 XML_3.96-1.1


