[BioC] grep GOID, R problem or my problem?

fhong@salk.edu fhong at salk.edu
Wed Nov 9 20:26:28 CET 2005


Hi list,
Anyone tried to "grep" a target GOID from a list of GOIDs? I found that it
made random mistakes, meaning it might grep a GOID which is not the target
one by random.

For example: ##AffyID.Ath1 is the affyid of all genes on ATH1 array
GO.Ath1=mget(AffyID.Ath1,ath1121501GO)
GOID.Ath1=sapply(GO.Ath1, function(x) {
    onts=sapply(x, function(z) z$GOID)
    })

GOID.finder=sapply(GOID.Ath1,function(x) {
     grep("GO:0000127",unlist(x),value=TRUE)
  })

The problem is, even when the nodes of GOID.Ath1 ( a list) doesn't contain
the target GOID, the grep will return a hit sometimes. The frequency is
very low, but it is there, and the returned hit are changing everytime I
re-run it, it looks like grep is making mistakes randomly.

I am listing the deatils below, I am sorry if the question is too long and
I appreciate your help.

I am using R2.1.0patched, and I am working wiht affy ATH1 array.


##########
library(ath1121501)
library(GO)
library(annotate)


#######
#Get all GOID whose GO term contains a word "transcription"
###########
##function
GOTerm2Tag=function(term)
{
  GTL=eapply(GOTERM,function(x) {
     grep(term,x at Term,value=TRUE)
  })
  Glen=sapply(GTL,length)
  names(GTL[Glen>0])
}

##usage
GO.trans=GOTerm2Tag("transcription")


################
#Get GOID for all genes on ATH1 array
#############
##AffID.ATH1 is all affyID for all genes on ATH1 array
GO.Ath1=mget(AffyID.Ath1,ath1121501GO)
GOID.Ath1=sapply(GO.Ath1, function(x) {
    onts=sapply(x, function(z) z$GOID)
    })

############
#Find all genes that contain GOID whose GO term contians "transcrption"
###############
##function: Search for genes that contain the target GOID
GOID2Gene=function(GOID,GOID.Ath1)
{
   GOID.finder=sapply(GOID.Ath1,function(x) {
     grep(GOID,unlist(x),value=TRUE)
  })
  GOID.finder.len=sapply(GOID.finder,length)
  temp=which(GOID.finder.len>0)
  names(temp)=names(which(GOID.finder.len>0))
  temp

}

#Function: check the target Goterm for a list of genes
Gene2GO=function(AffyID.input, AffyID.Ath1, GOID.Ath1,GOID.target)
{
   ##AffyID.input: AffyID of the genes that needs to be checked
   ##AffyID.Ath1: AffyID of all genes on Ath1 array
   ##GOID.Ath1:  GoID for all genes on Ath1
   ##GOID.target: GOID of target category. e.g. GO.trans

   index.input=pmatch(AffyID.input,AffyID.Ath1,nomatch=NA)
   ##index of input gene onthe array
   GOID.input=GOID.Ath1[index.input]  ##GOID for each of the input genes
   GOID.hit=sapply(GOID.input,function(x) {
          intersect(unlist(x),GOID.target)
         })
   num.hit=sapply(GOID.hit,length)
   GOinfo.hit=mget(unlist(GOID.hit),GOTERM)
   Goterm.hit=sapply(GOinfo.hit,function (x) Term(x))

   list(num.hit=num.hit,Goterm.hit=Goterm.hit)
}

##usage
len.trans=length(GO.trans)
record=rep(NA,1,len.trans)
index.trans.rec=c()
for ( i in 1:len.trans)
{
  temp.out=GOID2Gene(GO.trans[i],GOID.Ath1)
  temp.names=names(temp.out)
  index.trans.rec=c(index.trans.rec,temp.out)

  if(length(temp.out)>0){
      temp=Gene2GO(temp.names, AffyID.Ath1, GOID.Ath1,GO.trans)
     if( length(which(temp$num.hit==0))!=0)
         {record[i]=which(temp$num.hit==0)
           print(i) }
   }



  rm(temp.out,temp.names,temp)
}


So the last step will always tell that one or two genes that are
identified to have GOID for "transcription", actually don't have any GOID
for "transcription". And these one or two genes are changing every time I
re-run the program, and the number is chaning from 0 to 3. It looks like
that it make random mistakes by grepping different wrong thing at
different time.

I really appreciate if anyone can give me any hint!!
Thank you very much.

Fangxin







--------------------
Fangxin Hong  Ph.D.
Plant Biology Laboratory
The Salk Institute
10010 N. Torrey Pines Rd.
La Jolla, CA 92037
E-mail: fhong at salk.edu
(Phone): 858-453-4100 ext 1105



More information about the Bioconductor mailing list