[BioC] biomart to a data.frame

Steve Lianoglou mailinglist.honeypot at gmail.com
Wed Jan 25 16:02:30 CET 2012


Hi Assa,

Sorry for top posting.

Your intuition is correct: you should not be querying biomart
inside a for loop. The idea is to create one query for all of your
protein IDs, and run it once.

This is how you might go about it. First, let's look at the protein
IDs you already seem to have somewhere:

> 45  FBpp0070037
> 46  FBpp0070039;FBpp0070040
> 47  FBpp0070041;FBpp0070042;FBpp0070043
> 48  FBpp0070044;FBpp0110571

It seems you have multiple IDs jammed into one column of a data.frame.
The rows which have more than one ID (e.g.
"FBpp0070039;FBpp0070040") will have to be split up so that each row
(or element in a vector) only has one ID. Look into using `strsplit`.

You will need a character vector of protein IDs, one protein
per element. It might look like so:

pids <- c('FBpp0070037', 'FBpp0070039', 'FBpp0070040', 'FBpp0070041',
          'FBpp0070042', 'FBpp0070043')
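
One way to build such a vector from your original column is sketched below. The data.frame `dat` and its column names `row` and `protein_ids` are made up for illustration (I don't know what your real table looks like), and the "REV__" stripping is copied from your loop. Here `pids` also picks up the two IDs from row 48.

```r
# Hypothetical stand-in for your table; the column names are assumptions.
dat <- data.frame(row = 45:48,
                  protein_ids = c("FBpp0070037",
                                  "FBpp0070039;FBpp0070040",
                                  "FBpp0070041;FBpp0070042;FBpp0070043",
                                  "FBpp0070044;FBpp0110571"),
                  stringsAsFactors = FALSE)

# strsplit returns a list with one character vector per row
id.list <- strsplit(dat$protein_ids, ";", fixed = TRUE)

# Remove the "REV__" tag, as in your loop
id.list <- lapply(id.list, function(x) gsub("REV__", "", x, fixed = TRUE))

# Flatten to a single vector and drop duplicates, so each ID is queried once
pids <- unique(unlist(id.list))
```

Keeping `id.list` around is handy later, since it remembers which IDs came from which row.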

Now ... you're basically done. Let's rig up an object to query biomart with:

library(biomaRt)
mart <- useMart('ensembl', dataset='dmelanogaster_gene_ensembl')
ans <- getBM(attributes=c("flybase_translation_id","flybase_gene_id","flybasename_gene"),
                     filters="flybase_translation_id", values=pids, mart=mart)

Your answer will look like so:

  flybase_translation_id flybase_gene_id flybasename_gene
1            FBpp0070037     FBgn0010215        alpha-Cat
2            FBpp0070039     FBgn0052230          CG32230
3            FBpp0070040     FBgn0052230          CG32230
4            FBpp0070041     FBgn0000258        CkIIalpha
5            FBpp0070042     FBgn0000258        CkIIalpha
6            FBpp0070043     FBgn0000258        CkIIalpha

Now you're left with figuring out what to do with multiple
"flybase_translation_id"s that map to the same "flybase_gene_id".

You would have to do this anyway, but the key point here is that you
can now do it without querying biomart in a loop.
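
If it helps, here is a rough sketch of one way to get from the query result back to one line per original row. It is only one option, not the only sensible one. To keep it runnable offline, `ans` is rebuilt by hand from the output shown above, and `protein_ids` is a hypothetical stand-in for your original column.

```r
# Stand-in for the getBM() result shown above, so the sketch runs offline
ans <- data.frame(
  flybase_translation_id = c("FBpp0070037", "FBpp0070039", "FBpp0070040",
                             "FBpp0070041", "FBpp0070042", "FBpp0070043"),
  flybase_gene_id = c("FBgn0010215", "FBgn0052230", "FBgn0052230",
                      "FBgn0000258", "FBgn0000258", "FBgn0000258"),
  stringsAsFactors = FALSE)

# Hypothetical original column: one string of ';'-separated protein IDs per row
protein_ids <- c("FBpp0070037",
                 "FBpp0070039;FBpp0070040",
                 "FBpp0070041;FBpp0070042;FBpp0070043")

# Named lookup vector: protein ID -> gene ID
pp2gn <- setNames(ans$flybase_gene_id, ans$flybase_translation_id)

# For each row, look up every protein ID and keep each gene only once;
# IDs that biomart did not return come back as NA and are dropped
gene_ids <- vapply(strsplit(protein_ids, ";", fixed = TRUE), function(ids) {
  genes <- unique(pp2gn[ids])
  paste(genes[!is.na(genes)], collapse = ";")
}, character(1))

result <- data.frame(protein_ids, gene_ids, stringsAsFactors = FALSE)
```

Drop the `unique()` call if you would rather have one gene ID per protein ID, as in your example table. Either way you can then write `result` out with a single `write.table` call instead of appending line by line.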

HTH,
-steve



> For each of these protein IDs (FBpp...), I would like to extract the gene
> ID (FBgn...) into a third column. The output table should look like this:
>
> 45  FBpp0070037                          FBgn001234
> 46  FBpp0070039;FBpp0070040              FBgn00094432;FBgn002345
> 47  FBpp0070041;FBpp0070042;FBpp0070043  FBgn0001936;FBgn000102;FBgn004527
> 48  FBpp0070044;FBpp0110571              FBgn0097234;FBgn00183
> ...
>
> I was thinking of using biomaRt, but I could not find a way of automating
> it for all of the protein IDs in a line.
>
> What I have done so far is this for loop:
>
> for(i in 1:dim(data)[1]){
>  temp=unlist(strsplit(data[i,2],";"))
>  temp= gsub("REV__", "", temp)
>  result=getBM(attributes=c("flybase_translation_id","flybase_gene_id","flybasename_gene"),
>               filters="flybase_translation_id",values=temp, mart=mart)
>      charresult =""
>      for (j in 1:length(result[[1]])) {
> #          charresult<-paste(charresult,">", result[[1]][j],":",result[[2]][j], "\t", sep="")
>          charresult<-paste(charresult, result[[2]][j], ";", sep="")
>          }
>      out<-"CompleteResults.txt"
>      cat("line: ", i-1,"\t", "was written to ");cat(out);cat("\n")
>      write.table(paste(i-1, charresult, sep="\t"),out, sep="\t", quote=F,
>                  col.names=F, row.names=F,append=T)
>    }
>
> What I am doing is converting the string of FBpp IDs into a character
> vector and then running each line through the getBM command. First, I think
> it is a bad idea, as I am using a loop to query an online database, but I
> don't have a better option at the moment.
>
> The second problem is that it just takes a lot of time.
>
> I would appreciate your ideas, if there is a better/faster way of doing it.
>
> Thanks A.



-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact


