[BioC] biomart to a data.frame

Sebastian Thieme thieme at mi.fu-berlin.de
Thu Jan 26 09:52:47 CET 2012


Hi Assa,

you can try this:

con <- textConnection(data2seperate)
# split on ';'; fill=TRUE pads rows that hold fewer IDs
seperatedData <- read.table(con, sep=";", stringsAsFactors=FALSE, fill=TRUE)
close(con)

It does nearly the same as the strsplit function, but you get a table as
output with one row per input line, in the order of your input. I hope
this helps.
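
A minimal sketch of the same idea, assuming the IDs sit in a character
vector. The vector below is filled with the example rows quoted further
down in this thread, and the column names id1, id2, ... are made up for
illustration. Since read.table only looks at the first five lines to
decide how many columns there are, the sketch counts the fields first
and passes explicit col.names:

```r
# Example rows taken from the thread; one string per input row.
data2seperate <- c("FBpp0070037",
                   "FBpp0070039;FBpp0070040",
                   "FBpp0070041;FBpp0070042;FBpp0070043",
                   "FBpp0070044;FBpp0110571")

# Largest number of IDs found in any row.
con <- textConnection(data2seperate)
nCols <- max(count.fields(con, sep = ";"))
close(con)

# fill = TRUE pads the shorter rows with empty fields;
# col.names fixes the number of columns up front.
con <- textConnection(data2seperate)
seperatedData <- read.table(con, sep = ";", stringsAsFactors = FALSE,
                            fill = TRUE,
                            col.names = paste0("id", seq_len(nCols)))
close(con)

# Row i of the result corresponds to element i of data2seperate.
print(seperatedData)
```

Because the rows come back in input order, the original row Ids can be
added again with cbind() if they are kept in a separate vector.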

Best

Basti
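
For the row-wise reassembly asked about in the quoted message below, one
possible approach is sketched here. It assumes the table has a unique row
Id column; the names dat, long and gene are hypothetical. The getBM call
is left commented out because it needs network access, and the lookup
table ans is a stand-in copied from the answer table quoted in this
thread, so the two IDs of row 48 have no match in it.

```r
# Example table from the thread: a row Id plus ';'-separated protein IDs.
dat <- data.frame(Id = 45:48,
                  protein = c("FBpp0070037",
                              "FBpp0070039;FBpp0070040",
                              "FBpp0070041;FBpp0070042;FBpp0070043",
                              "FBpp0070044;FBpp0110571"),
                  stringsAsFactors = FALSE)

# Long format: one protein ID per row, remembering the source row Id.
idList <- strsplit(dat$protein, ";", fixed = TRUE)
long <- data.frame(Id = rep(dat$Id, lengths(idList)),
                   flybase_translation_id = unlist(idList),
                   stringsAsFactors = FALSE)

# One query for all unique IDs (not run here, needs network access):
# library(biomaRt)
# mart <- useMart("ensembl", dataset = "dmelanogaster_gene_ensembl")
# ans <- getBM(attributes = c("flybase_translation_id", "flybase_gene_id"),
#              filters = "flybase_translation_id",
#              values = unique(long$flybase_translation_id), mart = mart)

# Stand-in for the query result, copied from the table in the thread.
ans <- data.frame(
  flybase_translation_id = c("FBpp0070037", "FBpp0070039", "FBpp0070040",
                             "FBpp0070041", "FBpp0070042", "FBpp0070043"),
  flybase_gene_id = c("FBgn0010215", "FBgn0052230", "FBgn0052230",
                      "FBgn0000258", "FBgn0000258", "FBgn0000258"),
  stringsAsFactors = FALSE)

# Look up gene IDs with match() so the original ID order is preserved;
# protein IDs missing from the result become NA.
idx <- match(long$flybase_translation_id, ans$flybase_translation_id)
long$flybase_gene_id <- ans$flybase_gene_id[idx]

# Collapse back to one ';'-separated string per original row.
perRow <- split(long$flybase_gene_id, factor(long$Id, levels = dat$Id))
dat$gene <- unname(vapply(perRow, paste, character(1), collapse = ";"))
print(dat)
```

match() is used instead of merge() here because merge() may reorder the
rows, and the order of the IDs inside each row is what has to survive.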


2012/1/26 Assa Yeroslaviz <frymor at gmail.com>:
> Hi Steve,
>
> thanks for the help.
>
> I know about the strsplit function and I used it to split each row on its
> own by the ';' symbol.
> The problem I have is that I need to keep the information of each row in
> the row (or at least to restore it after the biomaRt extraction).
>
> The table I have contains not only the protein IDs but also a lot of other
> stuff, which is connected to each of the proteins. This is why I need to
> know which proteins came from which line (Id).
>
> It would be nice if it were possible to do it as you suggested: take
> all the protein IDs, write them into one vector and run them through
> biomaRt. But then I would like to be able to put them back together in a
> row-wise fashion like I suggested at the beginning.
>
> Thanks again
> Assa
>
> On Wed, Jan 25, 2012 at 16:02, Steve Lianoglou <mailinglist.honeypot at gmail.com> wrote:
>
>> Hi Assa,
>>
>> Sorry for top posting.
>>
>> Your intuition is correct: you should not be querying biomart
>> inside a for loop. The idea is to create one query for all of your
>> protein IDs, and query it once.
>>
>> This is how you might go about it. First, let's look at the protein
>> IDs you already seem to have somewhere:
>>
>> > 45  FBpp0070037
>> > 46  FBpp0070039;FBpp0070040
>> > 47  FBpp0070041;FBpp0070042;FBpp0070043
>> > 48  FBpp0070044;FBpp0110571
>>
>> It seems you have multiple IDs jammed into one column of a data.frame,
>> maybe? The rows which have more than one ID (e.g.
>> "FBpp0070039;FBpp0070040") will have to be split up so that each row
>> (or element in a vector) only has one ID. Look into using `strsplit`.
>>
>> You will need to get a character vector of protein IDs, one protein
>> per element. It might look like so:
>>
>> pids <- c('FBpp0070037', 'FBpp0070039', 'FBpp0070040', 'FBpp0070041',
>>          'FBpp0070042', 'FBpp0070043')
>>
>> Now ... you're basically done. Let's rig up an object to query biomart
>> with:
>>
>> library(biomaRt)
>> mart <- useMart('ensembl', dataset='dmelanogaster_gene_ensembl')
>> ans <- getBM(attributes=c("flybase_translation_id", "flybase_gene_id",
>>                           "flybasename_gene"),
>>              filters="flybase_translation_id", values=pids, mart=mart)
>>
>> Your answer will look like so:
>>
>>  flybase_translation_id flybase_gene_id flybasename_gene
>> 1            FBpp0070037     FBgn0010215        alpha-Cat
>> 2            FBpp0070039     FBgn0052230          CG32230
>> 3            FBpp0070040     FBgn0052230          CG32230
>> 4            FBpp0070041     FBgn0000258        CkIIalpha
>> 5            FBpp0070042     FBgn0000258        CkIIalpha
>> 6            FBpp0070043     FBgn0000258        CkIIalpha
>>
>> Now you're left with figuring out what to do with multiple
>> "flybase_translation_id"s that map to the same "flybasename_gene".
>>
>> You would have to do this anyway, but the key point here is that you
>> can now do it without querying biomart in a loop.
>>
>> HTH,
>> -steve
>>
>>
>>
>> > For each of these protein IDs (FBpp...), I would like to extract the gene
>> > ID (FBgn...) in a third column. The output table should look like this:
>> >
>> > 45  FBpp0070037                          FBgn001234
>> > 46  FBpp0070039;FBpp0070040              FBgn00094432;FBgn002345
>> > 47  FBpp0070041;FBpp0070042;FBpp0070043  FBgn0001936;FBgn000102;FBgn004527
>> > 48  FBpp0070044;FBpp0110571              FBgn0097234;FBgn00183
>> > ...
>> >
>> > I was thinking of using biomaRt, but I could not find a way of automating
>> > it for all the protein IDs in a line.
>> >
>> > What I have done so far is this for loop:
>> >
>> > for(i in 1:dim(data)[1]){
>> >  temp=unlist(strsplit(data[i,2],";"))
>> >  temp= gsub("REV__", "", temp)
>> >  result=getBM(attributes=c("flybase_translation_id","flybase_gene_id",
>> >                            "flybasename_gene"),
>> >               filters="flybase_translation_id",values=temp, mart=mart)
>> >      charresult =""
>> >      for (j in 1:length(result[[1]])) {
>> > #          charresult<-paste(charresult,">", result[[1]][j],":",
>> > #                            result[[2]][j], "\t", sep="")
>> >          charresult<-paste(charresult, result[[2]][j], ";", sep="")
>> >          }
>> >      out<-"CompleteResults.txt"
>> >      cat("line: ", i-1,"\t", "was written to ");cat(out);cat("\n")
>> >      write.table(paste(i-1, charresult, sep="\t"),out, sep="\t", quote=F,
>> >                  col.names=F, row.names=F,append=T)
>> >    }
>> >
>> > What I am doing is converting the string of FBpp IDs into a character
>> > vector and then running each line through the getBM command. I think it
>> > is a bad idea, as I am using a loop to query an online database, but I
>> > don't have a better option at the moment.
>> >
>> > The second problem is that it just takes a lot of time.
>> >
>> > I would appreciate your ideas if there is a better/faster way of doing
>> > it.
>> >
>> > Thanks A.
>> >
>> >        [[alternative HTML version deleted]]
>> >
>> > _______________________________________________
>> > Bioconductor mailing list
>> > Bioconductor at r-project.org
>> > https://stat.ethz.ch/mailman/listinfo/bioconductor
>> > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>>
>> --
>> Steve Lianoglou
>> Graduate Student: Computational Systems Biology
>>  | Memorial Sloan-Kettering Cancer Center
>>  | Weill Medical College of Cornell University
>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>