[BioC] GoStats and microRNA pipeline using Biomart

Thu Mar 31 21:23:03 CEST 2011

Hi David,

If this was your function you would 1st of all want to just pass in a 
big vector (with your universe of transcript IDs in it) to get out all 
the data.  Then making the GOFrame is just a matter of taking all the 
Gene IDs (entrez gene IDs) and all the GO IDs (from any of the three 
ontologies), and the evidence codes into a single data.frame as outlined 
in this document here:

http://www.bioconductor.org/packages/2.7/bioc/vignettes/GOstats/inst/doc/GOstatsForUnsupportedOrganisms.pdf

But if it were me, I would attempt to save a little headache for making 
the final table, but just getting only the data I needed from getBM (and 
since they keep the three ontologies separate, that means I would make 
three calls to get BM.  So like this

getBioProcgoids <- function (id) {
   getBM(attributes=c(
           'go_biological_process_id',
           'go_biological_process_linkage_type',
           'entrezgene')
         ,filters="ensembl_transcript_id",  values=id,  mart=mart)
}
BioGOs <- getBioProcgoids( 
yourBigUniverseVectorOfEnsemblTranscriptIDsGoesHere )

Then do separate small functions to get the other two ontologies and 
call them etc.

Then something like this:

myGOFrame <- rbind(BioGOs, CCGOs, MFGOs)

To stick them all together.

Does that help?

   Marc

On 03/31/2011 02:47 AM, David martin wrote:
> Ok thanks,
> Any idea on how to turn the biomart output into a valid GOFrame input ??
>
> For example :
> I wrote this function
>
> getgoids <- function (id) {
>   getBM(attributes=c(
>           'entrezgene',
>           'ensembl_transcript_id',
>           'go_biological_process_id',
>           'go_biological_process_linkage_type',
>           'go_cellular_component_id',
>           'go_cellular_component_linkage_type',
>           'go_molecular_function_id',
>           'go_molecular_function_linkage_type')
>         ,filters="ensembl_transcript_id",  values=id,  mart=mart)
> }
> foo
>
> How do i turn this into a valid GOFrame Object ?
>
> thanks,
> david
>
>
>
>
> On 03/31/2011 10:10 AM, James F. Reid wrote:
>> Hi David,
>>
>> On 03/30/2011 08:31 PM, David martin wrote:
>> > Yes absolutly. A few ensembl releases ago UTR tend to be smaller but
>> > this is getting better now. How would you normalize that based on
>> length ?
>>
>> I'm afraid that I don't have a simple answer to this it would need
>> thinking out especially wrt to your GO enrichment analysis.
>> Any ideas from the members of the list?
>>
>> Best,
>> J.
>>
>>> On 03/30/2011 07:00 PM, James F. Reid wrote:
>>>> Hi David,
>>>>
>>>> I understand your reasoning for counting the number of miRNA binding
>>>> sites with the 3' UTR of a predicted target, you are trying to include
>>>> the 'combinatorial' effect of miRNA targeting.
>>>> I would try to include the length of any UTR however (some kind of
>>>> normalization if you wish) since the longer the UTR the more 
>>>> chances are
>>>> that miRNA will bind.
>>>> Does this make sense?
>>>>
>>>> Best,
>>>> J.
>>>>
>>>> On 03/30/2011 05:23 PM, David martin wrote:
>>>>> On 03/30/2011 04:56 PM, Steve Lianoglou wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On Wed, Mar 30, 2011 at 9:43 AM, David
>>>>>> martin<vilanew at gmail.com> wrote:
>>>>>>> Hi,
>>>>>>> I open this new discussion so not to confuse with the previous one.
>>>>>>>
>>>>>>> The objective here is to look for overrepresented GoTerms from
>>>>>>> microRNA
>>>>>>> targets. One microRNA can have several targets (genes) and one 
>>>>>>> single
>>>>>>> gene
>>>>>>> can be targeted by several microRNAs. The assumption is to check
>>>>>>> for a
>>>>>>> specific microRNAs which GoTerms are overrepresented.
>>>>>>>
>>>>>>>
>>>>>>> Ok so let's say me my microRNA of interest is mir-A.
>>>>>>>
>>>>>>> Step1: based on my favorite prediction algorithm i have managed to
>>>>>>> get a
>>>>>>> list of genes targeted by mir-A. The genes are ensembl transcripts
>>>>>>> and as i
>>>>>>> said before miR-A can target several times the same transcript (at
>>>>>>> different
>>>>>>> location) so i need to account for this.
>>>>>>>
>>>>>>> miR-A targets ->
>>>>>>> ENST001,ENST001,ENST001,ENST0025,ENST089,ENST099,ENST0099......) up
>>>>>>> to 300
>>>>>>> different transcripts.
>>>>>>
>>>>>> I don't get why you'd want to have the same transcript multiple 
>>>>>> times
>>>>>> as a target for the miRNA -- if the miRNA targets the same 
>>>>>> transcript
>>>>>> in two different locations, you then want to double count the GO 
>>>>>> terms
>>>>>> associated with that transcript?
>>>>>
>>>>> That's correct. The idea behind that is that a transcript targeted at
>>>>> different locations is more "likely to be twice targeted" and 
>>>>> therefore
>>>>> GO term associated to this transcript have to be replicated. This 
>>>>> sound
>>>>> good to me but i don not expect that you agree on that.
>>>>>
>>>>>
>>>>> i have managed to get all GO ids with a small function. Basically you
>>>>> input one transcript id in a loop
>>>>>
>>>>> l = length(genes) # list of all ensembl transcripts
>>>>> for (l in 1:l)
>>>>> {
>>>>> goid[l] <- getgoids("ENST...")
>>>>>
>>>>> }
>>>>> getgoids <- function (id) {
>>>>> getBM(attributes=c(
>>>>> 'go_biological_process_id',
>>>>> 'go_biological_process_linkage_type',
>>>>> 'go_cellular_component_id',
>>>>> 'go_cellular_component_linkage_type',
>>>>> 'go_molecular_function_id',
>>>>> 'go_molecular_function_linkage_type')
>>>>> ,filters="ensembl_transcript_id", values=id, mart=mart)
>>>>> }
>>>>>
>>>>> I agree wioth you that i might need to add the transcript_id to be 
>>>>> able
>>>>> to use for GoStats mapping between transcripts and GO ids.
>>>>>
>>>>>
>>>>> Now i want to use that as the univere set for GoStats and do 
>>>>> hyperG to
>>>>> compare with the GO for a specific microRNA.
>>>>>
>>>>> I guess :
>>>>>
>>>>> goframeData = data.frame(frame$go_id, frame$Evidence, frame$gene_id)
>>>>> #list of all GOids from all transcripts targeted by all microRNA
>>>>>
>>>>> goFrame = GOFrame(goframeData, organism = "Homo sapiens")
>>>>> goAllFrame = GOAllFrame(goFrame) #Geneid to ALL go id mapping
>>>>>
>>>>>
>>>>> In the GSEAGOHyperGParams function below can you correct me ?
>>>>> geneSetCollection = List of all go ids off all transcripts 
>>>>> targetted by
>>>>> all microRNA
>>>>> single_mir_transcript_ids = list of ENSEMBl transcripts ids 
>>>>> targeted by
>>>>> a specific microRNA
>>>>> univerGeneIds: list of transcript to Go mapping
>>>>> Is this correc t?
>>>>>
>>>>>
>>>>> gsc <- GeneSetCollection(goAllFrame, setType = GOCollection())
>>>>> params <- GSEAGOHyperGParams(name = "My Custom GSEA based annot
>>>>> Params",geneSetCollection = gsc, geneIds = 
>>>>> single_mir_transcripts_ids,
>>>>> universeGeneIds = universe,ontology = "BP", pvalueCutoff = 0.05,
>>>>> conditional = FALSE,testDirection = "over")
>>>>>
>>>>>
>>>>>>
>>>>>> Somehow that seems wrong to me -- if the "hit count" of the miRNA to
>>>>>> the transcript is important to you, one thing you can do is store 
>>>>>> your
>>>>>> miR-A vector as its "table()" so the names will the the transcripts,
>>>>>> and the values will be the number of hits.
>>>>>>
>>>>>>> I use biomart to get the corresponding GoIds for these transcripts
>>>>>>>
>>>>>>> ....
>>>>>>> #Select mart database
>>>>>>> mart<- useMart("ensembl", dataset="hsapiens_gene_ensembl")
>>>>>>>
>>>>>>> #Get go for a specific transcript
>>>>>>> # First problem as Biomart will not return twice GoTerms for
>>>>>>> duplicated
>>>>>>> transcripts. The example below show that for transcript
>>>>>>> c("ENST00000347770","ENST00000347770") i get the same goTerms than
>>>>>>> for
>>>>>>> transcript c("ENST00000347770").
>>>>>>> # As i said before a microRNA can target several times the same
>>>>>>> microRNA so
>>>>>>> twice the number of goterms associated to this particular microRNA.
>>>>>>> Can we
>>>>>>> force biomart to return redundant GoTerms ????
>>>>>>
>>>>>> I'm actually still not sure what you want to do, but if you 
>>>>>> follow my
>>>>>> advice above, you can manipulate the data.frame you get from 
>>>>>> getBM to
>>>>>> replicate rows (or whatever you're trying to do).
>>>>>>
>>>>>> You will also want to add "ensembl_transcript_id" to your vector of
>>>>>> attributes so you can reassociate the rows in the table that is
>>>>>> returned to you with your original ensembl transcripts you are
>>>>>> querying for, eg:
>>>>>>
>>>>>> R> gomir<- getBM(attributes=c('ensembl_transcript_id', 'go..', ...),
>>>>>> filters='ensemble_transcript_id', values=c("ENST..."), mart=mart)
>>>>>>
>>>>>> Hope that helps,
>>>>>> -steve
>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: 
> http://news.gmane.org/gmane.science.biology.informatics.conductor