[BioC] GoStats and microRNA pipeline using Biomart

David martin vilanew at gmail.com
Fri Apr 1 11:46:03 CEST 2011


Hi marc,
Thanks for the tip, i realized that my function was too slow. I agree 
with you. Getting all GO in one shot is a much better approach.

On 03/31/2011 09:23 PM, Marc Carlson wrote:
> Hi David,
>
> If this was your function you would 1st of all want to just pass in a
> big vector (with your universe of transcript IDs in it) to get out all
> the data. Then making the GOFrame is just a matter of taking all the
> Gene IDs (entrez gene IDs) and all the GO IDs (from any of the three
> ontologies), and the evidence codes into a single data.frame as outlined
> in this document here:
>
> http://www.bioconductor.org/packages/2.7/bioc/vignettes/GOstats/inst/doc/GOstatsForUnsupportedOrganisms.pdf
>
>
> But if it were me, I would attempt to save a little headache for making
> the final table, but just getting only the data I needed from getBM (and
> since they keep the three ontologies separate, that means I would make
> three calls to get BM. So like this
>
> getBioProcgoids <- function (id) {
> getBM(attributes=c(
> 'go_biological_process_id',
> 'go_biological_process_linkage_type',
> 'entrezgene')
> ,filters="ensembl_transcript_id", values=id, mart=mart)
> }
> BioGOs <- getBioProcgoids(
> yourBigUniverseVectorOfEnsemblTranscriptIDsGoesHere )
>
> Then do separate small functions to get the other two ontologies and
> call them etc.
>
> Then something like this:
>
> myGOFrame <- rbind(BioGOs, CCGOs, MFGOs)
>
> To stick them all together.
>
> Does that help?
>
>
> Marc
>
>
>
> On 03/31/2011 02:47 AM, David martin wrote:
>> Ok thanks,
>> Any idea on how to turn the biomart output into a valid GOFrame input ??
>>
>> For example :
>> I wrote this function
>>
>> getgoids <- function (id) {
>> getBM(attributes=c(
>> 'entrezgene',
>> 'ensembl_transcript_id',
>> 'go_biological_process_id',
>> 'go_biological_process_linkage_type',
>> 'go_cellular_component_id',
>> 'go_cellular_component_linkage_type',
>> 'go_molecular_function_id',
>> 'go_molecular_function_linkage_type')
>> ,filters="ensembl_transcript_id", values=id, mart=mart)
>> }
>> foo
>>
>> How do i turn this into a valid GOFrame Object ?
>>
>> thanks,
>> david
>>
>>
>>
>>
>> On 03/31/2011 10:10 AM, James F. Reid wrote:
>>> Hi David,
>>>
>>> On 03/30/2011 08:31 PM, David martin wrote:
>>> > Yes absolutly. A few ensembl releases ago UTR tend to be smaller but
>>> > this is getting better now. How would you normalize that based on
>>> length ?
>>>
>>> I'm afraid that I don't have a simple answer to this it would need
>>> thinking out especially wrt to your GO enrichment analysis.
>>> Any ideas from the members of the list?
>>>
>>> Best,
>>> J.
>>>
>>>> On 03/30/2011 07:00 PM, James F. Reid wrote:
>>>>> Hi David,
>>>>>
>>>>> I understand your reasoning for counting the number of miRNA binding
>>>>> sites with the 3' UTR of a predicted target, you are trying to include
>>>>> the 'combinatorial' effect of miRNA targeting.
>>>>> I would try to include the length of any UTR however (some kind of
>>>>> normalization if you wish) since the longer the UTR the more
>>>>> chances are
>>>>> that miRNA will bind.
>>>>> Does this make sense?
>>>>>
>>>>> Best,
>>>>> J.
>>>>>
>>>>> On 03/30/2011 05:23 PM, David martin wrote:
>>>>>> On 03/30/2011 04:56 PM, Steve Lianoglou wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On Wed, Mar 30, 2011 at 9:43 AM, David
>>>>>>> martin<vilanew at gmail.com> wrote:
>>>>>>>> Hi,
>>>>>>>> I open this new discussion so not to confuse with the previous one.
>>>>>>>>
>>>>>>>> The objective here is to look for overrepresented GoTerms from
>>>>>>>> microRNA
>>>>>>>> targets. One microRNA can have several targets (genes) and one
>>>>>>>> single
>>>>>>>> gene
>>>>>>>> can be targeted by several microRNAs. The assumption is to check
>>>>>>>> for a
>>>>>>>> specific microRNAs which GoTerms are overrepresented.
>>>>>>>>
>>>>>>>>
>>>>>>>> Ok so let's say me my microRNA of interest is mir-A.
>>>>>>>>
>>>>>>>> Step1: based on my favorite prediction algorithm i have managed to
>>>>>>>> get a
>>>>>>>> list of genes targeted by mir-A. The genes are ensembl transcripts
>>>>>>>> and as i
>>>>>>>> said before miR-A can target several times the same transcript (at
>>>>>>>> different
>>>>>>>> location) so i need to account for this.
>>>>>>>>
>>>>>>>> miR-A targets ->
>>>>>>>> ENST001,ENST001,ENST001,ENST0025,ENST089,ENST099,ENST0099......) up
>>>>>>>> to 300
>>>>>>>> different transcripts.
>>>>>>>
>>>>>>> I don't get why you'd want to have the same transcript multiple
>>>>>>> times
>>>>>>> as a target for the miRNA -- if the miRNA targets the same
>>>>>>> transcript
>>>>>>> in two different locations, you then want to double count the GO
>>>>>>> terms
>>>>>>> associated with that transcript?
>>>>>>
>>>>>> That's correct. The idea behind that is that a transcript targeted at
>>>>>> different locations is more "likely to be twice targeted" and
>>>>>> therefore
>>>>>> GO term associated to this transcript have to be replicated. This
>>>>>> sound
>>>>>> good to me but i don not expect that you agree on that.
>>>>>>
>>>>>>
>>>>>> i have managed to get all GO ids with a small function. Basically you
>>>>>> input one transcript id in a loop
>>>>>>
>>>>>> l = length(genes) # list of all ensembl transcripts
>>>>>> for (l in 1:l)
>>>>>> {
>>>>>> goid[l] <- getgoids("ENST...")
>>>>>>
>>>>>> }
>>>>>> getgoids <- function (id) {
>>>>>> getBM(attributes=c(
>>>>>> 'go_biological_process_id',
>>>>>> 'go_biological_process_linkage_type',
>>>>>> 'go_cellular_component_id',
>>>>>> 'go_cellular_component_linkage_type',
>>>>>> 'go_molecular_function_id',
>>>>>> 'go_molecular_function_linkage_type')
>>>>>> ,filters="ensembl_transcript_id", values=id, mart=mart)
>>>>>> }
>>>>>>
>>>>>> I agree wioth you that i might need to add the transcript_id to be
>>>>>> able
>>>>>> to use for GoStats mapping between transcripts and GO ids.
>>>>>>
>>>>>>
>>>>>> Now i want to use that as the univere set for GoStats and do
>>>>>> hyperG to
>>>>>> compare with the GO for a specific microRNA.
>>>>>>
>>>>>> I guess :
>>>>>>
>>>>>> goframeData = data.frame(frame$go_id, frame$Evidence, frame$gene_id)
>>>>>> #list of all GOids from all transcripts targeted by all microRNA
>>>>>>
>>>>>> goFrame = GOFrame(goframeData, organism = "Homo sapiens")
>>>>>> goAllFrame = GOAllFrame(goFrame) #Geneid to ALL go id mapping
>>>>>>
>>>>>>
>>>>>> In the GSEAGOHyperGParams function below can you correct me ?
>>>>>> geneSetCollection = List of all go ids off all transcripts
>>>>>> targetted by
>>>>>> all microRNA
>>>>>> single_mir_transcript_ids = list of ENSEMBl transcripts ids
>>>>>> targeted by
>>>>>> a specific microRNA
>>>>>> univerGeneIds: list of transcript to Go mapping
>>>>>> Is this correc t?
>>>>>>
>>>>>>
>>>>>> gsc <- GeneSetCollection(goAllFrame, setType = GOCollection())
>>>>>> params <- GSEAGOHyperGParams(name = "My Custom GSEA based annot
>>>>>> Params",geneSetCollection = gsc, geneIds =
>>>>>> single_mir_transcripts_ids,
>>>>>> universeGeneIds = universe,ontology = "BP", pvalueCutoff = 0.05,
>>>>>> conditional = FALSE,testDirection = "over")
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Somehow that seems wrong to me -- if the "hit count" of the miRNA to
>>>>>>> the transcript is important to you, one thing you can do is store
>>>>>>> your
>>>>>>> miR-A vector as its "table()" so the names will the the transcripts,
>>>>>>> and the values will be the number of hits.
>>>>>>>
>>>>>>>> I use biomart to get the corresponding GoIds for these transcripts
>>>>>>>>
>>>>>>>> ....
>>>>>>>> #Select mart database
>>>>>>>> mart<- useMart("ensembl", dataset="hsapiens_gene_ensembl")
>>>>>>>>
>>>>>>>> #Get go for a specific transcript
>>>>>>>> # First problem as Biomart will not return twice GoTerms for
>>>>>>>> duplicated
>>>>>>>> transcripts. The example below show that for transcript
>>>>>>>> c("ENST00000347770","ENST00000347770") i get the same goTerms than
>>>>>>>> for
>>>>>>>> transcript c("ENST00000347770").
>>>>>>>> # As i said before a microRNA can target several times the same
>>>>>>>> microRNA so
>>>>>>>> twice the number of goterms associated to this particular microRNA.
>>>>>>>> Can we
>>>>>>>> force biomart to return redundant GoTerms ????
>>>>>>>
>>>>>>> I'm actually still not sure what you want to do, but if you
>>>>>>> follow my
>>>>>>> advice above, you can manipulate the data.frame you get from
>>>>>>> getBM to
>>>>>>> replicate rows (or whatever you're trying to do).
>>>>>>>
>>>>>>> You will also want to add "ensembl_transcript_id" to your vector of
>>>>>>> attributes so you can reassociate the rows in the table that is
>>>>>>> returned to you with your original ensembl transcripts you are
>>>>>>> querying for, eg:
>>>>>>>
>>>>>>> R> gomir<- getBM(attributes=c('ensembl_transcript_id', 'go..', ...),
>>>>>>> filters='ensemble_transcript_id', values=c("ENST..."), mart=mart)
>>>>>>>
>>>>>>> Hope that helps,
>>>>>>> -steve
>>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>



More information about the Bioconductor mailing list