[BioC] GoStats and microRNA pipeline using Biomart

Wed Mar 30 17:23:56 CEST 2011

On 03/30/2011 04:56 PM, Steve Lianoglou wrote:
> Hi,
>
> On Wed, Mar 30, 2011 at 9:43 AM, David martin<vilanew at gmail.com>  wrote:
>> Hi,
>> I am opening a new thread so as not to confuse this with the previous one.
>>
>> The objective here is to look for overrepresented GO terms among microRNA
>> targets. One microRNA can have several targets (genes) and a single gene
>> can be targeted by several microRNAs. The aim is to find, for a specific
>> microRNA, which GO terms are overrepresented.
>>
>>
>> Ok so let's say my microRNA of interest is mir-A.
>>
>> Step1: based on my favorite prediction algorithm I have managed to get a
>> list of genes targeted by mir-A. The genes are Ensembl transcripts and, as I
>> said before, miR-A can target the same transcript several times (at different
>> locations), so I need to account for this.
>>
>> miR-A targets ->
>> ENST001,ENST001,ENST001,ENST0025,ENST089,ENST099,ENST0099......) up to 300
>> different transcripts.
>
> I don't get why you'd want to have the same transcript multiple times
> as a target for the miRNA -- if the miRNA targets the same transcript
> in two different locations, you then want to double count the GO terms
> associated with that transcript?

That's correct. The idea is that a transcript targeted at several
locations is more likely to be repressed, so the GO terms associated
with that transcript should be counted once per target site. This sounds
reasonable to me, but I do not expect you to agree.
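
To make the counting concrete, here is a minimal base R sketch of keeping the hit counts with table() (as suggested in the quoted text further down) and replicating annotation rows once per target site. The transcript and GO ids are placeholders, and `anno` is a made-up stand-in for a getBM() result that includes the transcript id:

```r
# Hypothetical target list for one miRNA; a repeated id means several sites
targets <- c("ENST_A", "ENST_A", "ENST_A", "ENST_B", "ENST_C", "ENST_C")

# Hit count per transcript: names are transcripts, values are number of sites
hits <- table(targets)

# Made-up annotation table (one row per transcript/GO pair), standing in
# for a getBM() result that includes 'ensembl_transcript_id'
anno <- data.frame(
  ensembl_transcript_id = c("ENST_A", "ENST_A", "ENST_B", "ENST_C"),
  go_id = c("GO_1", "GO_2", "GO_1", "GO_3"),
  stringsAsFactors = FALSE
)

# Replicate each annotation row once per hit on its transcript
n_rep <- as.integer(hits[anno$ensembl_transcript_id])
weighted <- anno[rep(seq_len(nrow(anno)), times = n_rep), ]
rownames(weighted) <- NULL
```

Keeping `hits` separate from `anno` means the unweighted table is still available if the double counting turns out not to be wanted.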

I have managed to get all GO ids with a small function. Basically you
pass it one transcript id at a time in a loop:

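library(biomaRt)  # 'mart' is the useMart() object from the quoted code below

# Return the GO ids and evidence codes for the given transcript id(s)
getgoids <- function (id) {
   getBM(attributes=c(
           'go_biological_process_id',
           'go_biological_process_linkage_type',
           'go_cellular_component_id',
           'go_cellular_component_linkage_type',
           'go_molecular_function_id',
           'go_molecular_function_linkage_type')
         ,filters="ensembl_transcript_id",  values=id,  mart=mart)
}

# genes: character vector of all ensembl transcript ids
goid <- vector("list", length(genes))
for (i in seq_along(genes))
{
   # one data.frame of GO annotations per transcript
   goid[[i]] <- getgoids(genes[i])
}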

I agree with you that I need to add the transcript id to the attributes
so that GOstats can map between transcripts and GO ids.

Now I want to use that as the universe set for GOstats and run hyperG to
compare it with the GO terms for a specific microRNA.

I guess :

goframeData = data.frame(frame$go_id, frame$Evidence, frame$gene_id) 
#list of all GOids from all transcripts targeted by all microRNA

goFrame = GOFrame(goframeData, organism = "Homo sapiens")
goAllFrame = GOAllFrame(goFrame) #Geneid to ALL go id mapping
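
For reference, one way `frame` could be put together from getBM() output is sketched below in base R. This is only a sketch: `res` is a made-up stand-in for a getBM() result that includes 'ensembl_transcript_id', the ids and evidence codes are placeholders, `stack_ontology` is a hypothetical helper, and I am assuming missing annotations come back as empty strings or NA:

```r
# Made-up stand-in for a getBM() result that includes the transcript id
res <- data.frame(
  ensembl_transcript_id = c("ENST_A", "ENST_A", "ENST_B"),
  go_biological_process_id = c("GO_bp1", "GO_bp2", ""),
  go_biological_process_linkage_type = c("IEA", "IDA", ""),
  go_molecular_function_id = c("GO_mf1", "", "GO_mf2"),
  go_molecular_function_linkage_type = c("IEA", "", "TAS"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: stack one ontology into (go_id, Evidence, gene_id),
# dropping rows where the transcript has no annotation in that ontology
stack_ontology <- function(res, prefix) {
  out <- data.frame(
    go_id = res[[paste0(prefix, "_id")]],
    Evidence = res[[paste0(prefix, "_linkage_type")]],
    gene_id = res$ensembl_transcript_id,
    stringsAsFactors = FALSE
  )
  out[!is.na(out$go_id) & out$go_id != "", ]
}

# Long table with one row per (GO id, evidence code, transcript) triple
frame <- unique(rbind(
  stack_ontology(res, "go_biological_process"),
  stack_ontology(res, "go_molecular_function")
))
rownames(frame) <- NULL
```

The column order (GO id, evidence code, gene id) matches the data.frame() call above; here the transcript id plays the role of the gene id.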

In the GSEAGOHyperGParams call below, can you correct me?
geneSetCollection = all GO ids of all transcripts targeted by
all microRNAs
geneIds (single_mir_transcripts_ids) = Ensembl transcript ids targeted by
a specific microRNA
universeGeneIds = list of transcript to GO mapping
Is this correct?

gsc <- GeneSetCollection(goAllFrame, setType = GOCollection())
params <- GSEAGOHyperGParams(name = "My Custom GSEA based annot Params",
    geneSetCollection = gsc, geneIds = single_mir_transcripts_ids,
    universeGeneIds = universe, ontology = "BP", pvalueCutoff = 0.05,
    conditional = FALSE, testDirection = "over")
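
To spell out what an over-representation test with testDirection = "over" asks for a single GO term, here is a base R sketch using phyper(). All counts are invented for illustration, and this is not the GOstats code itself, only the underlying hypergeometric calculation as I understand it:

```r
# Hypothetical counts for one GO term
N <- 300   # transcripts in the universe
K <- 40    # universe transcripts annotated to the term
n <- 25    # distinct transcripts targeted by the miRNA of interest
k <- 8     # of those, how many are annotated to the term

# P(X >= k) when drawing n transcripts without replacement from the universe
p_over <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)

# The same quantity as a one-sided Fisher exact test on the 2x2 table
# (rows: in term / not in term; columns: targeted / not targeted)
tab <- matrix(c(k, n - k, K - k, N - K - (n - k)), nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value
```

In this formulation each transcript is either in the selected set or not, so a transcript listed several times still contributes once to n and k.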

>
> Somehow that seems wrong to me -- if the "hit count" of the miRNA to
> the transcript is important to you, one thing you can do is store your
> miR-A vector as its "table()" so the names will be the transcripts,
> and the values will be the number of hits.
>
>> I use biomart to get the corresponding GoIds for these transcripts
>>
>> ....
>> #Select mart database
>> mart<- useMart("ensembl", dataset="hsapiens_gene_ensembl")
>>
>> #Get go for a specific transcript
>> # First problem as Biomart will not return twice GoTerms for duplicated
>> transcripts. The example below show that for transcript
>> c("ENST00000347770","ENST00000347770") i get the same goTerms than for
>> transcript c("ENST00000347770").
>> # As I said before, a microRNA can target the same transcript several times,
>> so the GO terms associated with that transcript should be counted that many
>> times. Can we force biomart to return redundant GO terms?
>
> I'm actually still not sure what you want to do, but if you follow my
> advice above, you can manipulate the data.frame you get from getBM to
> replicate rows (or whatever you're trying to do).
>
> You will also want to add "ensembl_transcript_id" to your vector of
> attributes so you can reassociate the rows in the table that is
> returned to you with your original ensembl transcripts you are
> querying for, eg:
>
> R>  gomir<- getBM(attributes=c('ensembl_transcript_id', 'go..', ...),
>      filters='ensembl_transcript_id', values=c("ENST..."), mart=mart)
>
> Hope that helps,
> -steve
>