[BioC] GoStats and microRNA pipeline using Biomart

Iain Gallagher iaingallagher at btopenworld.com
Thu Mar 31 17:31:55 CEST 2011


Hi David

I'm not sure that you need to / should normalise based on UTR length.

The mechanism of miRNA target repression is mostly thought to result from the miRNA binding to the 3'UTR and guiding the target mRNA to an AGO-containing processing unit for destruction. This may not be the only mechanism. In any case the targeting of an mRNA by a miRNA depends on complementarity between the seed region of the miRNA (bases 2-7 or 2-8) and the 3'UTR. There's also some evidence that complementarity at bases 12/13 - 16/17 of the miRNA is important in binding. Furthermore the position of the binding site in the UTR is also important (see the Bartel group's TargetScan papers for more info here - Friedman et al I think and Grimson et al) so some binding sites may be spurious.

Thus the targeting of a given mRNA by a miRNA is not really a function of UTR length per se but of UTR sequence. Presumably the algorithm you have used to select miRNA targets takes the sequence into account (i.e. you already know that there is complementarity).
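To make the sequence-versus-length point concrete, here is a rough base R sketch. It only looks for a perfect match to miRNA bases 2-8, which is a deliberate simplification of what the real prediction algorithms do. The helper names (revcomp_rna, seed_sites) and both toy UTRs are invented for illustration; only the let-7a sequence is real.

```r
# Reverse complement of an RNA string (base R only)
revcomp_rna <- function(x) {
  comp <- chartr("ACGU", "UGCA", x)
  paste(rev(strsplit(comp, "")[[1]]), collapse = "")
}

# Start positions in the UTR of perfect matches to miRNA bases 2-8.
# Hypothetical helper: real predictors also score site context, position etc.
seed_sites <- function(mirna, utr) {
  site <- revcomp_rna(substr(mirna, 2, 8))
  starts <- seq_len(max(nchar(utr) - nchar(site) + 1, 0))
  starts[substring(utr, starts, starts + nchar(site) - 1) == site]
}

let7a <- "UGAGGUAGUAGGUUGUAUAGUU"   # hsa-let-7a-5p

utr_short <- "AAACUACCUCAGG"        # invented 13 nt UTR containing one site
utr_long  <- strrep("AUGC", 100)    # invented 400 nt UTR with no site

short_hits <- seed_sites(let7a, utr_short)
long_hits  <- seed_sites(let7a, utr_long)
```

The short UTR has a site and the much longer one has none, so the number of sites follows the sequence, not the length.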

From your original email you want to correct for the expectation that a given miRNA may target the UTR multiple times when you analyse for GO categories. Would it not be better to weight multiply targeted genes so that they contribute more to the enrichment analysis? So how would one go about this?

Perhaps you need to generate the list of genes that are targeted by miRNA-A, examine this list (using e.g. GOstats) for enriched GO categories, pull out the genes associated with each category (geneIdsByCategory in the Category package), check each category for genes which are multiply targeted, then weight the p-value for a given GO term appropriately (perhaps a simple multiplication by the number of times genes are targeted) and re-rank the p-values.
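A very rough sketch of that weighting step in base R, using toy stand-ins for the GOstats output. The GO labels, p-values and gene lists are all invented. The weighting itself is only one possible reading of the idea: I have assumed that -log10(p) is multiplied by the mean number of sites per gene in the category, so that categories rich in multiply targeted genes move up the ranking.

```r
# Toy prediction output: one entry per predicted site (invented IDs)
mirA_targets <- c("ENST001", "ENST001", "ENST001",
                  "ENST0025", "ENST089", "ENST099")

# Sites per transcript: names are the transcripts, values are hit counts
hits <- table(mirA_targets)

# Stand-ins for the enrichment results. With a real hyperGTest() result
# these would come from pvalues() and geneIdsByCategory() in Category.
go_p <- c("GO:A" = 0.01, "GO:B" = 0.01, "GO:C" = 0.04)
go_genes <- list("GO:A" = c("ENST001", "ENST0025"),
                 "GO:B" = c("ENST089", "ENST099"),
                 "GO:C" = c("ENST001"))

# Mean number of sites per targeted gene in each category
mean_hits <- sapply(go_genes, function(g) mean(as.numeric(hits[g])))

# ASSUMED weighting: scale -log10(p) by the mean hit count, then re-rank
score  <- -log10(go_p) * mean_hits[names(go_p)]
ranked <- sort(score, decreasing = TRUE)
```

With these toy numbers GO:C (one gene hit three times) overtakes GO:A and GO:B despite its larger raw p-value, which shows how strongly the choice of weight drives the ranking.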

Bear in mind, though, that even if a UTR is targeted multiple times by a given miRNA (and predictions still differ widely across algorithms), that particular mRNA may be no more 'repressed' than if it were targeted once.

Anyway - my tuppence worth!

Hope it's helpful to you

best

i

--- On Thu, 31/3/11, James F. Reid <james.reid at ifom-ieo-campus.it> wrote:

> From: James F. Reid <james.reid at ifom-ieo-campus.it>
> Subject: Re: [BioC] GoStats and microRNA pipeline using Biomart
> To: "David martin" <vilanew at gmail.com>
> Cc: bioconductor at stat.math.ethz.ch
> Date: Thursday, 31 March, 2011, 9:10
> Hi David,
> 
> On 03/30/2011 08:31 PM, David martin wrote:
>  > Yes, absolutely. A few Ensembl releases ago UTRs tended to be smaller, but
>  > this is getting better now. How would you normalize that based on length?
> 
> I'm afraid that I don't have a simple answer to this; it would need
> thinking out, especially wrt your GO enrichment analysis.
> Any ideas from the members of the list?
> 
> Best,
> J.
> 
> > On 03/30/2011 07:00 PM, James F. Reid wrote:
> >> Hi David,
> >>
> >> I understand your reasoning for counting the number of miRNA binding
> >> sites within the 3' UTR of a predicted target: you are trying to include
> >> the 'combinatorial' effect of miRNA targeting.
> >> I would try to include the length of any UTR however (some kind of
> >> normalization if you wish), since the longer the UTR, the more chances
> >> there are that a miRNA will bind.
> >> Does this make sense?
> >>
> >> Best,
> >> J.
> >>
> >> On 03/30/2011 05:23 PM, David martin wrote:
> >>> On 03/30/2011 04:56 PM, Steve Lianoglou wrote:
> >>>> Hi,
> >>>>
> >>>> On Wed, Mar 30, 2011 at 9:43 AM, David martin<vilanew at gmail.com> wrote:
> >>>>> Hi,
> >>>>> I open this new discussion so as not to confuse it with the previous one.
> >>>>>
> >>>>> The objective here is to look for overrepresented GO terms among
> >>>>> microRNA targets. One microRNA can have several targets (genes) and one
> >>>>> single gene can be targeted by several microRNAs. The aim is to check,
> >>>>> for a specific microRNA, which GO terms are overrepresented.
> >>>>>
> >>>>>
> >>>>> OK, so let's say my microRNA of interest is miR-A.
> >>>>>
> >>>>> Step 1: based on my favorite prediction algorithm I have managed to get
> >>>>> a list of genes targeted by miR-A. The genes are Ensembl transcripts and,
> >>>>> as I said before, miR-A can target the same transcript several times (at
> >>>>> different locations), so I need to account for this.
> >>>>>
> >>>>> miR-A targets ->
> >>>>> (ENST001,ENST001,ENST001,ENST0025,ENST089,ENST099,ENST0099......)
> >>>>> up to 300 different transcripts.
> >>>>
> >>>> I don't get why you'd want to have the same transcript multiple times
> >>>> as a target for the miRNA -- if the miRNA targets the same transcript
> >>>> in two different locations, you then want to double count the GO terms
> >>>> associated with that transcript?
> >>>
> >>> That's correct. The idea behind that is that a transcript targeted at
> >>> different locations is more "likely to be twice targeted" and therefore
> >>> the GO terms associated with this transcript have to be replicated. This
> >>> sounds good to me but I do not expect that you agree on that.
> >>>
> >>>
> >>> I have managed to get all GO ids with a small function. Basically you
> >>> input one transcript id at a time in a loop:
> >>>
> >>> # look up the GO ids and evidence codes for one Ensembl transcript id
> >>> # (uses the 'mart' object created with useMart())
> >>> getgoids <- function (id) {
> >>>   getBM(attributes=c(
> >>>     'go_biological_process_id',
> >>>     'go_biological_process_linkage_type',
> >>>     'go_cellular_component_id',
> >>>     'go_cellular_component_linkage_type',
> >>>     'go_molecular_function_id',
> >>>     'go_molecular_function_linkage_type'),
> >>>   filters="ensembl_transcript_id", values=id, mart=mart)
> >>> }
> >>>
> >>> # genes: list of all ensembl transcripts
> >>> goid <- vector("list", length(genes))
> >>> for (i in seq_along(genes))
> >>> {
> >>>   goid[[i]] <- getgoids(genes[[i]])
> >>> }
> >>>
> >>> I agree with you that I might need to add the transcript_id to be able
> >>> to use it for the GOstats mapping between transcripts and GO ids.
> >>>
> >>>
> >>> Now I want to use that as the universe set for GOstats and do hyperG to
> >>> compare with the GO for a specific microRNA.
> >>>
> >>> I guess :
> >>>
> >>> goframeData = data.frame(frame$go_id, frame$Evidence, frame$gene_id)
> >>> # list of all GO ids from all transcripts targeted by all microRNA
> >>>
> >>> goFrame = GOFrame(goframeData, organism = "Homo sapiens")
> >>> goAllFrame = GOAllFrame(goFrame) # gene id to ALL GO id mapping
> >>>
> >>>
> >>> Can you correct me on the GSEAGOHyperGParams call below?
> >>> geneSetCollection = list of all GO ids of all transcripts targeted by
> >>> all microRNA
> >>> single_mir_transcripts_ids = list of Ensembl transcript ids targeted by
> >>> a specific microRNA
> >>> universeGeneIds = list of transcript to GO mapping
> >>> Is this correct?
> >>>
> >>>
> >>> gsc <- GeneSetCollection(goAllFrame, setType = GOCollection())
> >>> params <- GSEAGOHyperGParams(name = "My Custom GSEA based annot Params",
> >>>   geneSetCollection = gsc, geneIds = single_mir_transcripts_ids,
> >>>   universeGeneIds = universe, ontology = "BP", pvalueCutoff = 0.05,
> >>>   conditional = FALSE, testDirection = "over")
> >>>
> >>>
> >>>>
> >>>> Somehow that seems wrong to me -- if the "hit count" of the miRNA to
> >>>> the transcript is important to you, one thing you can do is store your
> >>>> miR-A vector as its "table()" so the names will be the transcripts,
> >>>> and the values will be the number of hits.
> >>>>
> >>>>> I use biomart to get the corresponding GO ids for these transcripts
> >>>>>
> >>>>> ....
> >>>>> #Select mart database
> >>>>> mart<- useMart("ensembl", dataset="hsapiens_gene_ensembl")
> >>>>>
> >>>>> #Get go for a specific transcript
> >>>>> # First problem: Biomart will not return GO terms twice for duplicated
> >>>>> # transcripts. The example below shows that for transcript
> >>>>> # c("ENST00000347770","ENST00000347770") I get the same GO terms as for
> >>>>> # transcript c("ENST00000347770").
> >>>>> # As I said before, a microRNA can target the same transcript several
> >>>>> # times, so twice the number of GO terms would be associated with this
> >>>>> # particular transcript. Can we force biomart to return redundant GO
> >>>>> # terms ????
> >>>>
> >>>> I'm actually still not sure what you want to do, but if you follow my
> >>>> advice above, you can manipulate the data.frame you get from getBM to
> >>>> replicate rows (or whatever you're trying to do).
> >>>>
> >>>> You will also want to add "ensembl_transcript_id" to your vector of
> >>>> attributes so you can reassociate the rows in the table that is
> >>>> returned to you with the original ensembl transcripts you are
> >>>> querying for, eg:
> >>>>
> >>>> R> gomir<- getBM(attributes=c('ensembl_transcript_id', 'go..', ...),
> >>>>      filters='ensembl_transcript_id', values=c("ENST..."), mart=mart)
> >>>>
> >>>> Hope that helps,
> >>>> -steve
> >>>>
> >>>
> >>>
> _______________________________________________
> >>> Bioconductor mailing list
> >>> Bioconductor at r-project.org
> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>> Search the archives:
> >>> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>>
> >>
> >>
> >
> >
> 
>



