[BioC] promoter prediction

Paul Shannon pshannon at fhcrc.org
Mon Nov 19 05:09:11 CET 2012


Hi Jing,

I am including the Bioconductor email list so that we will have a record of your question, and the answers we arrive at.
On Nov 18, 2012, at 5:32 PM, Jing Huang wrote:

> Hi Paul,
> 
> I am wondering if this would be doable. I have a few genes that form a
> complex. They have been seen to be overexpressed simultaneously in a
> variety of tumors.
> 
Do you hypothesize that their joint over-expression suggests that they have common regulators?

> The package that you generated seems to fit the scenario of predicting
> the match between known transcription factors and genes. I would like to
> predict the transcription factors that are unknown.

One good approach here would be to find candidate regulatory regions for each of the members of your complex.  Bioc now has a getPromoterSeq method (in the GenomicFeatures package), demonstrated at http://bioconductor.org/help/workflows/gene-regulation-tfbs/.  The rGADEM package finds motifs de novo when given a number of sequences, but this can be an expensive and inconclusive search when your sequences are long and your genes are few.
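
A minimal sketch of that first step, assuming the hg19 annotation and genome packages are installed. The two Entrez IDs are arbitrary stand-ins (TP53 and BRCA1), not the genes in question, and the 1000-base upstream window is only a starting guess:

```r
library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(BSgenome.Hsapiens.UCSC.hg19)

# Placeholder Entrez gene IDs (TP53, BRCA1); substitute your own.
entrez.ids <- c("7157", "672")

# Transcripts grouped by gene; the list names are Entrez gene IDs.
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
transcripts.by.gene <- transcriptsBy(txdb, by = "gene")[entrez.ids]

# One set of sequences per gene, one sequence per annotated transcript.
# The upstream/downstream values are assumptions; adjust them as needed.
promoter.seqs <- getPromoterSeq(transcripts.by.gene, Hsapiens,
                                upstream = 1000, downstream = 0)

length(promoter.seqs)   # one entry per gene
```

The resulting sequences could then be handed to a de novo motif finder such as rGADEM or meme.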

The ENCODE project, and John Stam's group at UW in particular, have produced a lot of new data, including DNase I hypersensitivity regions and footprints, H3K4me methylation profiles, and transcription factor binding sites.  These can narrow your search considerably.  In short, we now know much more than we used to about what and where the regulatory regions proximal to a gene seem to be.   We have just begun prototyping a means to provide easy access in Bioconductor to these kinds of data.

Once you have some candidate transcription factor binding sequences, the MotIV package (and the external program 'tomtom') can match them against known motifs in MotifDb, often identifying transcription factor candidates.
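
To make the idea of motif-to-motif matching concrete, here is a toy sketch in base R. It is not the algorithm MotIV or tomtom uses: it slides the shorter matrix along the longer one without gaps, scores each offset by the mean column-wise Pearson correlation, and ignores reverse complements. The helpers make.pwm and best.similarity are hypothetical, and the "known" motifs are invented, not taken from MotifDb:

```r
# Toy data: build a 4 x n probability matrix (rows A, C, G, T) from a
# consensus string. These motifs are invented for illustration only.
make.pwm <- function(consensus, strength = 0.85) {
  bases <- c("A", "C", "G", "T")
  pwm <- sapply(strsplit(consensus, "")[[1]], function(letter) {
    column <- rep((1 - strength) / 3, 4)
    column[match(letter, bases)] <- strength
    column
  })
  dimnames(pwm) <- list(bases, NULL)
  pwm
}

# Best ungapped alignment score between two matrices: the mean
# column-wise Pearson correlation, maximized over all offsets.
best.similarity <- function(candidate, known) {
  if (ncol(candidate) <= ncol(known)) {
    short <- candidate
    long <- known
  } else {
    short <- known
    long <- candidate
  }
  offsets <- 0:(ncol(long) - ncol(short))
  scores <- sapply(offsets, function(offset) {
    window <- long[, offset + seq_len(ncol(short)), drop = FALSE]
    mean(sapply(seq_len(ncol(short)),
                function(i) cor(short[, i], window[, i])))
  })
  max(scores)
}

# Invented reference motifs standing in for a motif database.
known.motifs <- list(
  toyFactorA = make.pwm("TTTCGCGC"),
  toyFactorB = make.pwm("CACGTG"),
  toyFactorC = make.pwm("TGACTCA")
)

# A candidate motif, as might come out of a de novo search.
candidate <- make.pwm("TTCGCG", strength = 0.7)

scores <- sapply(known.motifs, function(m) best.similarity(candidate, m))
ranked <- sort(scores, decreasing = TRUE)
print(round(ranked, 2))
```

The top-ranked name is the best transcription factor candidate under this crude score; real tools add proper statistics on top of the same basic comparison.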

If you could clarify your question a bit and provide an example -- anonymizing the genes in your complex if need be -- we can try to find specific techniques for you to use.

Please reply 'on-list' so that our discussion can be archived, and so that others with advice can chip in.


 - Paul


> 
> Is there any way this is doable?
> 
> Many many thanks
> 
> Jing
> On 10/8/12 8:38 PM, "Paul Shannon" <pshannon at fhcrc.org> wrote:
> 
>> Hi Jing,
>> 
>> This took WAY too long.
>> 
>> But it is at last ready.  Could you take a look?  Give me comments?
>> 
>>  http://www.bioconductor.org/help/workflows/gene-regulation-tfbs/
>> 
>> Thanks!
>> 
>> - Paul
>> 
>> On Jul 5, 2012, at 3:58 PM, Jing Huang wrote:
>> 
>>> No hurry!
>>> 
>>> Jing
>>> 
>>> -----Original Message-----
>>> From: Paul Shannon [mailto:pshannon at fhcrc.org]
>>> Sent: Thursday, July 05, 2012 3:43 PM
>>> To: Jing Huang
>>> Cc: Paul Shannon
>>> Subject: Re: promoter prediction
>>> 
>>> Hi Jing,
>>> 
>>> Should have something ready by the end of next week.
>>> 
>>> Sorry it's taken so long!
>>> 
>>> - Paul
>>> 
>>> On Jul 5, 2012, at 3:41 PM, Jing Huang wrote:
>>> 
>>>> Hi Paul,
>>>> 
>>>> Are you still going to write the package for promoter prediction? I
>>>> have been very busy with bench work and have not been able to study this.
>>>> 
>>>> It would be nice if you could write the package and present it at the
>>>> BioC12 meeting by the end of this month.
>>>> 
>>>> Jing
>>>> 
>>>> -----Original Message-----
>>>> From: Paul Shannon [mailto:pshannon at fhcrc.org]
>>>> Sent: Tuesday, June 12, 2012 12:53 PM
>>>> To: Jing Huang
>>>> Cc: Paul Shannon
>>>> Subject: Re: promoter prediction
>>>> 
>>>> Cool!   
>>>> 
>>>> On Jun 12, 2012, at 12:46 PM, Jing Huang wrote:
>>>> 
>>>>> Figured it out on this one.
>>>>> 
>>>>> Jing
>>>>> 
>>>>> On 6/12/12 11:51 AM, "Paul Shannon" <pshannon at fhcrc.org> wrote:
>>>>> 
>>>>>> It's an odd error.
>>>>>> 
>>>>>> Try this:
>>>>>> 
>>>>>> ?load
>>>>>> ?save
>>>>>> 
>>>>>> Once you understand them, ask yourself, hmmm, what could be wrong
>>>>>> here?
>>>>>> 
>>>>>> (I am trying to teach you to fish, rather than just GIVE you fish!)
>>>>>> 
>>>>>> - Paul
>>>>>> 
>>>>>> On Jun 12, 2012, at 11:48 AM, Jing Huang wrote:
>>>>>> 
>>>>>>> Hi Paul,
>>>>>>> 
>>>>>>> What does this mean?
>>>>>>> 
>>>>>>>> if (!exists ('e2f3'))
>>>>>>> +   load ('symbolsToGeneIDs.RData', envir=.GlobalEnv)
>>>>>>> Error: segfault from C stack overflow
>>>>>>> 
>>>>>>> Many Thanks
>>>>>>> 
>>>>>>> Jing
>>>>>>> 
>>>>>>> From: Paul Shannon <pshannon at fhcrc.org>
>>>>>>> To: Jing Huang <huangji at ohsu.edu>
>>>>>>> Cc: Paul Shannon <pshannon at fhcrc.org>
>>>>>>> Subject: Re: promoter prediction
>>>>>>> 
>>>>>>> Hi Jing,
>>>>>>> 
>>>>>>> Learning to install software is a good skill to acquire.  It's a
>>>>>>> basic part of any bioinformatician's work!
>>>>>>> 
>>>>>>> If you look at this page:
>>>>>>> 
>>>>>>> http://meme.sdsc.edu/meme/meme-download.html
>>>>>>> 
>>>>>>> You will see a link to 'installation instructions'.  That would be a
>>>>>>> good place to begin.
>>>>>>> 
>>>>>>> I apologize, I forgot to include this file.  Put it in your working
>>>>>>> directory:
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Treat each puzzle you encounter as an opportunity to learn!
>>>>>>> 
>>>>>>> - Paul
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Jun 12, 2012, at 9:08 AM, Jing Huang wrote:
>>>>>>> 
>>>>>>>> HI Paul,
>>>>>>>> 
>>>>>>>> I am having trouble downloading MEME. I guess I am not sure what to
>>>>>>>> download. In order to run MEME, it seems that Perl or Python is
>>>>>>>> required? I don't have any knowledge of those.
>>>>>>>> 
>>>>>>>> I have tried to run your scripts and ran into errors:
>>>>>>>> 
>>>>>>>>> if (!exists ('e2f3'))
>>>>>>>> +   load ('symbolsToGeneIDs.RData', envir=.GlobalEnv)
>>>>>>>> Error in readChar(con, 5L, useBytes = TRUE) : cannot open the connection
>>>>>>>> In addition: Warning message:
>>>>>>>> In readChar(con, 5L, useBytes = TRUE) :
>>>>>>>>   cannot open compressed file 'symbolsToGeneIDs.RData', probable reason 'No such file or directory'
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Not sure what this means. I am wondering what else needs to be
>>>>>>>> installed on my computer.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Many thanks
>>>>>>>> 
>>>>>>>> Jing
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> From: Paul Shannon <pshannon at fhcrc.org>
>>>>>>>> To: Jing Huang <huangji at ohsu.edu>
>>>>>>>> Cc: Paul Shannon <pshannon at fhcrc.org>
>>>>>>>> Subject: Re: promoter prediction
>>>>>>>> 
>>>>>>>> Hi Jing,
>>>>>>>> 
>>>>>>>> My boss has some other plans for me this week :} so I am sending this
>>>>>>>> to you tonight, giving you (I think) plenty to work on, to study, and
>>>>>>>> to comprehend.
>>>>>>>> 
>>>>>>>> What I include below is all you need for finding enriched motifs in
>>>>>>>> the promoters of your genes.
>>>>>>>> 
>>>>>>>> What is NOT included is finding out the transcription factors which
>>>>>>>> match those motifs.  Learn all of what's here, then you will be ready
>>>>>>>> for MotIV and my new MotifDb -- which should be ready to use by the
>>>>>>>> end of the week.
>>>>>>>> 
>>>>>>>> There is one file attached, a somewhat improvised R script.  It runs,
>>>>>>>> but it is not in a style you should emulate.  But there's lots to
>>>>>>>> learn if you study it, line by line, until everything makes complete
>>>>>>>> sense to you.  Please do that!
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Here's how to run the script:
>>>>>>>> 1) Install all the libraries mentioned in the file.  For instance:
>>>>>>>>    biocLite (c ('org.Hs.eg.db', 'BSgenome.Hsapiens.UCSC.hg19',
>>>>>>>>                 'GenomicFeatures', 'TxDb.Hsapiens.UCSC.hg19.knownGene'))
>>>>>>>> 2) Install meme; fix the path to meme in the script so that it
>>>>>>>>    matches where the meme executable is on your computer.
>>>>>>>> 3) source ('go.R'); run ('redo')
>>>>>>>> 
>>>>>>>> meme takes maybe 20 minutes to run on my laptop.
>>>>>>>> 
>>>>>>>> Having found these motifs, the next step is to use tomtom, or
>>>>>>>> (better yet) the Bioconductor package MotIV and my new MotifDb.
>>>>>>>> Be aware:  the p-values of these enrichments are not very strong.
>>>>>>>> 
>>>>>>>> Please study the script, run meme, and get really familiar with it
>>>>>>>> all.  Send me questions if you have them.  Then run MotIV with its
>>>>>>>> built-in jaspar matrices, comparing the enriched motifs meme found
>>>>>>>> to the jaspar matrices.
>>>>>>>> 
>>>>>>>> - Paul
>>>>>>>> 
>>>>>>>> <PastedGraphic-1.png>
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Jun 8, 2012, at 2:48 PM, Jing Huang wrote:
>>>>>>>> 
>>>>>>>>> Hi Paul,
>>>>>>>>> 
>>>>>>>>> Here is the list, but only for you: MCM2, MCM3, MCM4, MCM5, MCM6,
>>>>>>>>> MCM7, MCM8. The corresponding Entrez IDs are
>>>>>>>>> 4171, 4172, 4173, 4174, 4175, 4176, 84515.
>>>>>>>>> 
>>>>>>>>> I will play with the meme as your email suggested.
>>>>>>>>> 
>>>>>>>>> Have a nice weekend
>>>>>>>>> 
>>>>>>>>> Jing
>>>>>>>>> 
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Paul Shannon [mailto:pshannon at fhcrc.org]
>>>>>>>>> Sent: Friday, June 08, 2012 2:40 PM
>>>>>>>>> To: Jing Huang
>>>>>>>>> Cc: Paul Shannon
>>>>>>>>> Subject: Re: promoter prediction
>>>>>>>>> 
>>>>>>>>> Well, two promoters are not enough of a sample in which to find
>>>>>>>>> motif enrichments.  I'll dredge up an example dataset from elsewhere.
>>>>>>>>> 
>>>>>>>>> In preparation, you could install meme, and see if you can adapt
>>>>>>>>> the 'get.promoter' function I sent you, for arabidopsis, to human.
>>>>>>>>> 
>>>>>>>>> I will have a human demo ready mid-week next week.
>>>>>>>>> 
>>>>>>>>> - Paul
>>>>>>>>> 
>>>>>>>>> On Jun 8, 2012, at 2:36 PM, Jing Huang wrote:
>>>>>>>>> 
>>>>>>>>>> I don't remember what the inputs were. Somebody posted a question
>>>>>>>>>> about the package to our mailing group, and I saw it and played with
>>>>>>>>>> it a little bit.
>>>>>>>>>> 
>>>>>>>>>> The list of genes is confidential. How about I give you only two of
>>>>>>>>>> them, MCM2 and MCM3? The corresponding Entrez IDs are 4171 and 4172.
>>>>>>>>>> 
>>>>>>>>>> I hope this is enough information.
>>>>>>>>>> 
>>>>>>>>>> Jing
>>>>>>>>>> 
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Paul Shannon [mailto:pshannon at fhcrc.org]
>>>>>>>>>> Sent: Friday, June 08, 2012 2:19 PM
>>>>>>>>>> To: Jing Huang
>>>>>>>>>> Cc: Paul Shannon
>>>>>>>>>> Subject: Re: promoter prediction
>>>>>>>>>> 
>>>>>>>>>> Hi Jing,
>>>>>>>>>> 
>>>>>>>>>> Do you know what inputs are used for the package you are trying to
>>>>>>>>>> remember?  I cannot think what it would be.
>>>>>>>>>> 
>>>>>>>>>> Also (I asked this before :}) do you have a list of specific
>>>>>>>>>> co-regulated genes?  Are they confidential?  If not, please send me
>>>>>>>>>> that list.
>>>>>>>>>> 
>>>>>>>>>> - Paul
>>>>>>>>>> 
>>>>>>>>>> On Jun 8, 2012, at 2:16 PM, Jing Huang wrote:
>>>>>>>>>> 
>>>>>>>>>>> HI Paul,
>>>>>>>>>>> 
>>>>>>>>>>> I am still studying a few packages related to predicting shared
>>>>>>>>>>> transcription factors, and waiting for your new advanced package to
>>>>>>>>>>> be released.
>>>>>>>>>>> 
>>>>>>>>>>> There is a BioC package that allows me to predict promoters. I have
>>>>>>>>>>> played with it but don't remember the name of the package. Do you
>>>>>>>>>>> know of such a package, by any chance?
>>>>>>>>>>> 
>>>>>>>>>>> Many thanks
>>>>>>>>>>> 
>>>>>>>>>>> Jing
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 


