[BioC] promoter prediction

Mon Nov 19 06:02:50 CET 2012

Many thanks Paul.

Let me first follow your advice and do some investigating over the
holiday. Then I will be able to define the question more specifically.

Jing 

On 11/18/12 8:09 PM, "Paul Shannon" <pshannon at fhcrc.org> wrote:

>Hi Jing,
>
>I am including the Bioconductor email list so that we will have a record
>of your question, and the answers we arrive at.
>On Nov 18, 2012, at 5:32 PM, Jing Huang wrote:
>
>> Hi Paul,
>> 
>> I am wondering if this would be doable. I have a few genes that form a
>> complex. They have been seen to be over-expressed simultaneously in a
>> variety of tumors.
>> 
>Do you hypothesize that their joint over-expression suggests that they
>have common regulators?
>
>> The package that you generated seems to fit the scenario of predicting
>> matches between known transcription factors and genes. I would like to
>> predict the transcription factors that are unknown.
>
>One good approach here would be to find candidate regulatory regions for
>each of the members of your complex.  Bioc now has a getPromoterSeq
>method, demonstrated at
>http://bioconductor.org/help/workflows/gene-regulation-tfbs/.  The rGADEM
>package finds motifs de novo when given a number of sequences, but this
>can be an expensive and inconclusive search when your sequences are long,
>and if your genes are few.
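
A "candidate regulatory region" in the simplest case is a window of sequence around each gene's transcription start site (TSS). As a minimal base-R sketch of how such a window can be computed from a TSS and a strand: `promoter_window` is a hypothetical helper, not a Bioconductor function, and the gene names and coordinates are invented for illustration.

```r
# Hypothetical helper: compute a promoter window around a TSS.
# 'upstream' and 'downstream' are measured in the direction of
# transcription, so the window is mirrored for minus-strand genes.
promoter_window <- function(tss, strand, upstream = 1000, downstream = 0) {
  stopifnot(length(tss) == length(strand), all(strand %in% c("+", "-")))
  plus  <- strand == "+"
  start <- ifelse(plus, tss - upstream, tss - downstream + 1)
  end   <- ifelse(plus, tss + downstream - 1, tss + upstream)
  data.frame(start = start, end = end, strand = strand)
}

# Invented example genes (not real coordinates).
genes <- data.frame(gene   = c("geneA", "geneB"),
                    tss    = c(50000, 80000),
                    strand = c("+", "-"))

windows <- cbind(genes["gene"], promoter_window(genes$tss, genes$strand))
print(windows)
```

Each window is `upstream + downstream` bases wide. With real data, the sequence for these coordinates would then be pulled from a genome package, which is the job the getPromoterSeq method mentioned above is described as doing.
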
>
>The ENCODE project, and John Stam's group at UW in particular, have
>produced a lot of new data, including DNase1 hypersensitivity regions and
>footprints, and H3K4me methylation profiles, and transcription factor
>binding sites.  These can narrow your search considerably.  In short, we
>now know much more than we used to about what and where the regulatory
>regions proximal to a gene seem to be.   We have just begun prototyping a
>means to provide easy access in Bioconductor to these kinds of data.
>
>Once you have some candidate transcription factor binding sequences, the
>MotIV package (and the external program 'tomtom') can match them against
>known motifs in MotifDb, often identifying transcription factor candidates.
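
To make "matching against known motifs" concrete, here is a toy base-R sketch. It scores a query matrix against a small library by plain Pearson correlation. This is a simplification chosen for illustration, not the method used by MotIV or tomtom (those also handle offsets between motifs of different widths and report significance). The motif names, the consensus strings, and the `make_pfm` helper are all invented.

```r
# Build a toy position frequency matrix from a consensus string.
# Rows are A, C, G, T; each column is one motif position.
make_pfm <- function(consensus, strength = 0.85) {
  bases <- c("A", "C", "G", "T")
  letters_in <- strsplit(consensus, "")[[1]]
  m <- matrix((1 - strength) / 3, nrow = 4, ncol = length(letters_in),
              dimnames = list(bases, NULL))
  # Put most of the probability mass on the consensus base per column.
  m[cbind(match(letters_in, bases), seq_along(letters_in))] <- strength
  m
}

# A made-up "database" of known motifs, all of the same width.
known <- list(motifX = make_pfm("TTTCGCGC"),
              motifY = make_pfm("CACGTGAA"),
              motifZ = make_pfm("GGGATTAC"))

# A made-up de novo motif: same consensus as motifX, but weaker.
query <- make_pfm("TTTCGCGC", strength = 0.7)

# Score each known motif by correlation of the flattened matrices.
scores <- vapply(known,
                 function(m) cor(as.vector(query), as.vector(m)),
                 numeric(1))
best <- names(which.max(scores))
print(sort(scores, decreasing = TRUE))
```

The best-scoring entry is the candidate whose transcription factor you would then look up. Real tools replace the correlation with a properly calibrated similarity score.
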
>
>If you could clarify your question a bit, provide an example --
>anonymizing the genes in your complex if need be -- we can try and find
>specific techniques for you to use.
>
>Please reply 'on-list' so that our discussion can be archived, and so
>that others with advice can chip in.
>
>
> - Paul
>
>
>> 
>> Is there any way this is doable?
>> 
>> Many many thanks
>> 
>> Jing
>> On 10/8/12 8:38 PM, "Paul Shannon" <pshannon at fhcrc.org> wrote:
>> 
>>> Hi Jing,
>>> 
>>> This took WAY too long.
>>> 
>>> But it is at last ready.  Could you take a look?  Give me comments?
>>> 
>>>  http://www.bioconductor.org/help/workflows/gene-regulation-tfbs/
>>> 
>>> Thanks!
>>> 
>>> - Paul
>>> 
>>> On Jul 5, 2012, at 3:58 PM, Jing Huang wrote:
>>> 
>>>> No hurry!
>>>> 
>>>> Jing
>>>> 
>>>> -----Original Message-----
>>>> From: Paul Shannon [mailto:pshannon at fhcrc.org]
>>>> Sent: Thursday, July 05, 2012 3:43 PM
>>>> To: Jing Huang
>>>> Cc: Paul Shannon
>>>> Subject: Re: promoter prediction
>>>> 
>>>> Hi Jing,
>>>> 
>>>> Should have something ready by the end of next week.
>>>> 
>>>> Sorry it's taken so long!
>>>> 
>>>> - Paul
>>>> 
>>>> On Jul 5, 2012, at 3:41 PM, Jing Huang wrote:
>>>> 
>>>>> Hi Paul,
>>>>> 
>>>>> Are you still going to write the package for promoter prediction? I
>>>>> have been very busy with bench work and have not been able to study this.
>>>>> 
>>>>> It would be nice if you could write the package and present it at the
>>>>> BioC12 meeting at the end of this month.
>>>>> 
>>>>> Jing
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Paul Shannon [mailto:pshannon at fhcrc.org]
>>>>> Sent: Tuesday, June 12, 2012 12:53 PM
>>>>> To: Jing Huang
>>>>> Cc: Paul Shannon
>>>>> Subject: Re: promoter prediction
>>>>> 
>>>>> Cool!   
>>>>> 
>>>>> On Jun 12, 2012, at 12:46 PM, Jing Huang wrote:
>>>>> 
>>>>>> Figured it out on this one.
>>>>>> 
>>>>>> Jing
>>>>>> 
>>>>>> On 6/12/12 11:51 AM, "Paul Shannon" <pshannon at fhcrc.org> wrote:
>>>>>> 
>>>>>>> It's an odd error.
>>>>>>> 
>>>>>>> Try this:
>>>>>>> 
>>>>>>> ?load
>>>>>>> ?save
>>>>>>> 
>>>>>>> Once you understand them, ask yourself, hmmm, what could be wrong
>>>>>>> here?
>>>>>>> 
>>>>>>> (I am trying to teach you to fish, rather than just GIVE you fish!)
>>>>>>> 
>>>>>>> - Paul
>>>>>>> 
>>>>>>> On Jun 12, 2012, at 11:48 AM, Jing Huang wrote:
>>>>>>> 
>>>>>>>> Hi Paul,
>>>>>>>> 
>>>>>>>> What does this mean?
>>>>>>>> 
>>>>>>>>> if (!exists ('e2f3'))
>>>>>>>> +   load ('symbolsToGeneIDs.RData', envir=.GlobalEnv)
>>>>>>>> Error: segfault from C stack overflow
>>>>>>>> 
>>>>>>>> Many Thanks
>>>>>>>> 
>>>>>>>> Jing
>>>>>>>> 
>>>>>>>> From: Paul Shannon <pshannon at fhcrc.org>
>>>>>>>> To: Jing Huang <huangji at ohsu.edu>
>>>>>>>> Cc: Paul Shannon <pshannon at fhcrc.org>
>>>>>>>> Subject: Re: promoter prediction
>>>>>>>> 
>>>>>>>> Hi Jing,
>>>>>>>> 
>>>>>>>> Installing software is a good thing to learn.  It's a basic part
>>>>>>>> of any bioinformatician's work!
>>>>>>>> 
>>>>>>>> If you look at this page:
>>>>>>>> 
>>>>>>>> http://meme.sdsc.edu/meme/meme-download.html
>>>>>>>> 
>>>>>>>> You will see a link to 'installation instructions'.  That would be
>>>>>>>> a good place to begin.
>>>>>>>> 
>>>>>>>> I apologize, I forgot to include this file.  Put it in your working
>>>>>>>> directory:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Treat each puzzle you encounter as an opportunity to learn!
>>>>>>>> 
>>>>>>>> - Paul
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Jun 12, 2012, at 9:08 AM, Jing Huang wrote:
>>>>>>>> 
>>>>>>>>> Hi Paul,
>>>>>>>>> 
>>>>>>>>> I am having trouble downloading MEME. I guess I am not sure what to
>>>>>>>>> download. To run MEME, it seems that Perl or Python is required?
>>>>>>>>> I don't have any knowledge of those.
>>>>>>>>> 
>>>>>>>>> I have tried to run your scripts and run into errors:
>>>>>>>>> 
>>>>>>>>>> if (!exists ('e2f3'))
>>>>>>>>> +   load ('symbolsToGeneIDs.RData', envir=.GlobalEnv)
>>>>>>>>> Error in readChar(con, 5L, useBytes = TRUE) : cannot open the connection
>>>>>>>>> In addition: Warning message:
>>>>>>>>> In readChar(con, 5L, useBytes = TRUE) :
>>>>>>>>> cannot open compressed file 'symbolsToGeneIDs.RData', probable reason 'No such file or directory'
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I am not sure what this means. I am wondering what else needs to be
>>>>>>>>> installed on my computer.
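
The warning quoted above ends with 'No such file or directory': R looked for symbolsToGeneIDs.RData in the current working directory (see `getwd()`) and did not find it, so nothing else needed installing for that step. A small base-R sketch of the save/load round trip follows. It runs in a temporary directory, and the object's contents are a placeholder, since the real file's contents are not shown in this thread.

```r
# Work in a temporary directory so no real files are touched.
work_dir <- tempfile("demo_")
dir.create(work_dir)
rdata <- file.path(work_dir, "symbolsToGeneIDs.RData")

# Before the file exists, load() would fail with
# "cannot open the connection"; checking first avoids that.
file_present_before <- file.exists(rdata)

# save() writes named objects to a file ...
e2f3 <- "placeholder value"   # the real object's contents are unknown here
save(e2f3, file = rdata)
rm(e2f3)

# ... and load() restores them, under the same names, into an environment.
env <- new.env()
if (file.exists(rdata)) {
  loaded_names <- load(rdata, envir = env)
}
print(loaded_names)
```

Printing `getwd()` and `list.files()` before calling `load()` is usually the quickest way to see whether the file is where R expects it.
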
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Many thanks
>>>>>>>>> 
>>>>>>>>> Jing
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> From: Paul Shannon <pshannon at fhcrc.org>
>>>>>>>>> To: Jing Huang <huangji at ohsu.edu>
>>>>>>>>> Cc: Paul Shannon <pshannon at fhcrc.org>
>>>>>>>>> Subject: Re: promoter prediction
>>>>>>>>> 
>>>>>>>>> Hi Jing,
>>>>>>>>> 
>>>>>>>>> My boss has some other plans for me this week :} so I am sending this
>>>>>>>>> to you tonight, giving you (I think) plenty to work on, to study, and
>>>>>>>>> to comprehend.
>>>>>>>>> 
>>>>>>>>> What I include below is all you need for finding enriched motifs in
>>>>>>>>> the promoters of your genes.
>>>>>>>>> 
>>>>>>>>> What is NOT included is finding out the transcription factors which
>>>>>>>>> match those motifs.  Learn all of what's here, then you will be ready
>>>>>>>>> for MotIV and my new MotifDb -- which should be ready to use by the
>>>>>>>>> end of the week.
>>>>>>>>> 
>>>>>>>>> There is one file attached, a somewhat improvised R script.  It runs,
>>>>>>>>> but it is not in a style you should emulate.  But there's lots to
>>>>>>>>> learn if you study it, line by line, until everything makes complete
>>>>>>>>> sense to you.  Please do that!
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Here's how to run the script
>>>>>>>>> 1) Install all the libraries mentioned in the file.  For instance:
>>>>>>>>>    biocLite (c ('org.Hs.eg.db', 'BSgenome.Hsapiens.UCSC.hg19',
>>>>>>>>>               'GenomicFeatures', 'TxDb.Hsapiens.UCSC.hg19.knownGene'))
>>>>>>>>> 2) Install meme; fix the path to meme in the script so that it
>>>>>>>>>    matches where the meme executable is on your computer
>>>>>>>>> 3) source ('go.R'); run ('redo')
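
Before sourcing the script, it can save time to check steps 1 and 2 from within R. The sketch below uses only base R: it lists which of the packages named above are still missing, and asks whether an executable called `meme` is on the PATH. The executable name is an assumption, since the script's own meme path is not shown here.

```r
# Packages named in step 1 of the instructions.
needed <- c("org.Hs.eg.db", "BSgenome.Hsapiens.UCSC.hg19",
            "GenomicFeatures", "TxDb.Hsapiens.UCSC.hg19.knownGene")

# Compare against what is installed in the current R library paths.
missing_pkgs <- setdiff(needed, rownames(installed.packages()))
if (length(missing_pkgs) > 0) {
  message("Still to install: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All required packages are installed.")
}

# Sys.which() returns "" when the program is not found on the PATH.
# 'meme' as the executable name is an assumption.
meme_path <- Sys.which("meme")
if (nzchar(meme_path)) {
  message("meme found at: ", meme_path)
} else {
  message("meme not found on PATH; set its full path in the script.")
}
```

On current Bioconductor installations the missing packages would be installed with `BiocManager::install()`, which has replaced `biocLite()`.
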
>>>>>>>>> 
>>>>>>>>> meme takes maybe 20 minutes to run on my laptop.
>>>>>>>>> 
>>>>>>>>> Having found these motifs, the next step is to use tom-tom, or
>>>>>>>>> (better yet) the Bioconductor package MotIV and my new MotifDb.
>>>>>>>>> Be aware:  the p-values of these enrichments are not very strong.
>>>>>>>>> 
>>>>>>>>> Please study the script, run meme, and get really familiar with it
>>>>>>>>> all.  Send me questions if you have them.  Then run MotIV with the
>>>>>>>>> built-in jaspar matrices, comparing the enriched motifs meme found to
>>>>>>>>> the jaspar matrices.
>>>>>>>>> 
>>>>>>>>> - Paul
>>>>>>>>> 
>>>>>>>>> <PastedGraphic-1.png>
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Jun 8, 2012, at 2:48 PM, Jing Huang wrote:
>>>>>>>>> 
>>>>>>>>>> Hi Paul,
>>>>>>>>>> 
>>>>>>>>>> Here is the list, but only for you: MCM2, MCM3, MCM4, MCM5, MCM6,
>>>>>>>>>> MCM7, MCM8. The corresponding Entrez IDs are 4171, 4172, 4173, 4174,
>>>>>>>>>> 4175, 4176, 84515.
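
In R, a symbol-to-ID list like this one is conveniently held as a named character vector, so that lookups work in both directions. A minimal base-R sketch, using exactly the symbols and IDs given above (the variable name `mcm_ids` is made up):

```r
# Gene symbols as names, Entrez gene IDs as values (kept as strings,
# since IDs are labels, not numbers to do arithmetic on).
mcm_ids <- c(MCM2 = "4171", MCM3 = "4172", MCM4 = "4173", MCM5 = "4174",
             MCM6 = "4175", MCM7 = "4176", MCM8 = "84515")

# Symbol -> ID lookup.
ids_for_two <- mcm_ids[c("MCM2", "MCM8")]

# ID -> symbol lookup.
symbol_for_4174 <- names(mcm_ids)[mcm_ids == "4174"]

print(ids_for_two)
print(symbol_for_4174)
```
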
>>>>>>>>>> 
>>>>>>>>>> I will play with the meme as your email suggested.
>>>>>>>>>> 
>>>>>>>>>> Have a nice weekend
>>>>>>>>>> 
>>>>>>>>>> Jing
>>>>>>>>>> 
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Paul Shannon [mailto:pshannon at fhcrc.org]
>>>>>>>>>> Sent: Friday, June 08, 2012 2:40 PM
>>>>>>>>>> To: Jing Huang
>>>>>>>>>> Cc: Paul Shannon
>>>>>>>>>> Subject: Re: promoter prediction
>>>>>>>>>> 
>>>>>>>>>> Well, two promoters are not enough of a sample in which to find
>>>>>>>>>> motif enrichments.  I'll dredge up an example dataset from elsewhere.
>>>>>>>>>> 
>>>>>>>>>> In preparation, you could install meme, and see if you can adapt
>>>>>>>>>> the 'get.promoter' function I sent you, for arabidopsis, to human.
>>>>>>>>>> 
>>>>>>>>>> I will have a human demo ready mid-week next week.
>>>>>>>>>> 
>>>>>>>>>> - Paul
>>>>>>>>>> 
>>>>>>>>>> On Jun 8, 2012, at 2:36 PM, Jing Huang wrote:
>>>>>>>>>> 
>>>>>>>>>>> I don't remember what the inputs were. Somebody posted a question
>>>>>>>>>>> about the package to our mailing group; I saw it and played with it
>>>>>>>>>>> a little bit.
>>>>>>>>>>> 
>>>>>>>>>>> The list of genes is confidential. How about I give you only two of
>>>>>>>>>>> them, MCM2 and MCM3? The corresponding Entrez IDs are 4171 and 4172.
>>>>>>>>>>> 
>>>>>>>>>>> I hope this is enough information.
>>>>>>>>>>> 
>>>>>>>>>>> Jing
>>>>>>>>>>> 
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: Paul Shannon [mailto:pshannon at fhcrc.org]
>>>>>>>>>>> Sent: Friday, June 08, 2012 2:19 PM
>>>>>>>>>>> To: Jing Huang
>>>>>>>>>>> Cc: Paul Shannon
>>>>>>>>>>> Subject: Re: promoter prediction
>>>>>>>>>>> 
>>>>>>>>>>> Hi Jing,
>>>>>>>>>>> 
>>>>>>>>>>> Do you know what inputs are used for the package you are trying to
>>>>>>>>>>> remember?  I cannot think what it would be.
>>>>>>>>>>> 
>>>>>>>>>>> Also (I asked this before :}) do you have a list of specific
>>>>>>>>>>> co-regulated genes?  Are they confidential?  If not, please send me
>>>>>>>>>>> that list.
>>>>>>>>>>> 
>>>>>>>>>>> - Paul
>>>>>>>>>>> 
>>>>>>>>>>> On Jun 8, 2012, at 2:16 PM, Jing Huang wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>> 
>>>>>>>>>>>> I am still studying a few packages for predicting shared
>>>>>>>>>>>> transcription factors, and waiting for your new, more advanced
>>>>>>>>>>>> package to be released.
>>>>>>>>>>>> 
>>>>>>>>>>>> There is a BioC package that allows me to predict promoters. I have
>>>>>>>>>>>> played with it but don't remember its name. Do you know of such a
>>>>>>>>>>>> package, by any chance?
>>>>>>>>>>>> 
>>>>>>>>>>>> Many thanks
>>>>>>>>>>>> 
>>>>>>>>>>>> Jing
>>>>>>>>>>>> 