[BioC] A basic question about retrieve the absolute genome position from the output of of function matchPWM .

Mon Oct 29 16:08:22 CET 2012

Hi Holly,

Proposed simple solution inline below…

 -  Paul

On Oct 28, 2012, at 10:42 AM, Holly wrote:

> Paul,
> Your demon is much clear for a beginner. Thanks a lot.
> Talking back for searching putative binding site, when you say "
> Another thing to keep in mind here is something suggested by Steve a couple of weeks ago: it may be better to think in terms of proximal regulatory regions, a full 5000 bp upstream AND downstream of the transcription start site.  This is especially applicable to metazoans, and to mammals in particular."
> 
> How to get the sequence of, e.g. "100bp downstream of the transcription start site" using Bioconductor packages and R code?
> I appreciate if you can provide such demo code as well.
> 
> As I understand, let 
> TSS =(transcription start site), TTS =(transcription termination site 
> getPromoterSeq(loc, Hsapiens, upstream=0, downstream=10) , we get the sequence TTS:TTS+10
> getPromoterSeq(loc, Hsapiens, upstream=10, downstream=0) , we get the sequence TSS-10:TSS
> Now, we want to look at TSS-10:TSS+10, if I understand you correctly.

Just one small modification to the example code is needed:

  getPromoterSeq(loc, Hsapiens, upstream=10, downstream=10)

> 
> Thank you,
> Holly
> 
> On 10/27/2012 11:56 PM, Paul Shannon wrote:
>> Hi Holly,
>> 
>> See comments and suggestions below.
>> 
>> 
>> On Oct 27, 2012, at 7:54 PM, Holly wrote:
>> 
>> 
>>> > 
>>> Hi, Steve,
>>> 
>>> > 
>>> Thank you.
>>> 
>>> > 
>>> > 
>>> 49449428+999-354 =49450073
>>> 
>>> > 
>>> 49449428+999-347= 49450080
>>> 
>>> > 
>>> By searching
>>> 
>>> > 
>>> > 
>>> chr3:49450073-49450080 I got [GGTGAGGC], which is reversely complementary to [AAGCCTCA] as expected.
>>> 
>>> > 
>>>> >> 
>>>> reverseComplement(DNAString("GGTGAGGC"))
>>>> 
>>> > 
>>>  8-letter "DNAString" instance
>>> 
>>> > 
>>> seq: CAGCCTCA
>>> 
>>> > 
>>> > 
>>> More question is, to screen the "promoter" region, what should I do?
>>> 
>>> > 
>>> Should I focus on the "upstream" of "positive" strand? Does it equally to search the "downstream" of "negative" strand, if the program reports sequence of only negative strain?
>>> 
>> getPromoterSeq is strand-aware in two ways.  For gene transcripts annotated to the minus strand:
>> 
>>    1) It interprets 'upstream' as 'increasing chromosomal coordinates'.
>>    2) It returns the reverse complement of the extracted sequence, which is what you probably want.
>> 
>> In the following, I specify an upstream length of just 10, to make experimentation easier:
>> 
>> library(org.Hs.eg.db)
>> library(MotifDb)
>> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>> library(BSgenome.Hsapiens.UCSC.hg19)
>> 
>> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
>> geneID <-as.character (get ("RHOA", org.Hs.egSYMBOL2EG))
>> loc <- transcriptsBy(txdb, by="gene") [geneID][[1]][1]
>>  GRanges with 1 range and 2 metadata columns:
>>       seqnames               ranges strand |     tx_id     tx_name
>>          <Rle>            <IRanges>  <Rle> | <integer> <character>
>>   [1]     chr3 [49396579, 49449428]      - |     15517  uc010hku.3
>> 
>> getPromoterSeq(loc, Hsapiens, upstream=10, downstream=0)
>>    A DNAStringSet instance of length 1
>>      width seq
>>         10 TCCCCGCCCT
>> 
>> If you point the UCSC genome browser, using genome=hg19, to chr3:49449428:49449437, click the 'complement bases' arrow in the top left, you will see, reading from right to left, and starting at chr3:49449437, the same sequence returned by getPromoterSeq: TCCCCGCCCT.  
>> 
>> Another handy trick is to ask getPromoterSeq to return 20 bases upstream.  This is a long enough sequence to use UCSC BLAT, which will take you right to the proper location in the genome browser.  
>> http://genome.ucsc.edu/cgi-bin/hgBlat?command=start
>> .  BLAT will try to match your input sequence on both strands, and report which one it found.
>> 
>> I believe that all of this demonstrates that getPromoterSeq and its upstream and downstream arguments are relative to the gene transcript's TSS, and are corrected to return a "gene's-eye" view of the sequence.  For a minus strand gene, upstream is in increasing genome coordinates, and the returned sequence has been reverse complemented.  For a plus strand gene, upstream is decreasing genome coordinates, and the returned sequence can be read off the genome from left  to right.  You don't have to worry about strand, nor compensate for it.
>> 
>> Another thing to keep in mind here is something suggested by Steve a couple of weeks ago: it may be better to think in terms of proximal regulatory regions, a full 5000 bp upstream AND downstream of the transcription start site.  This is especially applicable to metazoans, and to mammals in particular.
>> 
>> I hope this helps,
>> 
>>  - Paul
>> 
>> 
>> 
>>> > 
>>> Thanks again,
>>> 
>>> > 
>>> Holly
>>> 
>>> > 
>>> > 
>>> On 10/27/2012 6:19 PM, Steve Lianoglou wrote:
>>> 
>>> > 
>>>> >> 
>>>> Hi Holly,
>>>> 
>>>> >> 
>>>> >> On Sat, Oct 27, 2012 at 7:23 PM, Holly <xyang2 at uchicago.edu>
>>>>  wrote:
>>>> 
>>>>> >>> 
>>>>> Dear all,
>>>>> 
>>>>> >>> 
>>>>> I am learning how to "Finding Candidate Binding Sites for Known
>>>>> 
>>>>> >>> 
>>>>> Transcription Factors via Sequence Matching" from the Bioconductor
>>>>> 
>>>>> >>> 
>>>>> online help file at
>>>>> 
>>>>> >>> http://www.bioconductor.org/help/workflows/gene-regulation-tfbs/
>>>>> >>> 
>>>>> >>> 
>>>>> When I ran the following codes, I can not match the results back to the
>>>>> 
>>>>> >>> 
>>>>> UCSC genome browser results.
>>>>> 
>>>>> >>> 
>>>>> >>> 
>>>>> ||>|orfs <- as.character(mget("RHOA", org.Hs.egSYMBOL2EG))
>>>>> 
>>>>> >>> 
>>>>> ||||>||  pfm.grhl <- query(MotifDb,"GRHL")[[1]]
>>>>> 
>>>>>> >>>> 
>>>>>> pcm.grhl <- round(100 * pfm.grhl)
>>>>>> 
>>>>>> >>>> 
>>>>>> chromosomal.loc <-  transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by="gene") [orfs]
>>>>>> 
>>>>>> >>>> 
>>>>>> promoter.seqs <-  getPromoterSeq(chromosomal.loc, Hsapiens, upstream=1000, downstream=0)
>>>>>> 
>>>>>> >>>> 
>>>>>> promoter.seqs <- unlist(promoter.seqs)
>>>>>> 
>>>>>> >>>> 
>>>>>> matchPWM(pcm.grhl, promoter.seqs[[1]], "80%")
>>>>>> 
>>>>> >>> 
>>>>>   Views on a 1000-letter DNAString subject
>>>>> 
>>>>> >>> 
>>>>> subject: AGCCCAGGTCAGGTTAGAGACCACTGGTATTGTT...GGCACTCGGAGGCGCGCACGTCGTTCCCCGCCCT
>>>>> 
>>>>> >>> 
>>>>> views:
>>>>> 
>>>>> >>> 
>>>>>      start end width
>>>>> 
>>>>> >>> 
>>>>> [1]   347 354     8 [AAGCCTCA]
>>>>> 
>>>>> >>> 
>>>>>> >>>> 
>>>>>> chromosomal.loc
>>>>>> 
>>>>> >>> 
>>>>> GRangesList of length 1:
>>>>> 
>>>>> >>> 
>>>>> $387
>>>>> 
>>>>> >>> 
>>>>> GRanges with 2 ranges and 2 metadata columns:
>>>>> 
>>>>> >>> 
>>>>>        seqnames               ranges strand |     tx_id     tx_name
>>>>> 
>>>>> >>> 
>>>>>           <Rle>            <IRanges>  <Rle> | <integer> <character>
>>>>> 
>>>>> >>> 
>>>>>    [1]     chr3 [49396579, 49449428]      - |     15517  uc010hku.3
>>>>> 
>>>>> >>> 
>>>>>    [2]     chr3 [49396579, 49449526]      - |     15518  uc003cwu.3
>>>>> 
>>>>> >>> 
>>>>> >>> 
>>>>> ---
>>>>> 
>>>>> >>> 
>>>>> seqlengths:
>>>>> 
>>>>> >>> 
>>>>>                    chr1                  chr2 ...        chrUn_gl000249
>>>>> 
>>>>> >>> 
>>>>>               249250621             243199373 ...                 38502
>>>>> 
>>>>> >>> 
>>>>> >>> 
>>>>> As|||49396579-1000+347=49395926, I use it as an absolute position.|
>>>>> 
>>>>> >>> 
>>>>> However, when I visit the UCSC genome browser, for the region chr3:||||49395926||-||||49395934.||
>>>>> 
>>>>> >>> 
>>>>>   I got [GTGCTCCTC], instead of [|||AAGCCTCA] or similar.|
>>>>> 
>>>>> >>> 
>>>>> >>> 
>>>>> So, could anyone tell me how to calculate the absolute genome  position from the output of of function matchPWM ?.
>>>>> 
>>>> >> 
>>>> Beware of the evil strand.
>>>> 
>>>> >> 
>>>> >> 
>>>> This transcript is on the negative strand, so the 1000bp upstream of
>>>> 
>>>> >> 
>>>> the first transcript you've listed starts at its end, namely
>>>> 
>>>> >> 
>>>> `(end+1):(end+999)`
>>>> 
>>>> >> 
>>>> >> 
>>>> HTH,
>>>> 
>>>> >> 
>>>> -steve
>>>> 
>>>> >> 
>>> > 
>>> > 
>>> _______________________________________________
>>> 
>>> > 
>>> Bioconductor mailing list
>>> 
>>> > Bioconductor at r-project.org
>>> > https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>