[BioC] A basic question about retrieve the absolute genome position from the output of of function matchPWM .

Sun Oct 28 03:54:39 CET 2012

Hi, Steve,
Thank you.

  49449428+999-354 =49450073
  49449428+999-347= 49450080
By searching

chr3:49450073-49450080 I got [GGTGAGGC], which is reversely complementary to [AAGCCTCA] as expected.

> reverseComplement(DNAString("GGTGAGGC"))
   8-letter "DNAString" instance
seq: CAGCCTCA

  More question is, to screen the "promoter" region, what should I do?
Should I focus on the "upstream" of "positive" strand? Does it equally to search the "downstream" of "negative" strand, if the program reports sequence of only negative strain?

Thanks again,
Holly

On 10/27/2012 6:19 PM, Steve Lianoglou wrote:

> Hi Holly,
>
> On Sat, Oct 27, 2012 at 7:23 PM, Holly <xyang2 at uchicago.edu> wrote:
>> Dear all,
>> I am learning how to "Finding Candidate Binding Sites for Known
>> Transcription Factors via Sequence Matching" from the Bioconductor
>> online help file at
>> http://www.bioconductor.org/help/workflows/gene-regulation-tfbs/
>>
>> When I ran the following codes, I can not match the results back to the
>> UCSC genome browser results.
>>
>> ||>|orfs <- as.character(mget("RHOA", org.Hs.egSYMBOL2EG))
>> ||||>||  pfm.grhl <- query(MotifDb,"GRHL")[[1]]
>>> pcm.grhl <- round(100 * pfm.grhl)
>>> chromosomal.loc <-  transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by="gene") [orfs]
>>> promoter.seqs <-  getPromoterSeq(chromosomal.loc, Hsapiens, upstream=1000, downstream=0)
>>> promoter.seqs <- unlist(promoter.seqs)
>>> matchPWM(pcm.grhl, promoter.seqs[[1]], "80%")
>>    Views on a 1000-letter DNAString subject
>> subject: AGCCCAGGTCAGGTTAGAGACCACTGGTATTGTT...GGCACTCGGAGGCGCGCACGTCGTTCCCCGCCCT
>> views:
>>       start end width
>> [1]   347 354     8 [AAGCCTCA]
>>
>>> chromosomal.loc
>> GRangesList of length 1:
>> $387
>> GRanges with 2 ranges and 2 metadata columns:
>>         seqnames               ranges strand |     tx_id     tx_name
>>            <Rle>            <IRanges>  <Rle> | <integer> <character>
>>     [1]     chr3 [49396579, 49449428]      - |     15517  uc010hku.3
>>     [2]     chr3 [49396579, 49449526]      - |     15518  uc003cwu.3
>>
>> ---
>> seqlengths:
>>                     chr1                  chr2 ...        chrUn_gl000249
>>                249250621             243199373 ...                 38502
>>
>> As|||49396579-1000+347=49395926, I use it as an absolute position.|
>> However, when I visit the UCSC genome browser, for the region chr3:||||49395926||-||||49395934.||
>>    I got [GTGCTCCTC], instead of [|||AAGCCTCA] or similar.|
>>
>> So, could anyone tell me how to calculate the absolute genome  position from the output of of function matchPWM ?.
> Beware of the evil strand.
>
> This transcript is on the negative strand, so the 1000bp upstream of
> the first transcript you've listed starts at its end, namely
> `(end+1):(end+999)`
>
> HTH,
> -steve
>