[BioC] How can I identify the closest transcript from a chromosome coordinate?

Fri Jul 20 04:36:56 CEST 2012

Dear all,

I'm working on a DNA methylation microarray dataset. The microarray design is "pd.feinberg.hg18.me.hx1".

I used the CHARM package to estimate percentage methylation and selected the 1000 probes with the largest variance in methylation level across samples.

The 1000 probes are identified by chromosome coordinates, as follows.

> rnames[1:10]
 [1] "chr1:1707145" "chr1:2148663" "chr1:3133683" "chr1:3180808" "chr1:3294081"
 [6] "chr1:3470900" "chr1:3470969" "chr1:3633816" "chr1:3676205" "chr1:3720637"
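
A minimal sketch of splitting such names into a chromosome and a numeric position, which any distance calculation needs first. It assumes every name has the form "chrN:position"; the data frame name `probes` and its column names are hypothetical, and only the first three names from the output above are used.

```r
# First three probe names copied from the output above
rnames <- c("chr1:1707145", "chr1:2148663", "chr1:3133683")

# Split each "chrN:position" string at the colon
parts <- strsplit(rnames, ":", fixed = TRUE)

# Hypothetical data frame with one row per probe
probes <- data.frame(
  chrom = vapply(parts, `[`, character(1), 1),
  pos   = as.integer(vapply(parts, `[`, character(1), 2)),
  stringsAsFactors = FALSE
)
```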

Now I want to look at the expression of the genes near these 1000 probes and check the correlation between gene expression and DNA methylation.

I loaded the human genome transcript information from UCSC and extracted the features of all transcripts as follows.

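library(GenomicFeatures)
# Load the saved hg18 transcript database, then list every transcript per gene
hg18KG<-loadFeatures("hg18_UCSC.sqlite")
tbl_tx<-select(hg18KG,keys(hg18KG,"GENEID"),cols=c("GENEID","TXNAME","TXCHROM","TXSTRAND","TXSTART","TXEND"),keytype="GENEID")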
> tbl_tx[1:10,]
   GENEID     TXNAME TXCHROM TXSTRAND   TXSTART     TXEND
1       1 uc002qsd.2   chr19        -  63549984  63556677
2       1 uc002qsf.1   chr19        -  63551644  63565932
3      10 uc003wyw.1    chr8        +  18293035  18303003
4      10 uc010lte.1    chr8        +  18301794  18302666
5     100 uc002xmj.1   chr20        -  42681577  42713790
6     100 uc010ggt.1   chr20        -  42681577  42713790
7    1000 uc002kwg.1   chr18        -  23784933  24011189
8   10000 uc001iaa.2    chr1        - 241731689 241733518
9   10000 uc001hzz.1    chr1        - 241718158 242073207
10  10000 uc001iab.1    chr1        - 241733107 242073207

For each of the 1000 probes, I want to find the closest transcript start (TXSTART).

But I don't know how to treat the strand. The raw data provide no strand information for the probes, whereas every transcript has a strand (either "+" or "-").

How can I calculate the distance from a probe coordinate to the start of a transcript that is on the "+" or "-" strand?

Can I simply ignore the "+" or "-", so that +111111 and -111111 are treated the same way? My guess is that they should be treated differently, because the genome sequence is not symmetric.
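
One common way to handle this is sketched below, under stated assumptions. UCSC-style transcript tables report TXSTART < TXEND in forward-strand coordinates whatever the strand, so for a "-" transcript, transcription begins at TXEND rather than TXSTART. Probe positions use the same coordinate system and need no strand of their own. The helper `nearest_tss`, the `TSS` and `offset` columns, and the chr8 probe position are hypothetical and for illustration only; the transcript rows are copied from the table above.

```r
# Transcript rows copied from the table above
tx <- data.frame(
  TXNAME   = c("uc003wyw.1", "uc010lte.1", "uc001iaa.2", "uc001hzz.1", "uc001iab.1"),
  TXCHROM  = c("chr8", "chr8", "chr1", "chr1", "chr1"),
  TXSTRAND = c("+", "+", "-", "-", "-"),
  TXSTART  = c(18293035, 18301794, 241731689, 241718158, 241733107),
  TXEND    = c(18303003, 18302666, 241733518, 242073207, 242073207),
  stringsAsFactors = FALSE
)

# Strand-aware transcription start site: TXSTART on "+", TXEND on "-"
tx$TSS <- ifelse(tx$TXSTRAND == "+", tx$TXSTART, tx$TXEND)

# Hypothetical helper: nearest TSS on the same chromosome as the probe
nearest_tss <- function(chrom, pos, tx) {
  cand <- tx[tx$TXCHROM == chrom, ]
  if (nrow(cand) == 0) return(NULL)
  best <- cand[which.min(abs(cand$TSS - pos)), ]
  # Signed offset along the direction of transcription:
  # negative means the probe lies upstream of the TSS
  offset <- if (best$TXSTRAND == "+") pos - best$TSS else best$TSS - pos
  data.frame(chrom = chrom, pos = pos, TXNAME = best$TXNAME,
             TXSTRAND = best$TXSTRAND, TSS = best$TSS, offset = offset,
             stringsAsFactors = FALSE)
}

hits <- rbind(
  nearest_tss("chr1", 3720637, tx),   # probe taken from the list above
  nearest_tss("chr8", 18300000, tx)   # made-up position, for illustration only
)
```

The absolute distance ignores strand once the correct TSS has been chosen; the strand matters only for picking TXSTART versus TXEND and for deciding whether the probe is upstream or downstream. For 1000 probes against all transcripts, a range-based tool such as the GenomicRanges package is likely a better fit than this row-by-row loop.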

I have only recently moved into genomics from a different field and have little experience working with genome sequences. Sorry for my naive question.

Any comments on this, even conceptual ones, would be very helpful to me.

Thank you.

Seungyeul Yoo

Postdoctoral Fellow
Institute of Genomics and Multiscale Biology
Department of Genetics and Genomic Sciences
Mount Sinai School of Medicine
(office) 212-659-6877