[BioC] How to get position for each gene ID/gene symbol instead of position for each transcript

Wed Aug 25 04:20:42 CEST 2010

Hi Shirley,

On Tue, Aug 24, 2010 at 9:07 PM, shirley zhang <shirley0818 at gmail.com> wrote:
> Hi Marc,
>
> Thanks a lot for your detailed reply. It is very helpful.
>
> I tried library "GenomicFeatures" and your code. It works great. I did
> get position for each Transcript.  However I am only interested in
> obtaining position for each gene ID/gene symbol instead of position
> for each transcript. For example, if one gene has multiple
> transcripts, I only want the biggest boundary of each gene which
> includes all of its transcripts. Does GenomicFeatures or other package
> can do this? Or I have to compute the gene position by combining all
> of its corresponding transcripts?

You can do this pretty "simply" with GenomicFeatures, if you want to
stick with that:

R> txdb <- loadFeatures('your.transcript.db')
R> xcripts <- transcriptsBy(txdb, by='gene')

## This part is really slow -- this will be subject of next email
R> gene.bounds <- seqapply(xcripts, reduce)

the names() of gene.bounds is the entrez.id of the gene. You can use
the org.Hs.eg.db pacakges

R> library(org.Hs.eg.db)
R> symbols <- mget(names(gene.bounds), org.Hs.egSYMBOL, ifnotfound=NA)

symbols will now be a list (names are entrez ids, values are the gene
symbols) that you can manipulate in "the standard R way"

Hope that helps,

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact