[BioC] get position information for different gene/transcript IDs

Wed Aug 18 18:56:33 CEST 2010

Hi Shirley,

You can find this information in an organism annotation package, for
example org.Hs.eg.db for human.

The harder part will be getting all the information for the mixed bag
of IDs that you describe.  You can still retrieve it for each ID type
separately, though, as in this example for RefSeq IDs:

## For RefSeq, start by getting the Entrez Gene ID
## ("foo" is a deliberately invalid ID; it is dropped by the NA filter)
library(org.Hs.eg.db)
a = c("NM_000014", "NM_000015", "foo")
gene_id = unlist2(mget(a, revmap(org.Hs.egREFSEQ), ifnotfound=NA))
gene_id = gene_id[!is.na(gene_id)]

## And then once you have the IDs, you can map them to a position
chrloc = toTable(org.Hs.egCHRLOC[gene_id])
chrlocend = toTable(org.Hs.egCHRLOCEND[gene_id])

## And now, because we were careful to make sure that we always have an
## Entrez Gene ID, you can just merge all the results together to make
## this easier to look at:
merge(merge(data.frame(gene_id=gene_id, refseq=names(gene_id)), chrloc),
      chrlocend)
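
One thing to watch for in the merged table: the CHRLOC and CHRLOCEND maps
report genes on the antisense strand with a minus sign in front of the
coordinate.  Below is a minimal sketch of how you might normalize that and
then compute a SNP-to-gene distance.  The data frame uses made-up
coordinates shaped like the merged result, and snp_gene_distance() is a
hypothetical helper, not something from the package.

```r
## Made-up rows shaped like the merged table (coordinates are illustrative)
genes <- data.frame(
  gene_id        = c("1", "2"),
  Chromosome     = c("12", "8"),
  start_location = c(-9000, 18000),
  end_location   = c(-9500, 18400),
  stringsAsFactors = FALSE
)

## A negative coordinate marks the antisense strand
genes$strand <- ifelse(genes$start_location < 0, "-", "+")

## Unsigned interval, ordered so that start <= end
genes$start <- pmin(abs(genes$start_location), abs(genes$end_location))
genes$end   <- pmax(abs(genes$start_location), abs(genes$end_location))

## Hypothetical helper: distance from a SNP position to a gene interval.
## Returns 0 when the SNP lies inside the gene.  Only meaningful when the
## SNP and the gene are on the same chromosome.
snp_gene_distance <- function(snp_pos, start, end) {
  pmax(start - snp_pos, snp_pos - end, 0)
}

snp_gene_distance(c(8000, 9200, 10000), genes$start[1], genes$end[1])
```

You would still need to match each SNP to genes on its own chromosome
before calling the helper.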

## The same process can then be repeated for other kinds of IDs, e.g. UniGene:
b = c("Hs.100217", "Hs.100299")
gene_id = unlist2(mget(b, revmap(org.Hs.egUNIGENE), ifnotfound=NA))
gene_id = gene_id[!is.na(gene_id)]

chrloc = toTable(org.Hs.egCHRLOC[gene_id])
chrlocend = toTable(org.Hs.egCHRLOCEND[gene_id])
## etc.
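
Since the steps are the same for every ID type, they could be wrapped in a
small function.  This is only a sketch under the assumption that the same
org.Hs.eg.db maps used above are available; ids_to_positions() and the
input_id column are names made up for illustration.  For gene symbols and
aliases, org.Hs.egALIAS2EG already maps towards Entrez Gene IDs, so it can
be passed without revmap().

```r
library(org.Hs.eg.db)

## Hypothetical helper: `ids` is a character vector of one ID type and
## `map` is a Bimap going from that ID type to Entrez Gene IDs.
ids_to_positions <- function(ids, map) {
  gene_id <- unlist2(mget(ids, map, ifnotfound = NA))
  gene_id <- gene_id[!is.na(gene_id)]          # drop IDs that were not found

  ## Keep track of which input ID produced which Entrez Gene ID
  lookup <- data.frame(gene_id  = unname(gene_id),
                       input_id = names(gene_id),
                       stringsAsFactors = FALSE)
  if (nrow(lookup) == 0) return(lookup)

  keys      <- unique(lookup$gene_id)
  chrloc    <- toTable(org.Hs.egCHRLOC[keys])
  chrlocend <- toTable(org.Hs.egCHRLOCEND[keys])

  ## Merge on the shared columns, as in the RefSeq example
  merge(merge(lookup, chrloc), chrlocend)
}

refseq_pos  <- ids_to_positions(c("NM_000014", "NM_000015", "foo"),
                                revmap(org.Hs.egREFSEQ))
unigene_pos <- ids_to_positions(c("Hs.100217", "Hs.100299"),
                                revmap(org.Hs.egUNIGENE))
alias_pos   <- ids_to_positions("A2M", org.Hs.egALIAS2EG)
```

Be aware that a gene with several annotated locations may produce more than
one row per input ID after the merge, so check the result before using it.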

Does that help?

  Marc

On 08/17/2010 07:26 PM, shirley zhang wrote:
> Dear list,
>
> I have a list of human SNPs and genes/transcripts which are obtained
> from different studies.  I would like to calculate the distance
> between these SNPs and genes/transcripts on the same chromosome.
> However, the ID for the gene/transcript across studies  is not
> consistent. For example, the ID could be Gene Symbol/Alias (TRIM5l),
> RNA nucleotide accession (NM_013387), XM_173221 (its gene symbol is
> LOC254266, but its record is an obsolete version), Contig22780_RC,
> HSS00001378, Hs.445401 (UniGene entry Hs.445401), etc.
>
> I would like to retrieve the start and end positions for those
> genes/transcripts. Can anybody tell me how to get the position
> information based on these different gene/transcript IDs? If I also
> need to match all of the other IDs to gene symbol/Entrez ID, what kind
> of database or Bioconductor package will allow me to do the mapping as
> accurately as possible? How about the biomaRt package?
>
> Thanks