[BioC] get position information for different gene/transcript IDs

Thu Aug 19 20:50:21 CEST 2010

Hi Shirly,

That is a very good question!  Looking at the latest builds of the
annotations (which is what we normally would recommend), the manual page
will now tell you that the source for the latest CHRLOC and CHRLOCEND
information is from here:

     Mappings were based on data provided by: UCSC Genome
     Bioinformatics (Homo sapiens)
     ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19

That means that this information corresponds to the more recent "hg19"
build which in turn will correspond to "build 37".  So if you want older
builds you could use older versions of the org packages, but I since I
would never in good conscience recommend using stale annotations, I
think that in that case you would be much better off to do the initial
steps I showed you (to get the entrez gene IDs), and THEN using the
GenomicFeatures package, which should allow you to get positions for the
hg18 build mapped onto entrez gene IDs.

You can see a vignette for GenomicFeatures here:

http://www.bioconductor.org/packages/release/bioc/html/GenomicFeatures.html

You would basically just need to use makeTranscriptDbFromUCSC(),
serialize the result for safekeeping, and then call transcripts() on the
object it gave you.  Once you had done so, you could simply ask for the

library(GenomicFeatures)
foo = makeTranscriptDbFromUCSC(genome="hg18",tablename="knownGene")
bar = transcriptsBy(foo, by="gene")
## At this point you will have  GRanges object which is really useful
for a lot of great stuff,
## but since you are only interested in matching the gene_ids to the
positions,
## you can just pare it down to a much simpler object like so:

frm = as.data.frame(bar)

## The values in "element" will be the names of the elements from the
GRangesList object bar. 
## If you check them you will see that for this UCSC track, they are the
entrez gene IDs that you want to merge with.

Hope this helps,

  Marc

On 08/18/2010 06:26 PM, shirley zhang wrote:
> Hi Marc,
>
> Thanks a lot for your information. It is very very helpful. May I ask
> one more question?
>
> In my case,  the snp_position is based on Human Build 36.  The package
> org.Hs.eg.db was updated in October 2009, so can I assume it is on
> Build 36?
>
> Thanks again,
> Shirely
>
> On Wed, Aug 18, 2010 at 12:56 PM, Marc Carlson <mcarlson at fhcrc.org> wrote:
>   
>> Hi Shirley,
>>
>> You can find this information in an organism package.  org.Hs.eg.db for
>> humans (for example).
>>
>> What will be harder will be getting all the information from the mixed
>> bag of IDs that you describe.  But you can still retrieve it for each ID
>> type separately like this example for refseq IDs:
>>
>> ## So for refseq you can start by getting the entrez gene ID
>> library(org.Hs.eg.db)
>> a = c("NM_000014", "NM_000015", "foo")
>> gene_id = unlist2(mget(a, revmap(org.Hs.egREFSEQ), ifnotfound=NA))
>> gene_id = gene_id[!is.na(gene_id)]
>>
>> ## And then once you have the IDs, you can map them to a position
>> chrloc = toTable(org.Hs.egCHRLOC[gene_id])
>> chrlocend = toTable(org.Hs.egCHRLOCEND[gene_id])
>>
>> ## And now because we were careful to make sure that we always have an
>> entrez gene ID,
>> ## you can just merge all the results together to make this easier to
>> look at:
>> merge(merge(data.frame(gene_id=gene_id, refseq=names(gene_id)), chrloc),
>> chrlocend)
>>
>>
>> ## This process can then be repeated for other kinds of IDs etc:
>> b = c("Hs.100217", "Hs.100299")
>> gene_id = unlist2(mget(b, revmap(org.Hs.egUNIGENE), ifnotfound=NA))
>> gene_id = gene_id[!is.na(gene_id)]
>>
>> chrloc = toTable(org.Hs.egCHRLOC[gene_id])
>> chrlocend = toTable(org.Hs.egCHRLOCEND[gene_id])
>> ## etc.
>>
>> Does that help?
>>
>>
>>  Marc
>>
>>
>>
>>
>>
>> On 08/17/2010 07:26 PM, shirley zhang wrote:
>>     
>>> Dear list,
>>>
>>> I have a list of human SNPs and genes/transcripts which are obtained
>>> from different studies.  I would like to calculate the distance
>>> between these SNPs and genes/transcripts on the same chromosome.
>>> However, the ID for the gene/transcript across studies  is not
>>> consistent. For example, the ID could be Gene Symbol/Alias (TRIM5l),
>>> RNA nucleotide accession (NM_013387), XM_173221 (its gene symbol is
>>> LOC254266, but its record is an obsolete version), Contig22780_RC,
>>> HSS00001378, Hs.445401 (UniGene entry Hs.445401), etc.
>>>
>>> I would like to retrieve the start and end position for those
>>> gene/transcript. Can anybody help me how to get the position
>>> information based on these different gene/transcript IDs. If I also
>>> need to match all of other IDs to gene symbol/entrezID, what kind of
>>> database or bioConductor package will allow me to do the mapping as
>>> perfect as possible? How about biomaRt package?
>>>
>>> Thanks
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>>       
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>     
>
>
>