[BioC] (missing?) UCSCKG -> SYMBOL mappings in Homo.sapiens (etc.)

Tue Feb 12 22:22:24 CET 2013

Hi Tim,

First of all let me assure you that we have NOT abandoned UCSC known 
gene IDs.  They have just been migrated to another field (TXNAME).  The 
reason for the deprecation is just so that people don't rely on getting 
them in this location (UCSCKG).   The rationale is that people should be 
able to get something that is in actuality a transcript ID from a 
transcript oriented object.  In spite of the severe sounding deprecation 
warning, these IDs have actually been updated with every release.  I 
have (so far) just kept updating them simply because I did not want to 
inconvenience anyone by making them go away too soon.  My hope was that 
after enough time had elapsed I could quietly remove them with minimal 
pain.  So don't panic.  But please don't keep using them either.

So the most important thing to know is that you should get things like 
UCSC known gene IDs from the TXNAME field of a TranscriptDb or 
OrganismDb.  (Where appropriate, that is: not every transcriptome even 
has known gene IDs.)

So to look up a gene symbol from a knownGene name you should be trying 
to do it like this:

library(Homo.sapiens)
select(Homo.sapiens, cols=c("SYMBOL","TXNAME"), keys=c("uc002yjx.2"), 
keytype="TXNAME")
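
If you need to label several transcripts at once, the same call can be 
wrapped up roughly as follows.  This is only a sketch: tx_ids and 
tx2symbol are placeholder names made up for illustration, and depending 
on your AnnotationDbi version the argument is spelled either cols or 
columns (the sketch assumes columns).

```r
library(Homo.sapiens)

# Placeholder input: the knownGene transcript IDs you want to label
tx_ids <- c("uc002yjx.2")

# Assumption: this AnnotationDbi version spells the argument `columns`;
# older releases call it `cols`
res <- select(Homo.sapiens, keys=tx_ids,
              columns=c("SYMBOL", "TXNAME"), keytype="TXNAME")

# Named character vector: transcript ID -> gene symbol
tx2symbol <- setNames(res$SYMBOL, res$TXNAME)
```

Any ID that has no symbol should simply come back as NA in tx2symbol, 
so it is worth checking for those before using the result as labels.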

As for the other issues you are having with the specific IDs you were 
looking for, I have been investigating, and they appear to trace back 
to UCSC's genome browser (and its associated resources).  I will 
therefore be moving this thread to the bioc-devel list for the rest of 
the discussion.  Any interested parties can tune in over there.

   Marc

On 02/12/2013 10:04 AM, Tim Triche, Jr. wrote:
> re:  '[BioC] question about Gviz' thread fallout:
>
> Yesterday I rolled a relatively simple programmatic way to label UCSC
> KnownGene entries with their symbols.  However, some isoforms (e.g. some
> for NRIP1 and CDKN2B) seem to be missing from the mappings.
>
> Investigating a bit, and referring to ?org.Hs.egUCSCKG, I find
>
> ...This mapping is based on the very latest build available at UCSC
>     for this organism as of March 2010.  2.6 is the last release where
>     you can expect it to be here.  The GenomicFeatures package
>     contains functionality that replaces the need for this mapping...
>
> Alas, I'm too thick to find where, in the TxDb or elsewhere, I could
> retrieve Hugo IDs for UCSC KnownGene entries without using org.Hs.egSYMBOL.
>   The latter is what I usually do:
>
>    library(Homo.sapiens)
>
>    txs <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene)
>    head(names(txs))
>    ## [1] "1"         "10"        "100"       "1000"      "10000"     "100008586"
>
>    names(txs) <- mget(names(txs), org.Hs.egSYMBOL, ifnotfound=NA)
>    head(names(txs))
>    ## [1] "A1BG"    "NAT2"    "ADA"     "CDH2"    "AKT3"    "GAGE12F"
>
> Now, I thought for a while, hell, this gets them all!  But, not really...
>
>    txs$NRIP1
>    ## GRanges with 1 range and 2 metadata columns:
>    ##       seqnames               ranges strand |     tx_id     tx_name
>    ##          <Rle>            <IRanges>  <Rle> | <integer> <character>
>    ##   [1]    chr21 [16333556, 16437126]      - |     71301  uc002yjx.2
>
> Well, that's one of the isoforms.  But what about the other ones?
>
>    org.Hs.egUCSCKG[[ "uc002yjx.1" ]]
>    ## NULL
>
>    org.Hs.egUCSCKG[[ "uc010gkz.1" ]]
>    ## NULL
>
> I know UCSC identifiers can be a bit of a pain in the ass, but there do
> exist mappings for these.  If they're going to be used as primary
> identifiers for the TxDb packages, would it be possible to update them?
>
> If it's an issue of time constraints, I will take a stab at it, but that
> will almost guarantee more prattling from me on the mailing list.  On the
> other hand, it might move GAF3.0 annotations out of the station.
>
> Much obliged for any insights from the core developers.
>
>