[BioC] question about TranscriptDb

Tue Dec 11 00:34:50 CET 2012

Unfortunately, if we did that, there could be all sorts of unwanted 
consequences.

By doing this, you would be introducing an arbitrary number of new 
strings as IDs for all of these orphaned transcripts.  And unlike NAs 
(which is the traditional way of indicating that data is missing in R), 
you would get no warnings about any of these when you used them in 
subsequent analysis.  Others could use your new faux IDs to get into 
all sorts of trouble.  And it would be even worse because they would be 
mixed in with real IDs (Entrez Gene IDs), which would lend them a 
confusing air of authenticity.  Downstream users might even mix the faux 
IDs from different species, etc.
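
To make the contrast concrete, here is a minimal base-R sketch.  All of 
the IDs in it are invented for illustration (including the UNKNOWN1 
placeholder), so treat it as a toy and not as real annotation data.  It 
shows that an NA is easy to detect, while a placeholder string looks 
complete and can even collide with a placeholder from another species:

```r
# Gene IDs for two hypothetical human transcripts; the second one is orphaned.
gene_na   <- c("1001", NA)            # missing value recorded as NA
gene_faux <- c("1001", "UNKNOWN1")    # same data with a placeholder ID

# Gene IDs for two hypothetical mouse transcripts, using the same scheme.
mouse_faux <- c("2002", "UNKNOWN1")

# With NA, the gap is easy to count and to filter out.
n_missing <- sum(is.na(gene_na))

# With a placeholder, nothing looks missing any more ...
looks_complete <- !anyNA(gene_faux)

# ... and unrelated orphans from two species now appear to share a "gene".
shared <- intersect(gene_faux, mouse_faux)
```

Here `shared` ends up holding the placeholder even though the two 
orphaned transcripts have nothing to do with each other.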

And even if we accepted the risks, we don't even have a good way of 
always grouping the unassigned transcripts, which means that transcripts 
that are probably from the same gene will be assigned like this:

unknown1 = tx1 (overlaps with tx2)
unknown2 = tx2 (overlaps with tx1)
etc.

Which means that this strategy would also end up implying things that we 
know are sometimes not going to be true.  Meanwhile these half-wrong 
unknown transcript assignments will be mixed in with the "real" ones...
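
A small sketch of that grouping problem, again in base R with invented 
transcript names and coordinates (assume they all sit on the same 
chromosome and strand).  The naive scheme hands out one fresh placeholder 
per orphaned transcript, so two overlapping transcripts end up looking 
like two different genes:

```r
# Hypothetical orphaned transcripts; coordinates are made up.
orphans <- data.frame(tx    = c("tx1", "tx2", "tx3"),
                      start = c(100, 150, 900),
                      end   = c(300, 400, 1000),
                      stringsAsFactors = FALSE)

# Naive scheme: one new placeholder ID per orphaned transcript.
orphans$faux_gene <- paste0("unknown", seq_len(nrow(orphans)))

# Closed-interval overlap test between rows i and j.
overlaps <- function(i, j) {
  orphans$start[i] <= orphans$end[j] && orphans$start[j] <= orphans$end[i]
}

# tx1 and tx2 overlap, yet they carry different placeholder "genes".
tx1_tx2_overlap  <- overlaps(1, 2)
tx1_tx2_same_id  <- orphans$faux_gene[1] == orphans$faux_gene[2]
```

Merging by overlap instead would only trade one guess for another, since 
overlapping transcripts are not guaranteed to come from the same gene 
either.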

I could go on and on, but I am hoping you can see some of what I am 
concerned about?

Anyhow, you can already discover which genes are associated with 
transcripts in many other ways.  The simplest approach is probably to 
just use select() like this:

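library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
# all transcript names in the package
k <- keys(txdb, "TXNAME")
# look up the gene ID for each transcript name
# ('columns' replaces the older 'cols' argument name)
res <- select(txdb, columns=c("TXNAME","GENEID"), keys=k, keytype="TXNAME")
head(res)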
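
Once you have a table like res, the transcripts without a gene should 
show up as NA in the GENEID column and can be pulled out directly.  A 
small sketch, using a hand-made stand-in for res so that it runs without 
the annotation package (uc021ums.1 is the transcript from this thread; 
the other row and its gene ID are invented):

```r
# Stand-in for the data.frame returned by select(); values are illustrative.
res_toy <- data.frame(TXNAME = c("uc021ums.1", "txB"),
                      GENEID = c(NA, "1001"),
                      stringsAsFactors = FALSE)

# Transcript names with no gene ID attached.
orphaned <- res_toy$TXNAME[is.na(res_toy$GENEID)]

# Rows that do have a gene ID, e.g. for gene-level summaries.
with_gene <- res_toy[!is.na(res_toy$GENEID), ]
```

This keeps the missing assignments explicit, so you can decide per 
analysis whether to drop them or handle them separately.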

Alternatively, you could also do something like this (if you had 
already called transcripts() as below):

t <- transcripts(txdb,columns="gene_id")
as.character(mcols(t)$gene_id)
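
The gene_id column that transcripts() returns is list-like, with one 
element per transcript and an empty element where no gene is attached, 
so it can help to flatten it explicitly.  A rough sketch on a plain R 
list standing in for that column (the values are invented, and I am 
assuming you would convert the real column with as.list() first):

```r
# Stand-in for the list-like gene_id column; values are illustrative.
gene_list <- list(character(0), "1001", c("1001", "1002"))

# One string per transcript: NA when empty, IDs joined by ";" otherwise.
flat <- vapply(gene_list, function(g) {
  if (length(g) == 0L) NA_character_ else paste(g, collapse = ";")
}, character(1))
```

The ";" separator is just a choice made for this sketch.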

   Marc

On 12/10/2012 12:25 PM, Ryan C. Thompson wrote:
> I have also been bitten by the fact that some transcripts are missing 
> gene IDs. Is it possible to add placeholder gene IDs to these? For 
> example, just assigning them UNKNOWN1, UNKNOWN2, etc.?
>
> On Mon 10 Dec 2012 11:40:35 AM PST, Marc Carlson wrote:
>> Hi Matthew,
>>
>> Thanks for your detailed exploration of this. After looking more
>> closely, I think the confusion here is being caused by the fact that you
>> are looking at the kgXref table, whereas what was used to attach
>> gene IDs to the TxDb database is the knownToLocusLink
>> <http://genome.ucsc.edu/cgi-bin/hgTables?hgsid=316115443&hgta_doSchemaDb=hg19&hgta_doSchemaTable=knownToLocusLink> 
>>
>> table.  Adding to the mayhem, UCSC has apparently decided to allow
>> different values to exist in the latest versions of these two tables.
>>
>> We chose to use the Entrez Gene IDs as gene identifiers because (unlike
>> gene symbols) they represent a real identifier and can thus be relied on
>> to not have multiple different meanings etc.
>>
>>
>>    Marc
>>
>>
>>
>> On 12/10/2012 09:06 AM, Matthew D. Wilkerson wrote:
>>> Hello,
>>>
>>> I have a question about the gene_id attribute of
>>> TxDb.Hsapiens.UCSC.hg19.knownGene, version 2.80 (latest).
>>>
>>> I noticed that some transcripts such as uc021ums.1, do not have an
>>> associated gene_id.
>>>
>>> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
>>> t <- transcripts(txdb, columns=c("gene_id","tx_id","tx_name","cds_id","cds_name"))
>>>
>>> t[ which(elementMetadata(t)[,"tx_name"]=="uc021ums.1"), ]
>>>
>>> I understand that some ucsc genes might not have an entrez gene id
>>> associated.
>>> I checked this locus and found that currently UCSC db does have this
>>> locus associated with LINGO3.
>>>
>>> #field                       value
>>> hg19.knownGene.name          uc021ums.1
>>> hg19.knownGene.chrom         chr19
>>> hg19.knownGene.strand        -
>>> hg19.knownGene.txStart       2289996
>>> hg19.knownGene.txEnd         2291775
>>> hg19.knownGene.cdsStart      2289996
>>> hg19.knownGene.cdsEnd        2291775
>>> hg19.knownGene.exonCount     1
>>> hg19.knownGene.exonStarts    2289996,
>>> hg19.knownGene.exonEnds      2291775,
>>> hg19.knownGene.proteinID     P0C6S8
>>> hg19.knownGene.alignID       uc021ums.1
>>> hg19.kgXref.kgID             uc021ums.1
>>> hg19.kgXref.geneSymbol       LINGO3
>>>
>>>
>>> The kgXref table was last updated  2/5/12.
>>>
>>>
>>> The bioconductor package was made on:
>>> Creation time: 2012-09-10 12:56:25 -0700 (Mon, 10 Sep 2012)
>>>
>>> If this date also refers to the date of download, then why is this
>>> transcript not affiliated with LINGO3?
>>> If not, then what date does known gene refer to?
>>>
>>>
>>> Thanks,
>>> Matt
>>>