[BioC] unmapped keys in hugene10stprobeset.db

Tue Aug 17 19:12:50 CEST 2010

Hi Mark,

You should talk to me about annotations.  I maintain the annotation
repository here and make sure that all of the packages get re-made for
each release, etc.  This particular package was contributed and is
maintained by Arthur Li.  So I will contact the two of you off list as
needed, depending on what you find out in the "improvement" department.

Something that may help you as you explore this is that the annotations
and the SQLForge code that generates them are all Entrez Gene centric.
So to "improve" them, you need to be able to connect a probe to an
Entrez Gene ID that it was not mapped to before.  But if you have new
information about probes that map to things like microRNAs, then that
really could help, since there *are* Entrez Gene IDs for those things in
NCBI (and in our supporting "org" packages).  This is true even though
these things are not really genes in the strictest sense of the word.
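
To make the "Entrez Gene centric" point concrete, here is a minimal
sketch in base R.  Everything in it is a made-up stand-in: the probe
IDs, the Entrez Gene IDs and the `evidence` table are hypothetical, and
in practice the first two vectors would come from keys() and
mappedkeys() on hugene10stprobesetENTREZID.  It only illustrates which
rows of outside evidence could improve the package.

```r
# Toy stand-ins for keys(a) and mappedkeys(a); all IDs are invented.
all_probes    <- c("p1", "p2", "p3", "p4", "p5")
mapped_probes <- c("p1", "p2")
unmapped      <- setdiff(all_probes, mapped_probes)

# Hypothetical external evidence, e.g. probes aligned to microRNA loci.
# entrez is NA where the evidence yields no Entrez Gene ID.
evidence <- data.frame(
  probe  = c("p2", "p3", "p4", "p5"),
  entrez = c("1001", "2002", NA, "3003"),
  stringsAsFactors = FALSE
)

# Only rows that give an Entrez Gene ID to a previously unmapped probe
# can improve an Entrez Gene centric annotation package.
keep     <- evidence$probe %in% unmapped & !is.na(evidence$entrez)
improved <- evidence[keep, ]
rownames(improved) <- NULL
```

A two-column table like `improved` (probe ID, gene ID) is roughly the
shape of input I would expect to hand to SQLForge, but please check the
vignette for the exact file format.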

  Marc

On 08/16/2010 05:05 PM, Mark Cowley wrote:
> hi Paul & Marc,
> in addition to the thousands of control probes, there are non-protein-coding genes on these arrays - things like snoRNAs and precursor microRNAs which might not have a classical gene symbol.
> I find that the mrna_assignment column from the Affy csv has a lot more information for these genes than the BioC annotation packages, so I'll do what you suggest Marc & try to 'improve' the mapping via the SQLForge code. I've done a fair amount of the groundwork on this already, so who could I communicate these changes to?
>
> cheers,
> Mark
> -----------------------------------------------------
> Mark Cowley, PhD
>
> Peter Wills Bioinformatics Centre
> Garvan Institute of Medical Research, Sydney, Australia
> -----------------------------------------------------
>
> On 17/08/2010, at 9:26 AM, Marc Carlson wrote:
>
>   
>> Hi Paul,
>>
>> I looked into this for you.  Often there will be discrepancies like this
>> for purely historical reasons.  For example, Affy may have made the
>> probes based on one idea about what the transcriptome looked like and
>> then this could have changed by the time they shipped their product. 
>> That kind of discrepancy happens all the time and especially with older
>> chips.  But in your case, you also seem to have a lot of control probes
>> on this platform. 
>>
>> You can extract the unmatched probes like this:
>>
>> library(hugene10stprobeset.db)
>> a <- hugene10stprobesetENTREZID
>> # probeset IDs that have no Entrez Gene ID mapped to them
>> oddProbes <- keys(a)[!(keys(a) %in% mappedkeys(a))]
>>
>> I actually pulled down the .csv mapping from Affymetrix that Arthur Li
>> would have used to generate this database.  And I noticed that all the
>> oddProbes I was looking at were control probes.  In fact, more than 4
>> thousand of these probes are control probes.  Looking more closely at
>> this file, you will see that many, many other probes have no gene
>> mapping to them even though they are not listed as control probes.  What
>> is going on with some of those probesets?  Why has Affy refused to
>> assign an identity to those ones?  That is really more of a question for
>> Affymetrix than for us.
>>
>> When we map these IDs to make annotation packages, we look for known
>> gene IDs from the manufacturer (UniGene, RefSeq, etc.), and we then map
>> those onto Entrez Gene IDs from NCBI and from there onto other
>> annotations.  But if the people who make the array are not willing to
>> tell us what these things map to then we could really only speculate
>> about what they are.
>>
>> But if you have some external information that helps you to decide what
>> these probes really map to (maybe you have mapped the probesets onto
>> the genome yourself, or maybe you feel that you can extract a little
>> more data out of Affy's .csv file than this author did), then you can
>> always feed that "improved" mapping into the SQLForge code in the
>> AnnotationDbi package and generate your very own version of this
>> annotation package.  It is pretty straightforward to do so and is
>> described in the SQLForge vignette here:
>>
>> http://www.bioconductor.org/packages/release/bioc/html/AnnotationDbi.html
>>
>> I hope this helps explain things,
>>
>>
>>  Marc
>>
>> On 08/16/2010 02:17 PM, Paul Shannon wrote:
>>     
>>> Here's an annotation question someone might be able to help me out with.  I'll be grateful.
>>>
>>> Affymetrix describes their 'GeneChip Human Gene 1.0 ST Array':
>>>
>>> Each of the 28,869 genes is represented on the array by approximately 26 probes spread across the full length of the gene, providing a more complete and more accurate picture of gene expression than 3’ based expression array designs. ... The Gene 1.0 ST Array uses a subset of probes from the Human Exon 1.0 ST Array and covers only well-annotated content. 
>>>
>>> This sounds to me like Affy started with sequence from exons of ~29k genes and created probes.
>>> But when I look at the BioC annotation for this chip (hugene10stprobeset.db, hugene10sttranscriptcluster.db), I find that about 7% of the probes are NOT annotated to gene IDs.  The sibling map, hugene10sttranscriptclusterENTREZID, though smaller, has a higher proportion of unmapped probes.
>>>
>>>    library(hugene10stprobeset.db)
>>>    library(hugene10sttranscriptcluster.db)
>>>    bm = hugene10stprobesetENTREZID
>>>    length(keys(bm))       # 257022
>>>    count.mappedkeys(bm)   # 238141
>>>                           # unmapped: 18881
>>>    cm = hugene10sttranscriptclusterENTREZID
>>>    length(keys(cm))       # 33257
>>>    count.mappedkeys(cm)   # 21787
>>>
>>> The same proportion (unmapped/mapped) seems to hold true for hugene10stprobesetENSEMBL as well.
>>>
>>> Can anyone suggest where I can get Entrez Gene ID annotations for these unmapped probes?  Or otherwise clear up my confusion?
>>>
>>> Thanks!
>>>
>>>  - Paul
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>