[BioC] Affymetrix HuGene 2.0 ST annotation

Wed Jun 4 16:54:55 CEST 2014

Hi Natasha,

On 6/4/2014 10:20 AM, Natasha [guest] wrote:
> Dear List,
>
> I recently came across this post, that helped me in the analysis of data using this array.
> https://stat.ethz.ch/pipermail/bioconductor/2014-May/059408.html
>
> However, I am concerned about the annotation and wondered if what I get is usual for this kind of array.
>
> Code:
> eset_mat <- as.matrix(Eset)
> dim(eset_mat) #53617     6
>
> library(annotate)
> library(hugene20sttranscriptcluster.db)
>
> annodb <- "hugene20sttranscriptcluster.db"
> ID     <- featureNames(Eset)
> Symbol <- as.character(lookUp(ID, annodb, "SYMBOL"))
> Name   <- as.character(lookUp(ID, annodb, "GENENAME"))
> Entrez <- as.character(lookUp(ID, annodb, "ENTREZID"))
> Ensembl <- as.character(lookUp(ID, annodb, "ENSEMBL"))
>
> annot = data.frame("ID"=ID,"Symbol"=Symbol,"Description"=Name,"EntrezID"=Entrez,"EnsemblID"=Ensembl)
>
> length(which(Symbol != "NA")) # 23672 =====> is this normal?
> length(Symbol)  # 53617
> -----
> Is it normal to get <50% annotation?

Sort of. You are using an old method of annotating data that still
exists for backwards compatibility, but it is not how you should be
doing things these days. Note also that this old method masks any
probesets with a one-to-many annotation. If we toggle this masking
off, you get about 7000 more symbols:

 > z <- toggleProbes(hugene20sttranscriptclusterSYMBOL, "all")
 > symbol2 <- unlist(mget(ID, z))
 > symbol2 <- symbol2[!is.na(symbol2)]
 > sum(!duplicated(names(symbol2)))
[1] 30769
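
To make the effect of the masking concrete, here is a minimal base-R sketch on a made-up table. The probeset IDs and symbols are hypothetical, and it only imitates the counting logic rather than what toggleProbes() does internally:

```r
# Toy probeset-to-symbol table (hypothetical IDs and symbols, not real annotation).
map <- data.frame(
  PROBEID = c("p1", "p2", "p2", "p3", "p4", "p4", "p4"),
  SYMBOL  = c("A",  "B",  "C",  NA,   "D",  "E",  "F"),
  stringsAsFactors = FALSE
)

# Number of non-NA symbols attached to each probeset.
n_hits <- tapply(!is.na(map$SYMBOL), map$PROBEID, sum)

# "Masked" view: only probesets with exactly one symbol count as annotated.
annotated_masked <- sum(n_hits == 1)

# "Unmasked" view: any probeset with at least one symbol counts as annotated.
annotated_all <- sum(n_hits >= 1)

cat("masked:", annotated_masked, " unmasked:", annotated_all, "\n")
```

In this toy table, p2 and p4 only count as annotated once the one-to-many mappings are allowed through, which is the same kind of jump as from 23672 to 30769 above.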

The masking also hides the fact that a given probeset might interrogate
many different genes:

 > z <- select(hugene20sttranscriptcluster.db,
 +             keys(hugene20sttranscriptcluster.db),
 +             c("SYMBOL","GENENAME","ENTREZID","ENSEMBL"))
Warning message:
In .generateExtraRows(tab, keys, jointype) :
   'select' resulted in 1:many mapping between keys and return rows
 > dim(z)
[1] 80172     5

 > zlst <- split(z, z[,1])
 > zlst[sapply(zlst, nrow) > 5][5]
$`16659407`
       PROBEID   SYMBOL               GENENAME ENTREZID         ENSEMBL
3821 16659407  PRAMEF5  PRAME family member 5   343068 ENSG00000204502
3822 16659407  PRAMEF5  PRAME family member 5   343068 ENSG00000232423
3823 16659407 PRAMEF23 PRAME family member 23   729368 ENSG00000232423
3824 16659407  PRAMEF6  PRAME family member 6   440561 ENSG00000232423
3825 16659407 PRAMEF15 PRAME family member 15   653619 ENSG00000157358
3826 16659407 PRAMEF15 PRAME family member 15   653619 ENSG00000204501
3827 16659407  PRAMEF9  PRAME family member 9   343070 ENSG00000204501
3828 16659407  PRAMEF9  PRAME family member 9   343070 ENSG00000157358
3829 16659407 PRAMEF11 PRAME family member 11   440560 ENSG00000204513
3830 16659407  PRAMEF4  PRAME family member 4   400735 ENSG00000243073

And how you deal with these one-to-many mappings is not trivial.
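
As a rough illustration only, here are three common ways to reduce such a table to one row (or none) per probeset. This is a sketch on a made-up data frame standing in for the output of select(); the IDs, symbols, and the helper collapse_unique() are hypothetical, and which option is appropriate depends on the analysis:

```r
# Toy stand-in for the data frame returned by select() (hypothetical values).
z_toy <- data.frame(
  PROBEID = c("p1", "p2", "p2", "p2", "p3"),
  SYMBOL  = c("A",  "B",  "B",  "C",  NA),
  stringsAsFactors = FALSE
)

# Option 1: keep the first row per probeset (simple, but arbitrary).
first_hit <- z_toy[!duplicated(z_toy$PROBEID), ]

# Option 2: collapse all unique, non-NA symbols into one string per probeset.
collapse_unique <- function(x) {
  x <- unique(x[!is.na(x)])
  if (length(x) == 0) NA_character_ else paste(x, collapse = "; ")
}
collapsed <- tapply(z_toy$SYMBOL, z_toy$PROBEID, collapse_unique)

# Option 3: keep only probesets that map to exactly one distinct symbol.
n_symbols <- tapply(z_toy$SYMBOL, z_toy$PROBEID,
                    function(x) length(unique(x[!is.na(x)])))
unambiguous <- names(n_symbols)[n_symbols == 1]

print(first_hit)
print(collapsed)
print(unambiguous)
```

Option 1 silently picks one gene out of several, option 2 keeps everything but gives strings that are awkward to use downstream, and option 3 throws away probesets like the PRAME family example above.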

Best,

Jim

>
> (At present I have not done any filtering pre limma, used all 53K+ probes for DE).
>
>
> Many Thanks,
> Natasha

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099