[BioC] observations on affyprobeminer

Sun May 4 11:46:41 CEST 2008

> I have recently explored the use of alternative CDFs from affyprobeminer
> (APM) on a 36-array dataset derived using the Affy rat2302 chipset. I used
> both the Affy cdf and the transcript-level affyprobeminer cdf. I
> preprocessed using RMA, filtered using an A/P filter, and statistically
> analyzed using an appropriate lme model followed by qvalue FDR correction.
> I set my FDR threshold at 5%. I eliminated duplicate genes by picking the
> one with the lowest p-value.
>
> Using the Affy cdf, I got ~2000 sig. genes, with APM ~1000. If I choose
> only those EntrezGene identifiers present on both cdfs, my number sig. with
> the APM cdf was ~1000 and there was a 90% overlap with the Affy sig. list.
> My conclusion from the latter observation is that I am measuring largely
> the same transcripts/genes with both CDFs.
>
> I was interested in the ~1000 genes which are annotated with the Affy CDF
> but not the APM cdf. Following the logic behind APM, I would assume that
> these would be largely incorrectly annotated probesets or probesets that
> are not really measuring any "real" transcript. This list should, then,
> consist largely of random genes. To test this hypothesis, I used the
> Category package to test for over-representation of GO and KEGG categories
> in my various lists. What I found was a huge degree of overlap between:
> 1. the affy genes also annotated with APM, 2. the affy genes not annotated
> with APM, 3. the genes derived solely from APM.
>
> My conclusion from this latest observation is that APM is not annotating a
> large number of genes/transcripts that are in fact real. Assuming that APM
> is correctly throwing out some "junk" probesets, is it throwing out the
> baby with the bathwater?

Not necessarily.

With Affymetrix mappings, there are a large number of cases in which
there are multiple probesets for a "gene" (in the example below
with hgu133a, that represents 20% of the probesets), and those probesets
can be collapsed into one when remapping.
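
To make the collapsing effect concrete, here is a minimal sketch in base R.
The mapping table is made up (the probeset and gene names are placeholders,
not taken from hgu133a), and it only illustrates one way to count the
probesets that share a gene symbol and how many units remain once they are
collapsed to one per gene.

```r
# Toy probeset-to-gene mapping; all identifiers are invented for illustration
probe2gene <- data.frame(
  probeset = c("ps1", "ps2", "ps3", "ps4", "ps5", "ps6"),
  symbol   = c("GENE_A", "GENE_A", "GENE_A", "GENE_B", "GENE_C", "GENE_C"),
  stringsAsFactors = FALSE
)

# Number of probesets annotated to each gene symbol
perGene <- table(probe2gene$symbol)

# Fraction of probesets that share their gene symbol with another probeset
multi <- names(perGene)[perGene > 1]
fracShared <- mean(probe2gene$symbol %in% multi)

# Number of units before and after collapsing to one unit per gene
nBefore <- nrow(probe2gene)
nAfter  <- length(perGene)
```

With a real annotation package the same counting could be applied to the
full probeset-to-symbol table, so that a drop in the number of reported
"genes" can be separated from a loss of measured transcripts.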

Here is an example with a few probesets (mostly copied from one of the
examples in the "altcdfenvs" vignette):

geneSymbols <- c("IGKC", "IL8", "NENF", "TRIO")

# Count the probesets associated with our geneSymbols
library(hgu133a.db)
sapply(geneSymbols,
       function(x) length(mappedkeys(subset(hgu133aSYMBOL, Rkeys=x))))
# This returns:
#IGKC  IL8 NENF TRIO
#  15   12    6    9
# Which means that there are 9 probesets for TRIO, 6 for NENF, etc...

# Now let's check what comes out of remapping
library(altcdfenvs)
library(hgu133aprobe)  # probe sequences used by matchAffyProbes() below
library(biomaRt)
mart <- useMart("ensembl", dataset="hsapiens_gene_ensembl")

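# Fetch the cDNA sequences for a gene symbol, one named entry per transcript
getSeq <- function(name) {
  seq <- getSequence(id=name, type="hgnc_symbol",
                     seqType="cdna", mart = mart)

  targets <- seq$cdna
  # Nothing returned for this symbol: give back an empty vector
  if (length(targets) == 0)
    return(character(0))
  names(targets) <- paste(seq$hgnc_symbol, seq_len(nrow(seq)), sep="-")
  return(targets)
}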

targets <- unlist(lapply(geneSymbols,
                         getSeq))
m <- matchAffyProbes(hgu133aprobe, targets, "HG-U133A")

hg <- toHypergraph(m)

gn <- toGraphNEL(hg)

library(RColorBrewer)
col <- brewer.pal(length(geneSymbols)+1, "Set1")
tColors <- rep(col[length(col)], length=numNodes(gn))
names(tColors) <- nodes(gn)
for (col_i in 1:(length(col)-1)) {
  node_i <- grep(paste("^", geneSymbols[col_i],
                       "-", sep=""),
                       names(tColors))
  tColors[node_i] <- col[col_i]
}

nAttrs <- list(fillcolor = tColors)

library(Rgraphviz)
plot(gn, "twopi", nodeAttrs=nAttrs)

# The plot will show that the situation is not as simple for TRIO as
# it is for the other gene symbols.

> I'd be interested to hear the thoughts and experiences of others. I've
> certainly run into occasions where Affy annotated probesets turn out to
> represent introns or something other than they purport to be, and I was
> hoping that APM would solve this problem, but I don't want to use it if it
> means a massive loss of truly significant data.

The situation is indeed not always clear... at the moment, I would not
advise you to blindly follow any particular mapping, but rather to keep
alternative mappings as part of your routine analysis: depending on the
cost of follow-up experiments, or of downstream analysis, time should be
spent looking at the probesets in detail.
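
As a rough sketch of what such a routine comparison could look like, the
base R code below takes per-probeset results obtained with two mappings,
keeps the smallest q-value per gene, and lists the genes called significant
with both mappings or with only one. The data frames, the column names, and
the helper sigGenes() are hypothetical and only there for illustration.

```r
# Hypothetical per-probeset results: Entrez Gene ID and q-value (toy numbers)
resAffy <- data.frame(
  entrez = c("101", "101", "102", "103", "104"),
  q      = c(0.01, 0.20, 0.03, 0.04, 0.60),
  stringsAsFactors = FALSE
)
resAlt <- data.frame(
  entrez = c("101", "102", "105"),
  q      = c(0.02, 0.04, 0.01),
  stringsAsFactors = FALSE
)

# Keep the smallest q-value per gene, then return the genes under the threshold
sigGenes <- function(res, threshold = 0.05) {
  best <- tapply(res$q, res$entrez, min)
  names(best)[best < threshold]
}

sigAffy <- sigGenes(resAffy)
sigAlt  <- sigGenes(resAlt)

# Genes significant with both mappings, or with only one of them
both     <- intersect(sigAffy, sigAlt)
onlyAffy <- setdiff(sigAffy, sigAlt)
onlyAlt  <- setdiff(sigAlt, sigAffy)
```

The genes in onlyAffy and onlyAlt are the ones worth inspecting probeset by
probeset before committing to follow-up experiments.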

Hoping this helps,

L.

> Mark
>
>
>
> --
> Mark W. Kimpel MD ** Neuroinformatics ** Dept. of Psychiatry
> Indiana University School of Medicine