[BioC] Assigning gene symbols to Affymetrix data and averaging probes

Wed Oct 3 17:30:34 CEST 2012

Hi Lesley,

On 10/3/2012 10:55 AM, Hoyles, Lesley wrote:
>  Hi
>
>  I have processed my affy data and am able to annotate the object
>  mice.loess using the following.
>
>  ID <- featureNames(mice.loess)
>  Symbol <- getSYMBOL(ID,'mouse4302.db')
>  fData(mice.loess) <- data.frame(ID=ID,Symbol=Symbol)
>
>
>  However, when I convert my object as follows - expr.loess <-
>  exprs(mice.loess) - I lose the annotation and have been unable to
>  find a way to annotate expr.loess. Please could anybody suggest how I
>  can annotate expr.loess?
expr.loess <- data.frame(ID = ID, Symbol = Symbol, exprs(mice.loess))
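As a minimal sketch of why this works (toy data only; the probeset IDs, symbols and values are made up, and nothing beyond base R is used): exprs() returns a plain matrix whose row names are the probeset IDs, so the annotation can be bound on as columns. Indexing Symbol by ID is an extra precaution I am assuming here, to keep the annotation in the same row order as the matrix.

```r
## Toy stand-in for exprs(mice.loess): rows are probesets, columns are arrays.
## The IDs, symbols and values are made up for illustration.
mat <- matrix(c(7.1, 8.2, 5.5, 7.3, 8.0, 5.9), nrow = 3,
              dimnames = list(c("p1_at", "p2_at", "p3_at"), c("s1", "s2")))
ID <- rownames(mat)

## Named character vector, shaped like what getSYMBOL() returns
Symbol <- c(p1_at = "GeneA", p2_at = "GeneA", p3_at = NA)

## Index Symbol by ID so the annotation follows the row order of the matrix
annotated <- data.frame(ID = ID, Symbol = Symbol[ID], mat,
                        row.names = NULL, stringsAsFactors = FALSE)
```

The result has the ID and Symbol columns first, followed by one column per array, which is the layout the averaging step further down relies on.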

>
>
>  Is there a way of averaging probes for each gene with Affymetrix
>  data? I've been able to do this with single-channel Agilent data
>  using the example given in the limma guide.

There are probably two reasonable ways to do this. First, the easiest.

dat <- ReadAffy(cdfname = "mouse4302mmentrezgcdf")

and proceed from there. This will use the MBNI re-mapped CDF package 
based on Entrez Gene IDs, and you will have a single value per gene 
after summarization. There are other ways to map the probes; see 
http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/CDF_download.asp 
at the bottom of the page for more info.

Alternatively if you want to stick with the original probesets, the 
problem arises that some probesets are not well annotated, so what to do 
with those? In addition, gene symbols are not guaranteed to be unique, 
so you can't just assume that they are. Entrez Gene and UniGene IDs are 
supposed to be unique, so you could go with them, doing something like 
(untested)

library(mouse4302.db)
gns <- toTable(mouse4302ENTREZID)
## expr.loess is the data.frame I suggest above
alldat <- merge(gns, expr.loess, by = 1)
alldatlst <- split(alldat, alldat$gene_id)
## keep the first row's annotation, average the expression columns
combined.data <- do.call("rbind", lapply(alldatlst, function(x)
    data.frame(x[1, 1:3], t(colMeans(x[, -(1:3), drop = FALSE])))))

Here I am assuming that after the merge() step the first three columns 
are the probeset ID, gene_id, symbol, and the remaining columns are the 
expression values. You will lose all data for which there isn't an 
Entrez Gene ID, but the same is true of the MBNI method I outline above.
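To make the per-gene averaging concrete, here is a small self-contained sketch on made-up data, using only base R. The table below stands in for the merged data.frame; the column names s1 and s2, the IDs and the values are all hypothetical, and aggregate() is just one of several ways to do the grouping.

```r
## Toy version of the merged table: probeset ID, Entrez Gene ID, symbol,
## then one column per array. All values are made up for illustration.
alldat <- data.frame(
    probe_id = c("p1_at", "p2_at", "p3_at"),
    gene_id  = c("101", "101", "202"),
    Symbol   = c("GeneA", "GeneA", "GeneB"),
    s1 = c(2, 4, 5),
    s2 = c(6, 8, 1),
    stringsAsFactors = FALSE)

## Average the expression columns within each Entrez Gene ID
avg <- aggregate(alldat[, c("s1", "s2")],
                 by = list(gene_id = alldat$gene_id), FUN = mean)

## Keep one symbol per gene (the first one seen) and attach it
sym <- alldat[!duplicated(alldat$gene_id), c("gene_id", "Symbol")]
combined <- merge(sym, avg, by = "gene_id")
```

Here the two probesets for gene 101 collapse to a single row holding their column means, and gene 202 passes through unchanged. The probeset ID is dropped on purpose, since after averaging a row no longer corresponds to any single probeset.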

Best,

Jim

>
>
>  Thanks in advance for your help.
>
>  Best wishes Lesley

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099