[BioC] raw probe set expression data and masking questions

Mon Sep 11 17:38:10 CEST 2006

Hi Don,

I understand now - you're talking about "non-normalized" data; "raw" to me 
(& most people, I except) means untouched in any way. If you're using 
expresso() to do the pre-processing, you can set normalize=FALSE to skip 
any normalization, but do whatever background correction and summarization 
you want. I did just notice that if you use expresso with widget=TRUE to 
get the pop-up boxes, there's NOT an option to turn normalization off ; 
Rafa - is this something that should be changed?

BTW - almost everyone, including Affymetrix, agrees that the average of the 
PM-MM values (even using modified MM values, such as in the mas5 algorithm) 
is not the best summary measure because individual probes do differ in 
their binding efficiencies, such that they are not expected to have the 
same fluorescence level from the same transcript level.

Jenny

At 05:25 PM 9/8/2006, you wrote:
>By "raw" I was referring to expression summaries (such as average 
>difference) for a probe set (not individual probes) where the data have 
>not been scaled or normalized or adjusted in any way. So for each probe 
>set on each chip, I want the average of the PM-MM values that are in the 
>.CEL file prior to any chip to chip adjustments. Is that more clear?
>
>And thanks for the code on probe or probeset removal.
>
>Cheers,
>Don
>
>________________________________
>
>From: Jenny Drnevich [mailto:drnevich at uiuc.edu]
>Sent: Fri 9/8/2006 11:01 AM
>To: Schwartz, Donald; bioconductor at stat.math.ethz.ch
>Subject: Re: [BioC] raw probe set expression data and masking questions
>
>
>
>Hi Donald,
>
>At 07:42 AM 9/8/2006, Schwartz, Donald wrote:
> >Anyone know a simple method to harvest the raw probe set level expression
> >data (not individual PM probe or PM-MM probe pair level data), which will
> >be summarized by average difference, and then export that into excel with
> >the Affy ID or "probe set name" as row headers and columns representing
> >samples?
>
>I don't understand what you mean by 'raw probe set level expression'? The
>only raw values are the individual PM and MM probe values. There are
>various algorithms to combine them into a probe set level expression value,
>but by definition these are 'processed values', not 'raw' values.
>
> >  Second question, I have a small affy custom array. I want to select
> > certain probe sets for analysis prior to any low-level processing (i.e.,
> > prior to normalization and background correction, etc.). In other words,
> > I want to mask certain probe sets before normalization and background
> > correction. Following masking, the probe sets will be ignored for the
> > entire analysis. The difficulty is that there is no consistent naming
> > feature across all probe sets so the only way to identify probe sets to
> > mask or to analyze is with a character vector. In addition, I need the
> > flexibility to mask different sets of probe sets and then process
> > (normalize, bg.correct, etc) the data various ways. I'll use:
> > expresso(affybatch, widget=TRUE) to do this post masking.
> >
> >Thanks for your suggestions.
> >Don
>
>Below is some code to remove either individual probes or entire probesets.
>It was originally written by Ariel Chernomoretz and posted to the mailing
>list, and I have subsequently modified it slightly. I've tried to give
>explanations throughout - perhaps at some point I'll have time to make it
>into a small package. There's a chance it might not work with a custom affy
>array, as it assumes you have a cdf and a probe package for your chip named
>in the usual way: 'chipnamecdf' and 'chipnameprobe'. If you only want to
>remove entire probe sets, it might work even if you don't have a probe
>package for your custom array.
>
>Good luck,
>Jenny
>
>
>### The first part is just creating two ojects (ResetEnvir and
>RemoveProbes) originally
>### written by Ariel Chernomoretz and modified by Jenny Drnevich to remove
>individual
>### probes and/or entire probesets. Just highlight everything from here until
>### you see STOP and paste it to R all at once
>
>ResetEnvir<-function(cleancdf){
>   cdfpackagename   <- paste(cleancdf,"cdf",sep="")
>   probepackagename <- paste(cleancdf,"probe",sep="")
>   ll<-search()
>   cdfpackagepos <- grep(cdfpackagename,ll)
>   if(length(cdfpackagepos)>0) detach(pos=cdfpackagepos)
>   ll<-search()
>   probepackagepos <- grep(probepackagename,ll)
>   if(length(probepackagepos)>0) detach(pos=probepackagepos)
>   require(cdfpackagename,character.only=T)
>   require(probepackagename,character.only=T)
>   require(affy)
>}
>
>RemoveProbes<-function(listOutProbes=NULL,
>                         listOutProbeSets=NULL,
>
>cleancdf,destructive=TRUE){
>
>
>   #default probe dataset values
>   cdfpackagename   <- paste(cleancdf,"cdf",sep="")
>   probepackagename <- paste(cleancdf,"probe",sep="")
>   require(cdfpackagename,character.only = TRUE)
>   require(probepackagename,character.only = TRUE)
>   probe.env.orig <- get(probepackagename)
>
>
>   if(!is.null(listOutProbes)){
>    # taking probes out from CDF env
>    probes<- unlist(lapply(listOutProbes,function(x){
>                             a<-strsplit(x,"at")
>                             aux1<-paste(a[[1]][1],"at",sep="")
>                             aux2<-as.integer(a[[1]][2])
>                             c(aux1,aux2)
>                            }))
>    n1<-as.character(probes[seq(1,(length(probes)/2))*2-1])
>    n2<-as.integer(probes[seq(1,(length(probes)/2))*2])
>    probes<-data.frame(I(n1),n2)
>    probes[,1]<-as.character(probes[,1])
>    probes[,2]<-as.integer(probes[,2])
>    pset<-unique(probes[,1])
>    for(i in seq(along=pset)){
>     ii  <-grep(pset[i],probes[,1])
>     iout<-probes[ii,2]
>     a<-get(pset[i],env=get(cdfpackagename))
>     a<-a[-iout,]
>     assign(pset[i],a,env=get(cdfpackagename))
>    }
>   }
>
>
>   # taking probesets out from CDF env
>   if(!is.null(listOutProbeSets)){
>    rm(list=listOutProbeSets,envir=get(cdfpackagename))
>   }
>
>
>   # setting the PROBE env accordingly (idea from gcrma compute.affinities.R)
>   tmp <- get("xy2i",paste("package:",cdfpackagename,sep=""))
>   newAB   <- new("AffyBatch",cdfName=cleancdf)
>   pmIndex <-  unlist(indexProbes(newAB,"pm"))
>   subIndex<- match(tmp(probe.env.orig$x,probe.env.orig$y),pmIndex)
>   rm(newAB)
>   iNA     <- which(is.na(subIndex))
>
>
>   if(length(iNA)>0){
>    ipos<-grep(probepackagename,search())
>    assign(probepackagename,probe.env.orig[-iNA,],pos=ipos)
>   }
>}
>
>### STOP HERE!!!! PASTE THE ABOVE INTO R AND CHECK TO SEE YOU HAVE THE TWO
>OBJECTS
>### (ResetEnvir and RemoveProbes) IN YOUR WORKSPACE WITH ls()
>
># All you need now is your affybatch object, and a character vector of
>probe set names
># and/or another vector of individual probes that you want to remove. If
>your affybatch
># object is called 'rawdata' and the vector of probesets is 'maskedprobes',
>follow
># these steps:
>
>cleancdf <- cleancdfname(rawdata at cdfName,addcdf=FALSE)
>
># Make sure you are starting with the original cdf with all the probes and
>probesets.
>
>ResetEnvir(cleancdf)
>
># Double-check to make sure all probesets are present in your affybatch by
>typing in
># the name of your affybatch and looking at the output.
>
>rawdata
>
># To remove some probe sets (but not individual probes in this example), use:
>RemoveProbes(listOutProbes=NULL, listOutProbeSets=maskedprobes, cleancdf)
>
># The cdf file will be temporarily modified to mask the indicated probesets
>& probes,
># which you can check by typing in the name of your affybatch again and
>seeing that
># the number of probesets have decreased. The masking can be undone by
>using ResetEnvir
># as above, or by quitting the session. However, any Expression Set objects
>created
># when the cdf is modified will have the masked probesets removed
>permanently because
># they do not refer to the cdf like an affybatch object does.
>
>
>Jenny Drnevich, Ph.D.
>
>Functional Genomics Bioinformatics Specialist
>W.M. Keck Center for Comparative and Functional Genomics
>Roy J. Carver Biotechnology Center
>University of Illinois, Urbana-Champaign
>
>330 ERML
>1201 W. Gregory Dr.
>Urbana, IL 61801
>USA
>
>ph: 217-244-7355
>fax: 217-265-5066
>e-mail: drnevich at uiuc.edu
>
>
>
>
>
>
>------
>Confidentiality Notice: This email message, including any attachments, is 
>for the sole use of the intended recipient(s) and may contain confidential 
>and/or privileged information.  If you are not the intended recipient(s), 
>you are hereby notified that any dissemination, unauthorized review, use, 
>disclosure, copying, or distribution of the contents of this email and/or 
>any materials contained in any attachments is prohibited.  If you have 
>received this message in error, or are not the intended recipient(s), 
>please immediately notify the sender by email and destroy all copies of 
>the original message, including attachments.

Jenny Drnevich, Ph.D.

Functional Genomics Bioinformatics Specialist
W.M. Keck Center for Comparative and Functional Genomics
Roy J. Carver Biotechnology Center
University of Illinois, Urbana-Champaign

330 ERML
1201 W. Gregory Dr.
Urbana, IL 61801
USA

ph: 217-244-7355
fax: 217-265-5066
e-mail: drnevich at uiuc.edu