[BioC] [SPAM???] Re: Masking single Oligos in mas5

Tue Dec 6 17:10:18 CET 2005

Hi Benjamin

I am sorry you have problems with the code.

Regarding the computation time, yes. It is slow even with small/mid sized 
lists, so in your case I would expect a loooot of processing time (more RAM 
would  help also ;-) ).  I guess that to improve this, the function should be 
completely rewritten using perhaps a different approach.

However, once  you succed in generating a modified cdf environment, and if you 
are planning to compute expressions many times having the same probes 
removed, you should keep (save) the modified cdf environment as an R object. 
I wrote a function (RemoveProbes2, included at the end) that takes the 
clean.cdf.environment as an input argument. So everything should be faster 
for subsequent calculations.

Unfortunately I am not a lot at the office these days. My postdoc is about to 
end here, and I am moving in a couple of weeks, so things are a little bit 
caotic this days! However I will check your code asap.

quick observations/questions:
1) are you removing entire probesets specifiyng the whole set of probes via 
the rmd$Probes list AND the rmd$Set? 
2) if you plan to use presence calls, be cautious. After the probe removal 
step you end up with probeset with different number of probes so be carefull 
with  the  significance levels of the Wilcoxon'rank test

Regards, 
Ariel./

################################################
#       clean.cdf: a modified cdf environment. Should be generated
#                         using RemoveProbes and saving the modified 
#                         environment afterwards
#       cdfpackagename:   (e.g. "hgu95av2cdf")
#       probepackagename: (e.g. "hgu95av2probe")
#       destructive: unimplemented option, see NOTE
RemoveProbes2<-function(clean.cdfenv    =NULL,
                        cdfpackagename,probepackagename,destructive=TRUE){

 #default probe dataset values
 probe.env.orig <- get(probepackagename)

 if(is.null(clean.cdfenv)){
  cat("A cdfenv should be provided\n")\
  return();
 }else{ #clean.cdf provided. much, much faster!
  ipos<-grep(cdfpackagename,search())
  rm(list=c(cdfpackagename),pos=ipos)
  assign(cdfpackagename,clean.cdfenv,pos=ipos)
 }

 # setting the PROBE env accordingly (idea from gcrma compute.affinities.R)
 tmp <- get("xy2i",paste("package:",cdfpackagename,sep=""))
 newAB   <- new("AffyBatch",cdfName=cleancdf)
 pmIndex <-  unlist(indexProbes(newAB,"pm"))
 subIndex<- match(tmp(probe.env.orig$x,probe.env.orig$y),pmIndex)
 rm(newAB)
 iNA     <- which(is.na(subIndex))

 if(length(iNA)>0){
  ipos<-grep(probepackagename,search())
  assign(probepackagename,probe.env.orig[-iNA,],pos=ipos)
 }
}

###############################################

On December 6, 2005 04:07 am, Benjamin Otto wrote:
> Hi folks,
>
> sorry for taking your time again. I let the computer finish the
> "removeProbes"-calculation over night, so I don't know how long he
> took...to finally display the following error message:
>
> "Error in ans[[i]][, i.probes] : incorrect number of dimensions"
>
> You can imagine how desperately I was looking for a shotgun that moment. To
> provide maybe more information I forgot to submit in my last thread here
> come the commands and extracts of my lists / structures I used:
>
> $# libraries used
> $> library(simpleaffy)
> $> library(affydata)
> $> require(gcrma)
> $
> $# preliminary commands
> $> setwd ("....")
> $> x <- read.affy()
> $> x at cdfName <- "HGU133Plus2"
> $> cleancdf <- cleancdfname(x at cdfName,addcdf=FALSE)
> $> cdfpackagename <- paste(cleancdf,"cdf",sep="")
> $> probepackagename <- paste(cleancdf,"probe",sep="")
> $> ResetEnvir(cdfpackagename,probepackagename)
> $
> $# my function for indentification of unwanted oligos
> $# I will append the function code below
> $>  rmd <- get.foul.oligos(x, ratio=2, diff=100)
> $
> $# the initial rma/mas calculation without changes
> $> d.rma <- rma(x)
> $> d.gcrma <- gcrma(x)
> $> d.mas <- mas5(x)
> $> d.mas.call <- mas5calls(x)
> $
> $# The command for invoking the removeProbes function
> $> RemoveProbes(rmd$Probes,rmd$Set,cdfpackagename,probepackagename)
>
>
> The probe list "rmd$Probes" contains 565825 entries, the set list "rmd$Set"
> 32791 entries. A little structure check:
>
> $> rmd$Set[1:5]
> $[1] "1007_s_at" "1053_at"   "1255_g_at" "1405_i_at" "1438_at"
> $> rmd$Probes[1:5]
> $[1] "1007_s_at1" "1007_s_at2" "1007_s_at3" "1007_s_at4" "1007_s_at5"
>
> Looks fine to me, but is it? What could cause the error message here ?
>
> Here comes the code of my function:
>
>
> get.foul.oligos <- function(x,ratio=1,diff=150,verbose=FALSE) {
>
> 	remove <- list()
> 	remove$Probes <- list()
> 	remove$Set <- list()
> 	len.plist <- 0
> 	len.slist <- 0
>
>
> 	ids <- geneNames(x)
> 	num.ids <- length(ids)
> 	num.samples <- length(pm(x)[1,])
> 	msk.table <- matrix(NA,num.ids,2)
> 	rm.string <- NA
>
> 	for (i in 1:num.ids) {
> 		id <- ids[i]
> 		oligos.pm <- indexProbes(x, which="pm", genenames=id)[[id]]
> 		oligos.mm <- indexProbes(x, which="mm", genenames=id)[[id]]
> 		num.oligos <- length(oligos.pm)
> 		num.rm <- 0
> 		for (j in 1:num.oligos) {
> 			cat ("i:",i," j:",j,"\n")
> 			rm.flag <- TRUE
> 			for (k in 1:num.samples) {
> 				val.pm <- intensity(x)[oligos.pm[j],k]
> 				val.mm <- intensity(x)[oligos.mm[j],k]
> 				if (val.pm/val.mm > ratio & val.pm-val.mm > diff) {
> 					rm.flag <- FALSE
> 				}
> 			}
> 			if (rm.flag == TRUE) {
> 				num.rm <- num.rm + 1
> 				oligo.name <- paste(id,j,sep="")
> 				if (len.plist == 1) {
> 					remove$Probes <- c(remove$Probes,oligo.name)
> 				}
> 				else { remove$Probes <- c(oligo.name) }
> 				len.plist <- 1
>
> 				if (is.na(rm.string)) { rm.string <- j }
> 				else { rm.string <- paste(rm.string,j,sep=",") }
> 			}
> 		}
> 		if (num.rm == num.oligos) {
> 			if (len.slist == 1) {
> 				remove$Set <- c(remove$Set,id)
> 			}
> 			else { remove$Set <- c(id) }
> 			len.slist <- 1
>
> 			rm.string <- "all"
> 		}
> 	}
> 	return(remove)
> }
>
> > -----Original Message-----
> > From: bioconductor-bounces at stat.math.ethz.ch
> > [mailto:bioconductor-bounces at stat.math.ethz.ch]On Behalf Of Benjamin
> > Otto
> > Sent: 05 December 2005 17:14
> > To: bioconductor at stat.math.ethz.ch
> > Subject: Re: [BioC] Masking single Oligos in mas5
> >
> >
> > Hi Ariel and Jenny,
> >
> > thanks very much for the quick reply!!! I have been experimenting
> > since with
> > the functions and it really seems to be what I'm looking for. A small
> > test with your test_script Ariel returned what I hoped for. However
> > there is one
> > aspect confusing me. After writing a little function myself,
> > which computes
> > the probe and probeset list which should be excluded in my case I applied
> > the removeProbes function on these lists. While I'm writing this
> > message my
> > computer is still calculating and he has started four hours ago! Is it
> > normal for the function to be so time consuming or is there some
> > chance that
> > I just have done something wrong? I suppose the problem is on my
> > side but I
> > just can't imagine what it could be because I followed the "testscript"
> > steps. But as the reading of the CEL files usually needs just a
> > few minutes
> > (maybe five if time consuming) I wouldn't have thought the removing of
> > probesets would take as long as it currently does. I must confess my
> > lists are really long. Still the problem is of big importance for me and
> > so it would be great to get an idea of your experience with the function
> > perfomance on your systems. Here is a little summary of the
> > conditions in my
> > case:
> >
> > Chiptype: HG_U133_Plus2
> > num. samples: 2
> > num. Probes in listOutProbes: ~ 500.000
> > num. Sets in listOutProbeSets: ~ 30.000
> > System: P4 - 3 GHZ, 500 MB RAM,
> > R-version: RGUI 2.2.0 under WindowsXP.
> >
> > Mainly I have been wondering that the removing of probes from the
> > structure
> > takes so much longer than reading in the CEL files and building the
> > structure. Has anyone of you maybe started a test before trying to remove
> > all probesets from an environment? Or can you at least judge wether the
> > current running time in my case is ok or looks suspicious?
> >
> > In any case just to make it clear again, thanks very much in any case.
> > The functions really seem to be of very, very much help!!! :-)
> >
> > Sincerly,
> > Benjamin
> >
> >
> >
> > -----Original Message-----
> > From: bioconductor-bounces at stat.math.ethz.ch
> > [mailto:bioconductor-bounces at stat.math.ethz.ch]On Behalf Of Ariel
> > Chernomoretz
> > Sent: 01 December 2005 17:32
> > To: bioconductor at stat.math.ethz.ch
> > Subject: Re: [BioC] Masking single Oligos in mas5
> >
> >
> > Hi Benjamin,
> >
> > Some time ago I wrote a couple of functions that modified
> > the cdf environments in order to remove bad probes from the affy analysis
> >
> > You could check:
> >
> > http://files.protsuggest.org/biocond/html/7350.html
> > http://files.protsuggest.org/biocond/html/7366.html
> > http://files.protsuggest.org/biocond/html/7367.html
> >
> >
> > hope that helps.
> >
> > ariel./
> >
> > On December 1, 2005 11:06 am, Benjamin Otto wrote:
> > > Hi,
> > >
> > > I have been searching for nearly two days for a solution to the
> >
> > following
> >
> > > problem without finding satisfactory answers:
> > >
> > > I am working on the analysis of a HG-U133_Plus2 Chip. Can I mask for
> > > certain probesets single Oligos such that the expression, p and fold
> >
> > change
> >
> > > values are calculated based on the remaining oligos?
> > >
> > > A better description of my problem and the background. We are handling
> > > a cross-species experiment having hybradized rna from tupaia on the
> > > human chip. This resulted in fairly low expression signals. If we just
> > > forget about all the other putative problems in analysing the result I
> > > think it seems reasonable to say, that in many cases only some of the
> > > probeset oligos will have hybridized satisfyingly. So the idea is
> > > masking some of the oligos by some criteria and calculate the results
> > > only based upon subsets of the probesets. The problem is: If I set even
> > > only one single oligo to NA, the values calculated for the
> > > corresponding
> >
> > probeset won't be
> >
> > > calculated but set to NA. Most of the threads I found concerning the
> > > masking problem handle the question of an autpmated or corrected form
> > > of masking. But there seems to be no available information about our
> > > case.
> >
> > Has
> >
> > > anyone done something like that before? I'm sure there will have to be
> >
> > some
> >
> > > manual programing. But the major question is: Does anybody see a
> > > possibility to mask the single oligos on a top level like fixing the
> > > affybatch structure? Or do I have to change every single
> >
> > function to treat
> >
> > > NA values in the correct form?
> > >
> > > thanks for your help,
> > > Benjamin
> > >
> > > _______________________________________________
> > > Bioconductor mailing list
> > > Bioconductor at stat.math.ethz.ch
> > > https://stat.ethz.ch/mailman/listinfo/bioconductor
> >
> > --
> > Ariel Chernomoretz, Ph.D.
> > Centre de recherche du CHUL
> > 2705 Blv Laurier, bloc T-367
> > Sainte-Foy, Qc
> > G1V 4G2
> > (418)-525-4444 ext 46339
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at stat.math.ethz.ch
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at stat.math.ethz.ch
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor

-- 
Ariel Chernomoretz, Ph.D.
Centre de recherche du CHUL
2705 Blv Laurier, bloc T-367
Sainte-Foy, Qc
G1V 4G2
(418)-525-4444 ext 46339