[BioC] [SPAM???] Re: Masking single Oligos in mas5

Tue Dec 6 18:54:11 CET 2005

Hi Ariel,

well I wish you a loooot of luck then for the pre-postdoc time. I hope you
like your new job. :-)

I think I have localized the problem for the function crash although I don't
quite understand it yet. I was just writing this text when your answer
reached me. Thanks!
Back to the crash: I will just give two examples for better explanation.

#Example 1: Include all probes of the probeset 117_at
# with exception of the first probe
# and don't delete the set itself
$> listOutProbes <- c("117_at2","117_at3","117_at4",
$+ "117_at5","117_at6","117_at7","117_at8","117_at9",
$+ "117_at10","117_at11","117_at12","117_at13",
$+ "117_at14","117_at15","117_at16")

$> sets <- NULL
$> ResetEnvir(cdfpackagename,probepackagename)
$> RemoveProbes(probes,sets,cdfpackagename,probepackagename)

#Example 2: Exapmle 1 again but ...
# ... delete the set from the list
$> listOutProbes <- c("117_at2","117_at3","117_at4",
$+ "117_at5","117_at6","117_at7","117_at8","117_at9",
$+ "117_at10","117_at11","117_at12","117_at13",
$+ "117_at14","117_at15","117_at16")

$> sets <- ("117_at")
$> ..
$> ..

ONLY the construction in example number 1!!! returns the error message:

$ Error in ans[[i]][, i.probes] : incorrect number of dimensions

and that when it reaches the command:

$ pmIndex <- unlist(indexProbes(newAB,"pm"))

in function "RemoveProbes". Note that the command "indexProbes(newAB,"pm")"
is causing the problem. That's why I snooped into the code of the method
"indexProbes" and voila ... the problem source is finally the command:

$ tmp <- as.vector(ans[[i]][, i.probes])

where "ans" was created by

$> envir <- getCdfInfo(newAB)
$> genenames <- ls(envir)
$> ans <- mget(genenames, envir, ifnotfound = NA)

NOW: If ans[[i]] contains only ONE value for each of "pm" and "mm" then

$> ans[[i]][,"pm"],
$> ans[[i]][,"mm"]
or ...
$> ans[[i]][,c(1,2)]

causes R to return the error message. I don't know whether the following
observation has to do with that or is rather neglectable:

$ # for the probeset with only one probe left I get:
$ Browse[1]> ans[[i]][1]
$     pm
$1051223

$ # for a probeset with more probes left I get:
$ Browse[1]> ans[[i]][1]
$ [1] 1051223

FINALLY: Sounds as if the bells should ring in ecstasy in my head telling me
what the problem really is, unfortunately I don't understand it yet. Maybe
someone has an idea whether this is some kind of expected problem which
could be corrected by the way "ans" is created. I think I will have to
experiment a little bit more...

regards,
Benjamin

> -----Original Message-----
> From: bioconductor-bounces at stat.math.ethz.ch
> [mailto:bioconductor-bounces at stat.math.ethz.ch]On Behalf Of Ariel
> Chernomoretz
> Sent: 06 December 2005 17:10
> To: bioconductor at stat.math.ethz.ch
> Subject: Re: [BioC] [SPAM???] Re: Masking single Oligos in mas5
>
>
>
>
> Hi Benjamin
>
> I am sorry you have problems with the code.
>
> Regarding the computation time, yes. It is slow even with small/mid sized
> lists, so in your case I would expect a loooot of processing time
> (more RAM
> would  help also ;-) ).  I guess that to improve this, the
> function should be
> completely rewritten using perhaps a different approach.
>
> However, once  you succed in generating a modified cdf
> environment, and if you
> are planning to compute expressions many times having the same probes
> removed, you should keep (save) the modified cdf environment as
> an R object.
> I wrote a function (RemoveProbes2, included at the end) that takes the
> clean.cdf.environment as an input argument. So everything should
> be faster
> for subsequent calculations.
>
>
> Unfortunately I am not a lot at the office these days. My postdoc
> is about to
> end here, and I am moving in a couple of weeks, so things are a
> little bit
> caotic this days! However I will check your code asap.
>
> quick observations/questions:
> 1) are you removing entire probesets specifiyng the whole set of
> probes via
> the rmd$Probes list AND the rmd$Set?
> 2) if you plan to use presence calls, be cautious. After the
> probe removal
> step you end up with probeset with different number of probes so
> be carefull
> with  the  significance levels of the Wilcoxon'rank test
>
>
>
> Regards,
> Ariel./
>
>
>
>
> ################################################
> #       clean.cdf: a modified cdf environment. Should be generated
> #                         using RemoveProbes and saving the modified
> #                         environment afterwards
> #       cdfpackagename:   (e.g. "hgu95av2cdf")
> #       probepackagename: (e.g. "hgu95av2probe")
> #       destructive: unimplemented option, see NOTE
> RemoveProbes2<-function(clean.cdfenv    =NULL,
>                         cdfpackagename,probepackagename,destructive=TRUE){
>
>  #default probe dataset values
>  probe.env.orig <- get(probepackagename)
>
>
>
>  if(is.null(clean.cdfenv)){
>   cat("A cdfenv should be provided\n")\
>   return();
>  }else{ #clean.cdf provided. much, much faster!
>   ipos<-grep(cdfpackagename,search())
>   rm(list=c(cdfpackagename),pos=ipos)
>   assign(cdfpackagename,clean.cdfenv,pos=ipos)
>  }
>
>
>  # setting the PROBE env accordingly (idea from gcrma
> compute.affinities.R)
>  tmp <- get("xy2i",paste("package:",cdfpackagename,sep=""))
>  newAB   <- new("AffyBatch",cdfName=cleancdf)
>  pmIndex <-  unlist(indexProbes(newAB,"pm"))
>  subIndex<- match(tmp(probe.env.orig$x,probe.env.orig$y),pmIndex)
>  rm(newAB)
>  iNA     <- which(is.na(subIndex))
>
>  if(length(iNA)>0){
>   ipos<-grep(probepackagename,search())
>   assign(probepackagename,probe.env.orig[-iNA,],pos=ipos)
>  }
> }
>
> ###############################################
>
>
>
> On December 6, 2005 04:07 am, Benjamin Otto wrote:
> > Hi folks,
> >
> > sorry for taking your time again. I let the computer finish the
> > "removeProbes"-calculation over night, so I don't know how long he
> > took...to finally display the following error message:
> >
> > "Error in ans[[i]][, i.probes] : incorrect number of dimensions"
> >
> > You can imagine how desperately I was looking for a shotgun
> that moment. To
> > provide maybe more information I forgot to submit in my last thread here
> > come the commands and extracts of my lists / structures I used:
> >
> > $# libraries used
> > $> library(simpleaffy)
> > $> library(affydata)
> > $> require(gcrma)
> > $
> > $# preliminary commands
> > $> setwd ("....")
> > $> x <- read.affy()
> > $> x at cdfName <- "HGU133Plus2"
> > $> cleancdf <- cleancdfname(x at cdfName,addcdf=FALSE)
> > $> cdfpackagename <- paste(cleancdf,"cdf",sep="")
> > $> probepackagename <- paste(cleancdf,"probe",sep="")
> > $> ResetEnvir(cdfpackagename,probepackagename)
> > $
> > $# my function for indentification of unwanted oligos
> > $# I will append the function code below
> > $>  rmd <- get.foul.oligos(x, ratio=2, diff=100)
> > $
> > $# the initial rma/mas calculation without changes
> > $> d.rma <- rma(x)
> > $> d.gcrma <- gcrma(x)
> > $> d.mas <- mas5(x)
> > $> d.mas.call <- mas5calls(x)
> > $
> > $# The command for invoking the removeProbes function
> > $> RemoveProbes(rmd$Probes,rmd$Set,cdfpackagename,probepackagename)
> >
> >
> > The probe list "rmd$Probes" contains 565825 entries, the set
> list "rmd$Set"
> > 32791 entries. A little structure check:
> >
> > $> rmd$Set[1:5]
> > $[1] "1007_s_at" "1053_at"   "1255_g_at" "1405_i_at" "1438_at"
> > $> rmd$Probes[1:5]
> > $[1] "1007_s_at1" "1007_s_at2" "1007_s_at3" "1007_s_at4" "1007_s_at5"
> >
> > Looks fine to me, but is it? What could cause the error message here ?
> >
> > Here comes the code of my function:
> >
> >
> > get.foul.oligos <- function(x,ratio=1,diff=150,verbose=FALSE) {
> >
> > 	remove <- list()
> > 	remove$Probes <- list()
> > 	remove$Set <- list()
> > 	len.plist <- 0
> > 	len.slist <- 0
> >
> >
> > 	ids <- geneNames(x)
> > 	num.ids <- length(ids)
> > 	num.samples <- length(pm(x)[1,])
> > 	msk.table <- matrix(NA,num.ids,2)
> > 	rm.string <- NA
> >
> > 	for (i in 1:num.ids) {
> > 		id <- ids[i]
> > 		oligos.pm <- indexProbes(x, which="pm", genenames=id)[[id]]
> > 		oligos.mm <- indexProbes(x, which="mm", genenames=id)[[id]]
> > 		num.oligos <- length(oligos.pm)
> > 		num.rm <- 0
> > 		for (j in 1:num.oligos) {
> > 			cat ("i:",i," j:",j,"\n")
> > 			rm.flag <- TRUE
> > 			for (k in 1:num.samples) {
> > 				val.pm <- intensity(x)[oligos.pm[j],k]
> > 				val.mm <- intensity(x)[oligos.mm[j],k]
> > 				if (val.pm/val.mm > ratio &
> val.pm-val.mm > diff) {
> > 					rm.flag <- FALSE
> > 				}
> > 			}
> > 			if (rm.flag == TRUE) {
> > 				num.rm <- num.rm + 1
> > 				oligo.name <- paste(id,j,sep="")
> > 				if (len.plist == 1) {
> > 					remove$Probes <-
> c(remove$Probes,oligo.name)
> > 				}
> > 				else { remove$Probes <- c(oligo.name) }
> > 				len.plist <- 1
> >
> > 				if (is.na(rm.string)) { rm.string <- j }
> > 				else { rm.string <-
> paste(rm.string,j,sep=",") }
> > 			}
> > 		}
> > 		if (num.rm == num.oligos) {
> > 			if (len.slist == 1) {
> > 				remove$Set <- c(remove$Set,id)
> > 			}
> > 			else { remove$Set <- c(id) }
> > 			len.slist <- 1
> >
> > 			rm.string <- "all"
> > 		}
> > 	}
> > 	return(remove)
> > }
> >
> > > -----Original Message-----
> > > From: bioconductor-bounces at stat.math.ethz.ch
> > > [mailto:bioconductor-bounces at stat.math.ethz.ch]On Behalf Of Benjamin
> > > Otto
> > > Sent: 05 December 2005 17:14
> > > To: bioconductor at stat.math.ethz.ch
> > > Subject: Re: [BioC] Masking single Oligos in mas5
> > >
> > >
> > > Hi Ariel and Jenny,
> > >
> > > thanks very much for the quick reply!!! I have been experimenting
> > > since with
> > > the functions and it really seems to be what I'm looking for. A small
> > > test with your test_script Ariel returned what I hoped for. However
> > > there is one
> > > aspect confusing me. After writing a little function myself,
> > > which computes
> > > the probe and probeset list which should be excluded in my
> case I applied
> > > the removeProbes function on these lists. While I'm writing this
> > > message my
> > > computer is still calculating and he has started four hours ago! Is it
> > > normal for the function to be so time consuming or is there some
> > > chance that
> > > I just have done something wrong? I suppose the problem is on my
> > > side but I
> > > just can't imagine what it could be because I followed the
> "testscript"
> > > steps. But as the reading of the CEL files usually needs just a
> > > few minutes
> > > (maybe five if time consuming) I wouldn't have thought the removing of
> > > probesets would take as long as it currently does. I must confess my
> > > lists are really long. Still the problem is of big importance
> for me and
> > > so it would be great to get an idea of your experience with
> the function
> > > perfomance on your systems. Here is a little summary of the
> > > conditions in my
> > > case:
> > >
> > > Chiptype: HG_U133_Plus2
> > > num. samples: 2
> > > num. Probes in listOutProbes: ~ 500.000
> > > num. Sets in listOutProbeSets: ~ 30.000
> > > System: P4 - 3 GHZ, 500 MB RAM,
> > > R-version: RGUI 2.2.0 under WindowsXP.
> > >
> > > Mainly I have been wondering that the removing of probes from the
> > > structure
> > > takes so much longer than reading in the CEL files and building the
> > > structure. Has anyone of you maybe started a test before
> trying to remove
> > > all probesets from an environment? Or can you at least judge
> wether the
> > > current running time in my case is ok or looks suspicious?
> > >
> > > In any case just to make it clear again, thanks very much in any case.
> > > The functions really seem to be of very, very much help!!! :-)
> > >
> > > Sincerly,
> > > Benjamin
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: bioconductor-bounces at stat.math.ethz.ch
> > > [mailto:bioconductor-bounces at stat.math.ethz.ch]On Behalf Of Ariel
> > > Chernomoretz
> > > Sent: 01 December 2005 17:32
> > > To: bioconductor at stat.math.ethz.ch
> > > Subject: Re: [BioC] Masking single Oligos in mas5
> > >
> > >
> > > Hi Benjamin,
> > >
> > > Some time ago I wrote a couple of functions that modified
> > > the cdf environments in order to remove bad probes from the
> affy analysis
> > >
> > > You could check:
> > >
> > > http://files.protsuggest.org/biocond/html/7350.html
> > > http://files.protsuggest.org/biocond/html/7366.html
> > > http://files.protsuggest.org/biocond/html/7367.html
> > >
> > >
> > > hope that helps.
> > >
> > > ariel./
> > >
> > > On December 1, 2005 11:06 am, Benjamin Otto wrote:
> > > > Hi,
> > > >
> > > > I have been searching for nearly two days for a solution to the
> > >
> > > following
> > >
> > > > problem without finding satisfactory answers:
> > > >
> > > > I am working on the analysis of a HG-U133_Plus2 Chip. Can I mask for
> > > > certain probesets single Oligos such that the expression, p and fold
> > >
> > > change
> > >
> > > > values are calculated based on the remaining oligos?
> > > >
> > > > A better description of my problem and the background. We
> are handling
> > > > a cross-species experiment having hybradized rna from tupaia on the
> > > > human chip. This resulted in fairly low expression signals.
> If we just
> > > > forget about all the other putative problems in analysing
> the result I
> > > > think it seems reasonable to say, that in many cases only
> some of the
> > > > probeset oligos will have hybridized satisfyingly. So the idea is
> > > > masking some of the oligos by some criteria and calculate
> the results
> > > > only based upon subsets of the probesets. The problem is:
> If I set even
> > > > only one single oligo to NA, the values calculated for the
> > > > corresponding
> > >
> > > probeset won't be
> > >
> > > > calculated but set to NA. Most of the threads I found concerning the
> > > > masking problem handle the question of an autpmated or
> corrected form
> > > > of masking. But there seems to be no available information about our
> > > > case.
> > >
> > > Has
> > >
> > > > anyone done something like that before? I'm sure there will
> have to be
> > >
> > > some
> > >
> > > > manual programing. But the major question is: Does anybody see a
> > > > possibility to mask the single oligos on a top level like fixing the
> > > > affybatch structure? Or do I have to change every single
> > >
> > > function to treat
> > >
> > > > NA values in the correct form?
> > > >
> > > > thanks for your help,
> > > > Benjamin
> > > >
> > > > _______________________________________________
> > > > Bioconductor mailing list
> > > > Bioconductor at stat.math.ethz.ch
> > > > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > >
> > > --
> > > Ariel Chernomoretz, Ph.D.
> > > Centre de recherche du CHUL
> > > 2705 Blv Laurier, bloc T-367
> > > Sainte-Foy, Qc
> > > G1V 4G2
> > > (418)-525-4444 ext 46339
> > >
> > > _______________________________________________
> > > Bioconductor mailing list
> > > Bioconductor at stat.math.ethz.ch
> > > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > >
> > > _______________________________________________
> > > Bioconductor mailing list
> > > Bioconductor at stat.math.ethz.ch
> > > https://stat.ethz.ch/mailman/listinfo/bioconductor
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at stat.math.ethz.ch
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
>
> --
> Ariel Chernomoretz, Ph.D.
> Centre de recherche du CHUL
> 2705 Blv Laurier, bloc T-367
> Sainte-Foy, Qc
> G1V 4G2
> (418)-525-4444 ext 46339
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
>