[BioC] nsFilter and GSEA

Fri Jan 11 21:34:34 CET 2008

Hi Paolo,
   Thanks for doing that, and an apology as I misremembered having 
ported a bug fix back to the release branch.  This is a bug(let) that is 
fixed in the devel version, and that I will shortly port back to release.

   The problem is that the 0.5 default in the call to nsFilter was meant 
to be interpreted as a quantile, not as a value, but the in the 
implementation it was implemented as a value.

   So, my recommendation is to do something like

   rmaIQRs = apply(exprs(eset.mas), 1, IQR)
   med1 = median(rmaIQRs)

   nsFilter(eset.mas, var.cutoff = med1)

  and then something similar for the rma version, and then things should 
line up somewhat better.

   let us know if that is not clear, or if anything else comes up

  Robert

Paolo Innocenti wrote:
> Dear Robert and BioC Mailing list,
> 
> The chips are Affymetrix Drosophila genome 1.0 (annotation drosgenome1).
> I am even more confused: to make sure that was not my fault, I copied 
> the .CEL files in a new directory, started a fresh R session from there 
> and run *just* the following code. Same results:
> 
>  > library(affy)
> Loading required package: Biobase
> Loading required package: tools
> 
> Welcome to Bioconductor
> 
>    Vignettes contain introductory material. To view, type
>    'openVignette()'. To cite Bioconductor, see
>    'citation("Biobase")' and for packages 'citation(pkgname)'.
> 
> Loading required package: affyio
> Loading required package: preprocessCore
>  > mydata <- ReadAffy()
>  > eset.rma <- rma(mydata)
> Background correcting
> Normalizing
> Calculating Expression
>  > eset.mas <- mas5(mydata)
> background correction: mas
> PM/MM correction : mas
> expression values: mas
> background correcting...done.
> 14010 ids to be processed
> |                    |
> |####################|
>  > library(genefilter)
> Loading required package: survival
> Loading required package: splines
>  > eset.rma.f <- nsFilter(eset.rma)
>  > eset.mas.f <- nsFilter(eset.mas)
>  > eset.rma.f
> $eset
> ExpressionSet (storageMode: lockedEnvironment)
> assayData: 171 features, 15 samples
>    element names: exprs
> phenoData
>    sampleNames: dta_2a.CEL, dta_2b.CEL, ..., virgin_4b.CEL  (15 total)
>    varLabels and varMetadata description:
>      sample: arbitrary numbering
> featureData
>    featureNames: 147260_at, 142359_at, ..., 145988_at  (171 total)
>    fvarLabels and fvarMetadata description: none
> experimentData: use 'experimentData(object)'
> Annotation: drosgenome1
> 
> $filter.log
> $filter.log$numDupsRemoved
> [1] 3
> 
> $filter.log$numLowVar
> [1] 13047
> 
> $filter.log$feature.exclude
> [1] 3
> 
> $filter.log$numRemoved.ENTREZID
> [1] 786
> 
> 
>  > eset.mas.f
> $eset
> ExpressionSet (storageMode: lockedEnvironment)
> assayData: 12122 features, 15 samples
>    element names: exprs, se.exprs
> phenoData
>    sampleNames: dta_2a.CEL, dta_2b.CEL, ..., virgin_4b.CEL  (15 total)
>    varLabels and varMetadata description:
>      sample: arbitrary numbering
> featureData
>    featureNames: 153135_at, 154994_at, ..., 152360_at  (12122 total)
>    fvarLabels and fvarMetadata description: none
> experimentData: use 'experimentData(object)'
> Annotation: drosgenome1
> 
> $filter.log
> $filter.log$numDupsRemoved
> [1] 1098
> 
> $filter.log$numLowVar
> [1] 1
> 
> $filter.log$feature.exclude
> [1] 3
> 
> $filter.log$numRemoved.ENTREZID
> [1] 786
> 
> 
>  > sessionInfo()
> R version 2.6.1 (2007-11-26)
> i486-pc-linux-gnu
> 
> locale:
> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
> 
> attached base packages:
> [1] splines   tools     stats     graphics  grDevices utils     datasets
> [8] methods   base
> 
> other attached packages:
> [1] drosgenome1_2.0.1    genefilter_1.16.0    survival_2.34
> [4] drosgenome1cdf_2.0.0 affy_1.16.0          preprocessCore_1.0.0
> [7] affyio_1.6.1         Biobase_1.16.1
> 
> loaded via a namespace (and not attached):
> [1] annotate_1.16.1     AnnotationDbi_1.0.6 DBI_0.2-4
> [4] rcompgen_0.1-17     RSQLite_0.6-4
>  >
> 
> 
> Could be the CEL files that are damaged?
> Thanks,
> best wishes,
> Paolo
> 
> 
> 
> Robert Gentleman wrote:
>> Hi,
>>  It looks like something fairly odd is going on, and that we are not 
>> seeing all of the code that is being run.
>>
>>  What chip are you using?  What is very odd is that in your first 
>> example 1098 "duplicate" probes are found, but in the second run only 3. 
>> Basically this cannot happen (since the probes are the same) and 
>> suggests that some piece of code has manipulated the names, and at that 
>> point I think fairly bad things are going to happen. So this would be 
>> one place to try and fix things.
>>
>>  Second, nsFilter filters by default at the median, so you should retain 
>> about 0.5 of your probe sets. But since you loose so many (you didn't 
>> tell us the chip so I can't be sure) but it looks like all of the values 
>> are corrupt for that example as well.
>>
>>  So, I think that you are looking in the wrong place. Your problem is 
>> probably earlier on.
>>
>>  best wishes
>>    Robert
>>
>>
>> Paolo Innocenti wrote:
>>> Hi again,
>>>
>>> I tried with a different normalisation method, and I was pretty 
>>> surprised by the results:
>>>
>>>  > eset.mas <- mas5(mydata)
>>> background correction: mas
>>> PM/MM correction : mas
>>> expression values: mas
>>> background correcting...done.
>>> 14010 ids to be processed
>>> |                    |
>>> |####################|
>>>  > eset.mas.f <- nsFilter(eset.mas)
>>>  > eset.mas.f$filter.log
>>> $numDupsRemoved
>>> [1] 1098
>>>
>>> $numLowVar
>>> [1] 1
>>>
>>> $feature.exclude
>>> [1] 3
>>>
>>> $numRemoved.ENTREZID
>>> [1] 786
>>>
>>>  > eset.rma <- rma(mydata)
>>> Background correcting
>>> Normalizing
>>> Calculating Expression
>>>  > eset.rma.f <- nsFilter(eset.rma)
>>>  > eset.rma.f$filter.log
>>> $numDupsRemoved
>>> [1] 3
>>>
>>> $numLowVar
>>> [1] 13047
>>>
>>> $feature.exclude
>>> [1] 3
>>>
>>> $numRemoved.ENTREZID
>>> [1] 786
>>>
>>>  > dim(eset.rma.f$eset)
>>> Features  Samples
>>>       171       15
>>>  > dim(eset.mas.f$eset)
>>> Features  Samples
>>>     12122       15
>>>
>>> I don't understand how is it possible. Any suggestion about what to 
>>> do? Should I lower the cutoff for the rma, or that processing method 
>>> doesn't work for my dataset?
>>>
>>> Paolo
>>> PS: I tried also a really low cutoff, but the situation doesn't 
>>> change, unless I choose a cutoff=0.1:
>>>
>>>  > eset.filter <- nsFilter(eset,var.cutoff=0.2)
>>>  > eset.filter$filter.log
>>> $numDupsRemoved
>>> [1] 69
>>>
>>> $numLowVar
>>> [1] 10560
>>>
>>> $feature.exclude
>>> [1] 3
>>>
>>> $numRemoved.ENTREZID
>>> [1] 786
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: 
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> 

-- 
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem at fhcrc.org