[BioC] Non-specific filtering for HyperGeometric/GSEA test

Yuan Hao yuan.hao at cantab.net
Wed May 12 01:30:37 CEST 2010


Dear Wolfgang,

You are absolutely right: I got the same result with your line of code. I
forgot to mention in my last email that I had already noticed the
'quantile' interpretation in "varFilter". I tried turning off the
'filterByQuantile' argument (shown in code three below) in "varFilter",
but still got quite different results compared to code one, which
confused me. Your explanation of "rowIQRs" addresses exactly that
confusion and resolves the question. Thank you very much again!

# code three
> filtered <- varFilter(eset, var.func=IQR, var.cutoff=0.5,
+                       filterByQuantile=FALSE)
> dim(filtered)
Features  Samples
   18634       15

Best wishes,
Yuan



On 11 May 2010, at 23:50, Wolfgang Huber wrote:

Dear Yuan

Have a look at the manual page of "varFilter", which indicates that its
'var.cutoff' argument is interpreted as a quantile of the overall
distribution of variances to be used as the cutoff, whereas in your "code
one" the "cutoff" is interpreted as the actual variance value to be used
as the cutoff.

Try with
 selected <- (Iqr > quantile(Iqr, probs=cutoff))

The result of this should be nearly the same as with "code two".
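
A minimal sketch of the two interpretations of the cutoff, using base R only. The matrix `x` is simulated data standing in for `exprs(eset)`, and the names `selectedAbs` and `selectedQ` are made up for this illustration; none of this is taken from the original data set.

```r
# Simulated stand-in for exprs(eset): 1000 probe sets x 15 samples.
# Each row gets its own spread so that the IQRs vary (hypothetical data).
set.seed(1)
rowSd <- runif(1000, min = 0.1, max = 1)
x <- matrix(rnorm(1000 * 15, sd = rep(rowSd, 15)), nrow = 1000)

cutoff <- 0.5
Iqr <- apply(x, 1, IQR)

# As in "code one": cutoff is an absolute IQR value.
selectedAbs <- Iqr > cutoff

# As in the suggested line: cutoff is a quantile of the IQR distribution,
# so roughly the upper half of the rows is kept for cutoff = 0.5.
selectedQ <- Iqr > quantile(Iqr, probs = cutoff)

sum(selectedAbs)  # depends on the scale of the data
sum(selectedQ)    # about half of the rows, whatever the scale
```

The number of rows kept by the absolute cutoff changes with the scale of the data, while the quantile cutoff keeps a fixed fraction, which is why the two pieces of code can give very different feature counts.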

Why only "nearly"? You are right that "varFilter" does something odd when
"var.func = IQR", namely it calls "rowIQRs", which runs a little bit
faster, but produces a different result; you can verify this by typing
"varFilter" and reading its code. (One might argue that the effort of
understanding what this function does exceeds the effort of doing it from
scratch...)
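
A rough sketch of why the two IQR computations can disagree, using base R only. It assumes that "rowIQRs" takes the difference of two order statistics per row, at positions floor(0.25 * n) and ceiling(0.75 * n); that is an assumption about its implementation and should be checked by typing "rowIQRs" at the prompt. The helper `orderStatIQR` is a hypothetical stand-in written for this illustration, and `x` is simulated data.

```r
# Simulated stand-in for exprs(eset): 1000 probe sets x 15 samples.
set.seed(1)
x <- matrix(rnorm(1000 * 15), nrow = 1000)

# Base R IQR: quantile(v, 3/4) - quantile(v, 1/4), with interpolation
# between neighbouring order statistics.
baseIqr <- apply(x, 1, IQR)

# Hypothetical order-statistic version (ASSUMED to mimic rowIQRs):
# no interpolation, just two sorted values. Needs at least 4 samples.
orderStatIQR <- function(v) {
  s <- sort(v)
  n <- length(s)
  s[ceiling(0.75 * n)] - s[floor(0.25 * n)]
}
osIqr <- apply(x, 1, orderStatIQR)

# With 15 samples the order-statistic version uses the 12th and 3rd
# sorted values, so it is never smaller than the interpolated IQR.
summary(osIqr - baseIqr)

# The same absolute cutoff therefore keeps more rows in the second case.
sum(baseIqr > 0.5)
sum(osIqr > 0.5)
```

Under this assumption, an absolute cutoff of 0.5 would let more probe sets through "varFilter" than through "code one", which goes in the same direction as the feature counts reported in this thread.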

So, both code versions should produce nearly identical results, and the
results of the downstream analysis (GSEA) should not depend sensitively on
this.

	Best wishes
	Wolfgang

On 11/05/10 01:41, Yuan Hao wrote:
Dear list,

May I ask a question about the non-specific filtering used to define the
gene universe for a HyperGeometric/GSEA test?

I have fifteen samples from Affymetrix. To remove probe sets that have
little variation across samples, I evaluated IQR of each probe set across
samples by either of the following two pieces of code:

# code one
cutoff <- 0.5
Iqr <- apply(exprs(eset), 1, IQR)
selected <- (Iqr > cutoff)
filtered <- eset[selected, ]
dim(filtered)
Features  Samples
 11490       15

# code two
library(genefilter)
filtered <- varFilter(eset, var.func=IQR, var.cutoff=0.5,
                      filterByQuantile=TRUE)
dim(filtered)
Features  Samples
 27337       15

I realized that the difference in "filtered" between the two methods may
come from different definitions of IQR. In the first case, the IQR was
computed using the 'quantile' function rather than Tukey's format:
‘IQR(x) = quantile(x,3/4) - quantile(x,1/4)’, which was used in the second
case. I am aware that the number of genes in the gene universe can have a
significant effect on the test result. However, I am not sure which IQR
evaluation method is the better choice for the HyperGeometric/GSEA test.
I would appreciate it very much if you could shed some light on this!
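
A small sketch of how the universe size enters a hypergeometric test, using base R's `phyper`. The two universe sizes are the feature counts reported above; the category size, the number of selected genes, and the overlap are hypothetical numbers chosen only for illustration, and they are held fixed here even though in practice they would change with the filtering as well.

```r
# Hypothetical counts (NOT from the data in this thread):
categorySize <- 200   # genes in the universe annotated to one category
nSelected    <- 500   # genes called interesting
overlap      <- 20    # interesting genes that fall in the category

# P(overlap or more) under the hypergeometric null for a given universe size.
hyperP <- function(universeSize) {
  phyper(overlap - 1,
         m = categorySize,                 # "white balls": category genes
         n = universeSize - categorySize,  # "black balls": all other genes
         k = nSelected,                    # number of draws
         lower.tail = FALSE)
}

pSmall <- hyperP(11490)  # universe size from "code one"
pLarge <- hyperP(27337)  # universe size from "code two"

c(small = pSmall, large = pLarge)
```

With everything else held fixed, the larger universe makes the same overlap look more surprising, so the p-value shrinks; this is one way to see why the definition of the gene universe matters for the test.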

Regards,
Yuan

_______________________________________________
Bioconductor mailing list
Bioconductor at stat.math.ethz.ch
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 


Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber
