[BioC] Non-specific filtering for HyperGeometric/GSEA test

Wolfgang Huber whuber at embl.de
Wed May 12 00:50:54 CEST 2010


Dear Yuan

Have a look at the manual page of "varFilter". With 
filterByQuantile=TRUE, its 'var.cutoff' argument is interpreted as a 
quantile of the overall distribution of variances, and the value at 
that quantile is used as the cutoff; whereas in your "code one" the 
"cutoff" is interpreted as the actual value to be used for the cutoff.

Try with
   selected <- (Iqr > quantile(Iqr, probs=cutoff))

The result of this should be nearly the same as with "code two".
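
A minimal sketch of the two interpretations of the cutoff, using base R 
only. The matrix "x" is simulated and merely stands in for exprs(eset); 
its size and the spread of the row variances are assumptions made for 
illustration, not the data from this thread.

```r
# Simulated stand-in for exprs(eset): 1000 probe sets x 15 samples.
# Each row gets its own standard deviation so the IQRs vary across rows.
set.seed(1)
row_sd <- runif(1000, min = 0.1, max = 1)
x <- matrix(rnorm(1000 * 15, sd = row_sd), nrow = 1000)

cutoff <- 0.5
Iqr <- apply(x, 1, IQR)

# "code one": cutoff is read as an absolute IQR value
selected_abs <- Iqr > cutoff

# suggested fix: cutoff is read as a quantile of the IQR distribution
selected_q <- Iqr > quantile(Iqr, probs = cutoff)

sum(selected_abs)  # depends on the scale of the data
sum(selected_q)    # about half of the rows for cutoff = 0.5
```

With the absolute reading, the number of retained probe sets changes 
whenever the scale of the data changes; with the quantile reading it is 
fixed by "cutoff" alone.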

Why only "nearly"? You are right that "varFilter" does something odd 
when "var.func = IQR", namely it calls "rowIQRs", which runs a little 
bit faster, but produces a different result; you can verify this by 
typing "varFilter" and reading its code. (One might argue that the 
effort of understanding what this function does exceeds the effort of 
doing it from scratch...)
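
A rough sketch of why a faster IQR routine can give slightly different 
numbers. It assumes that the fast version takes the difference of two 
order statistics instead of interpolating between them; the helper 
"order_stat_iqr" is hypothetical and this assumption should be checked 
against the genefilter source before relying on it.

```r
# Simulated data: 200 rows x 15 samples (illustration only)
set.seed(2)
x <- matrix(rnorm(200 * 15), nrow = 200)

# Interpolated IQR, as computed by stats::IQR (quantile type 7 by default)
iqr_interp <- apply(x, 1, IQR)

# Hypothetical helper: IQR taken directly from order statistics.
# ASSUMPTION: this mimics a fast row-wise IQR; it is not genefilter's code.
order_stat_iqr <- function(v) {
  n <- length(v)
  s <- sort(v)
  s[ceiling(0.75 * n)] - s[floor(0.25 * n)]
}
iqr_order <- apply(x, 1, order_stat_iqr)

# The two definitions differ in value but rank the rows very similarly
summary(iqr_order - iqr_interp)
cor(iqr_order, iqr_interp, method = "spearman")
```

Because a quantile-based filter only depends on how the rows are ranked, 
two IQR definitions that rank the rows similarly will select nearly the 
same probe sets.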

So, both code versions should produce nearly identical results, and the 
results of the downstream analysis (GSEA) should not depend sensitively 
on this.

	Best wishes
	Wolfgang

On 11/05/10 01:41, Yuan Hao wrote:
> Dear list,
>
> May I ask a question about the non-specific filtering used for defining a
> gene universe for a HyperGeometric/GSEA test?
>
> I have fifteen samples from Affymetrix. To remove probe sets that have
> little variation across samples, I evaluated IQR of each probe set across
> samples by either of the following two pieces of code:
>
> # code one
>> cutoff <- 0.5
>> Iqr <- apply(exprs(eset), 1, IQR)
>> selected <- (Iqr > cutoff)
>> filtered <- eset[selected, ]
>> dim(filtered)
> Features  Samples
>   11490       15
>
> # code two
>> library(genefilter)
>> filtered <- varFilter(eset, var.func=IQR, var.cutoff=0.5,
>>                       filterByQuantile=TRUE)
>> dim(filtered)
> Features  Samples
>   27337       15
>
> I realized the differences in "filtered" given by the above two methods may
> come from the different definitions of IQR. In the first case, IQR was
> computed by using the 'quantile' function rather than Tukey's format:
> ‘IQR(x) = quantile(x,3/4) - quantile(x,1/4)’, which was used in the second
> case. I am aware that the number of genes in the gene universe has a
> significant effect on the test result. However, I am not sure which IQR
> evaluation method is the better choice for the HyperGeometric/GSEA test.
> It would be very much appreciated if you could shed some light on this!
>
> Regards,
> Yuan

-- 


Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber


