[BioC] Problems with golubEsets dataset

Sun Oct 26 08:54:13 MET 2003

Dear list:
	When I began analysing the golubEsets dataset and applied a simple pre-processing step, I found a strange phenomenon.

	The pre-processing steps follow the suggestion of S. Dudoit et al. (2002, JASA; personal communication with Pablo Tamayo): (i) thresholding: floor of 100 and ceiling of 16000; (ii) filtering: exclusion of genes with max/min <= 5 or (max-min) <= 500, where max and min refer respectively to the maximum and minimum expression levels of a particular gene across mRNA samples; (iii) base 10 logarithmic transformation.
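
	For anyone who wants to reproduce the three steps, here is a minimal sketch in plain Python. It assumes the expression matrix is a list of rows (one row per gene, one value per mRNA sample). The function name `preprocess` and its parameter names are hypothetical, and the filter is read as dropping a gene when either criterion holds, which is an assumption about the intended rule.

```python
import math

def preprocess(expr, floor=100.0, ceiling=16000.0, min_ratio=5.0, min_diff=500.0):
    """Threshold, filter and log10-transform a genes x samples matrix.

    expr is a list of rows; each row holds one gene's expression
    values across samples. Returns the transformed rows that survive
    the filter (hypothetical helper, not part of any package).
    """
    kept = []
    for row in expr:
        # (i) thresholding: clip every value into [floor, ceiling]
        clipped = [min(max(v, floor), ceiling) for v in row]
        hi, lo = max(clipped), min(clipped)
        # (ii) filtering: exclude the gene if max/min <= 5 or max-min <= 500
        # (assumed reading: either condition alone is enough to exclude)
        if hi / lo <= min_ratio or hi - lo <= min_diff:
            continue
        # (iii) base 10 logarithmic transformation
        kept.append([math.log10(v) for v in clipped])
    return kept

# Toy data, not taken from golubEsets: three genes, three samples.
toy = [
    [50, 20000, 300],   # clipped to 100, 16000, 300 and kept
    [100, 120, 150],    # nearly flat, so it is filtered out
    [200, 900, 1100],   # ratio 5.5 and range 900, so it is kept
]
print(len(preprocess(toy)), "genes kept")
```

	Because the floor is applied before the ratio is computed, the minimum is never below 100, so the division is safe with the default settings.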

	If only thresholding is applied, the dataset is summarized by a 7129 x 72 matrix, in which 4260 entries (0.784%) have the value 16000 and 242087 entries (47.164%) have the value 100, for a total of 47.948%.

	If both thresholding and filtering are applied, the dataset is summarized by a 3571 x 72 matrix, in which 987 entries (0.384%) have the value 16000 and 50321 entries (19.572%) have the value 100, for a total of 19.956%.
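
	The counts and percentages above can be tallied with a few lines of code. The sketch below is only an illustration: `boundary_share` is a hypothetical helper, and it assumes it is given the thresholded matrix (before the log transformation) as a list of rows.

```python
def boundary_share(matrix, floor=100.0, ceiling=16000.0):
    """Count entries sitting exactly at the floor or the ceiling.

    matrix is a list of rows of already-thresholded values.
    Returns the counts and their percentage of all entries.
    """
    total = sum(len(row) for row in matrix)
    n_floor = sum(1 for row in matrix for v in row if v == floor)
    n_ceiling = sum(1 for row in matrix for v in row if v == ceiling)
    return {
        "total": total,
        "n_floor": n_floor,
        "n_ceiling": n_ceiling,
        "pct_floor": 100.0 * n_floor / total,
        "pct_ceiling": 100.0 * n_ceiling / total,
        "pct_both": 100.0 * (n_floor + n_ceiling) / total,
    }

# Toy thresholded matrix, not taken from golubEsets.
toy = [[100, 16000, 300], [100, 100, 500]]
print(boundary_share(toy))
```

	Running it once on the thresholded matrix and once on the filtered matrix gives the two summaries side by side.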

	I wonder whether we can get any interesting expression patterns from such a noisy dataset. I have written to the original author of the dataset, but unfortunately he could not give me a good reason. I am writing to the Bioconductor list to see if someone can give me an explanation.

	I look forward to your replies!
 	
				Wang Weiqiang
				cinderole at sina.com
				2003-10-26