[BioC] normalisation + sam

Fri Apr 23 18:58:26 CEST 2010

Hi Mike

I am afraid it does look like a normalisation problem. siggenes is 
right, given the data. Have a look at the heatmap in the PDF file at
   http://www-huber.embl.de/users/whuber/bioc-list/100423/
and the R script in the same directory, which derives from your code below.

	Best wishes
	Wolfgang

Mike Dewar scripsit 23/04/10 17:11:
> Hi Wolfgang,
> 
> To get from the CEL files to a Bioconductor ExpressionSet is quite 
> longwinded and, as it's not using Bioconductor, maybe not something I 
> should abuse this list with. Instead, I've put the expression set I've 
> generated up online at 
> 
> http://www.columbia.edu/~md2954/immgen.data
> 
> Then, to generate the results I'm seeing, I use the code below. I'm 
> pretty new to R, too, so any general hints also welcome. In the code 
> below I'm interested in seeing which genes that have symbols like "CD4" 
> are differentially expressed. Note that if I just use all the genes (and 
> cut out the code that selects "^Cd\\d+" genes) then I still get results 
> implying that my genes are not distributed in the way sam is expecting. 
> 
> To precisize my question then, does the result of SAM, as expressed in 
> the below code, imply that my array data is badly normalised? If not, 
> does it imply some horrific misunderstanding on my part on how these 
> things should work?
> 
> Mike
> 
> library(siggenes)
> library(Biobase)
> # PUT THE FOLDER WHERE YOU STORED immgen.data, i.e.
> #path = "/Users/mike/Data/Immgen/userData/GSE15907"
> path = "."
> # USE THE SEPARATOR APPROPRIATE FOR YOUR SYSTEM (must learn more about this)
> sep = "/"
> # load the data
> filename = "immgen.data"
> load(paste(path,filename,sep=sep)) # loads a ExpressionSet called immgen
> # form a vector which corresponds to the class of each array
> p = pData(immgen)
> cl = colnames(p)[apply(p, 1, which)]
> # remove "phenotype" genes
> # any gene that directly encodes a surface marker is highlighted
> regexp = "^Cd\\d+"
> marker_indices = grep(regexp,fData(immgen)$symbol)
> markers = immgen[marker_indices,]
> # class labels:
> classes <- colnames(pData(markers))
> # find out how many of each class we have
> count <- lapply(pData(markers),sum)
> # chuck out all the experiments with less than 4 examples. This will 
> leave just two classes. 
> to_keep = as.logical(sapply(
> as.data.frame(t(
> # the next line chooses those phenotypes with more than n examples
> # set n to 2 to leave 28 classes 
> n = 3
> pData(markers)[,classes[count>n]] 
> )),
> sum # the sum is for each row, hence the transposition above
> ))
> # now to_keep has a 1 next to each array we want and a 0 to those we don't
> # hence we can use to_keep to index the markers expression set
> markers <- markers[,to_keep]
> # and then chuck out classes we've rejected
> cl <- cl[to_keep]
> # then apply sam to the markers
> sam.out = sam(data=markers,cl=cl,var.equal=FALSE)
> print(sam.out, seq(0.1, 5, 0.1))
> plot(sam.out, 0.5)
> 
> 
> 
> On 22 Apr 2010, at 15:41, Wolfgang Huber wrote:
> 
>> Hi Mike
>>
>> can you provide a reproducible example (R script) and the output of 
>> sessionInfo() for the 2-groups comparison? This would include a 
>> pointer to the 6(?) CEL files on the web.
>>
>> For the 28-classes case, I doubt that hypothesis testing of the null 
>> hypothesis "mean in all classes is the same" is the most useful data 
>> analytic approach. Perhaps you can precisize your question.
>>
>> Best wishes
>> Wolfgang
>>
>>> Hi,
>>> I'm trying to use siggenes - sam() to look at differentially 
>>> expressed genes of data taken from the Immunological Genome project 
>>> (http://www.immgen.org/). A problem with this is that I have to 
>>> perform the preprocessing of the original CEL files as they do not 
>>> make the processed data available on GEO. To do this I'm using the 
>>> aroma suite of packages to perform quantile normalisation of this 
>>> data set (so far 128 arrays) in fixed memory (i.e. my laptop). This 
>>> is a good thing, as it has forced me to learn a little about array 
>>> preprocessing, and a bad thing as I'm new to all this and might be 
>>> going horribly wrong. When it comes time to look for differentially 
>>> expressed genes, I'm using siggenes - sam() and I'm getting some 
>>> strange results. I'm using (what I think would be considered) many 
>>> classes (28), where each class has at least 3 examples, and thus 
>>> throwing out some of the arrays. The results I'm getting look like:
>>> SAM Analysis for the Multi-Class Case with 28 Classes
>>>   Delta    p0    False Called      FDR
>>> 1    0.1 0.014 23355.05  23532 0.013997
>>> 2    0.2 0.014 21805.98  23498 0.013087
>>> 3    0.3 0.014 15923.83  23169 0.009693
>>> 4    0.4 0.014  9637.47  22343 0.006083
>>> 5    0.5 0.014  5527.75  21330 0.003655
>>> 6    0.6 0.014  3060.73  20221 0.002135
>>> 7    0.7 0.014  1703.82  19205 0.001251
>>> 8    0.8 0.014   953.02  18256 0.000736
>>> 9    0.9 0.014   536.81  17382 0.000436
>>> 10   1.0 0.014   307.19  16730 0.000259
>>> 11   1.1 0.014      176  16143 0.000154
>>> which I think implies that many many genes are differentially 
>>> expressed. Using plot(sam.out, 1.2) shows a pretty much vertical line 
>>> starting at the origin, with no genes observably behaving like the 
>>> null model. Even if I only try this on 2 classes, and hence throwing 
>>> out most of the data, I'm still not getting sensible results.
>>> Now I'm hoping that I'm doing something wrong, and that not 16K of my 
>>> genes are differentially expressed. However, I'm having difficulty 
>>> figuring out what it might be. The one striking thing between my data 
>>> set and the golub example set is that golub seems to be zero-mean - 
>>> is this a requirement for sam()?
>>> Any other ideas of what to look for, or what other information I 
>>> could provide to help this question make sense, would be greatly 
>>> appreciated.
>>> Thanks in advance,
>>> Mike Dewar
>>> - - -
>>> Dr Michael Dewar
>>> Postdoctoral Research Scientist Applied Mathematics
>>> Columbia University
>>> http://www.columbia.edu/~md2954/
>>> [[alternative HTML version deleted]]
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch <mailto:Bioconductor at stat.math.ethz.ch>
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: 
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>> -- 
>>
>>
>> Wolfgang Huber
>> EMBL
>> http://www.embl.de/research/units/genome_biology/huber
>>
>>
>>
> 
> - - -
> Dr Michael Dewar
> **Postdoctoral Research Scientist 
> Applied Mathematics
> Columbia University
> http://www.columbia.edu/~md2954/
> 
> 
> 
> 
> 
> 

-- 

Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber