[BioC] Siggenes SAM analysis: log2 transformation and Understanding output

Fri Feb 17 14:46:52 CET 2012

Hi Jim,

Thank you for your thorough answers.

Just to make it absolutely clear: p0 is the chance that a randomly
picked gene is differentially expressed, when picking a gene from the
WHOLE data set?
Secondly, should I be wary of p0 values below/above some threshold? I
am working on multiple data sets, and p0 seems to vary from 0.39-0.80
Lastly, is there an option to specify a FDR threshold? I am running
through multiple data sets, and I would like to automate it, instead
of having to looking at a table for each one individually.

Best,
David

2012/2/16 James W. MacDonald <jmacdon at uw.edu>:
> Hi David,
>
>
> On 2/15/2012 8:30 AM, David Westergaard wrote:
>>
>> Hello,
>>
>> I am currently working on a data set about kiwi consumption for my
>> bachelors project. The data is available at
>> http://www.ebi.ac.uk/arrayexpress/experiments/E-MEXP-2030
>>
>> I'm abit confused as to how to interpret the output parameters,
>> specifically p0. I've run the following code:
>>
>> dataset<- read.table("OAS_RMA.txt",header=TRUE)
>> controls<-
>> cbind(dataset$CEL12.1,dataset$CEL13.1,dataset$CEL23.1,dataset$CEL25.1,dataset$CEL37.1,dataset$CEL59.1,dataset$CEL61.1,dataset$CEL78.1,dataset$CEL9.1,dataset$CEL92.1)
>> experiments<-
>> cbind(dataset$CEL18.1,dataset$CEL21.1,dataset$CEL3.1,dataset$CEL31.1,dataset$CEL46.1,dataset$CEL50.1,dataset$CEL56.1,dataset$CEL57.1,dataset$CEL7.1)
>>
>> library('siggenes')
>> datamatrix<- matrix(cbind(controls,experiments),ncol=19)
>> y<- rep(0,19)
>> y[11:19]<- 1
>> gene_names<- as.character(dataset$Hybridization.REF)
>> sam.obj = sam(datamatrix,y,gene.names=gene_names,rand=12345)
>>
>> Output:
>> AM Analysis for the Two-Class Unpaired Case Assuming Unequal Variances
>>
>>  s0 = 0
>>
>>  Number of permutations: 100
>>
>>  MEAN number of falsely called variables is computed.
>>
>>    Delta    p0    False Called    FDR cutlow cutup   j2    j1
>> 1    0.1 0.634 28335.89  37013 0.4851 -1.058 0.354 9709 27372
>> 2    0.5 0.634 11200.82  21273 0.3336 -2.271 0.910 2447 35850
>> 3    0.9 0.634   249.38   1522 0.1038 -3.374 3.088  541 53695
>> 4    1.3 0.634     9.67    134 0.0457 -4.402 5.577  127 54669
>> 5    1.7 0.634     0.69     20 0.0219 -5.596   Inf   20 54676
>> 6    2.1 0.634        0      1      0 -9.072   Inf    1 54676
>> 7    2.5 0.634        0      1      0 -9.072   Inf    1 54676
>> 8    2.9 0.634        0      1      0 -9.072   Inf    1 54676
>> 9    3.3 0.634        0      1      0 -9.072   Inf    1 54676
>> 10   3.7 0.634        0      0      0   -Inf   Inf    0 54676
>>
>> I'm using the rand parameter because results seems to vary a bit. p0
>> is in this case 0.634, and I'm not sure how to interpret this. From
>> literature, this is described as "Prior probability that a gene is not
>> differentially expressed" - What does this exactly mean? Does this
>> imply, that there is a ~63% percent chance, that the genes in
>> question, are actually NOT differentially expressed?
>
>
> It means that about 63% of your genes appear to be not differentially
> expressed. So if you choose a gene at random, there is a 63% probability
> that you will choose one that isn't differentially expressed.
>
> However, depending on the value of Delta that you choose, the expectation is
> that a far fewer percentage of the genes selected will be differentially
> expressed. In other words, you are trying to grab genes with a higher
> probability of differential expression, and you are then estimating what
> percentage of those genes are still likely false positives (e.g., if you
> choose a Delta of 1.3, you will get 134 significant genes, and will expect
> that about 10 of those will be false positives).
>
>
>> I've also found some varying sources saying that it is a good idea to
>> log2 transform data before inputting into SAM. Does this still apply,
>> and if so, why?
>
>
> This is because the t-test is based on means, which are not very robust to
> outliers. Gene expression data tend to have a strong right skew, meaning
> that most of the data are within a certain range, but there are some values
> much higher. If you take logs, it tends to minimize the skew, so the large
> values have less of an effect (on the linear scale, expression values range
> from 0-65,000, on log2 scale, they range from 0-16). It doesn't matter what
> base you use, but people have tended to use log base 2 because then a
> difference of 1 indicates a two-fold difference on the linear scale.
>
> Best,
>
> Jim
>
>
>>
>> Best Regards,
>>
>> David Westergaard
>> Undergraduate student
>> Technical University of Denmark
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor