[BioC] Siggenes SAM analysis: log2 transformation and Understanding output

Sat Feb 18 13:21:24 CET 2012

Hi Jim, hi David,

thanks a lot, Jim, for answering these questions. Just one or two comments:

> > Lastly, is there an option to specify a FDR threshold? I am running
> > through multiple data sets, and I would like to automate it, instead
> > of having to looking at a table for each one individually.
> 

> I don't see how you could automate things with siggenes. It is designed 
> to be run interactively.
> 

There exists a function called findDelta in siggenes, which allows you to find the value of delta and thus the set of genes for which the FDR is below some specified threshold. So if you really wanna automate your analysis by specifyig a specific threshold for the FDR (which is actually not really the idea behind SAM), then you can do this by using findDelta.

(Alternatively, you can use the q-values that are computed for all genes -- available via the slot q.value, i.e. sam.out at q.value -- to select all genes with a q-value, i.e. FDR adjusted p-value, below some threshold. Please note the set of genes might slightly differ between the SAM FDR and the q-value approach, mainly because in the former approach the thresholds are not necessary symmetric to the origin.)

> limma package, which estimates FDR using p.adjust(), rather than via 
> permutation. So if you are interested in automation, particularly 
> automating the annotation/output side of things, take a look at 
> affycoretools.
> </shameless plug>

Don't think that this is a shameless plug. Just some nice hints to other great, helpful packages. 

Just wanted to note that you can also use the results of a limma analysis in siggenes. There exists a function called limma2sam in siggenes which allows you to perform a SAM analysis with the limma statistic.

Moreover, SAM does not really rely on permutations. E.g., all the stuff in siggenes for SNPs uses asymptotic null distributions. If you wanna use the moderated t-statistic proposed in the original SAM paper, then you need permutations. But it is also possible to use the ordinary parametric t-test (which might make sense if you have a sufficient number of samples). I have not directly implemented this in siggenes to make things not more confusing. But code for a parametric t is available in the siggenes vignette and can thus be extracted from it.

Best,
Holger

> 
> Best,
> 
> Jim
> 
> 
> >
> > Best,
> > David
> >
> >
> >
> > 2012/2/16 James W. MacDonald<jmacdon at uw.edu>:
> >> Hi David,
> >>
> >>
> >> On 2/15/2012 8:30 AM, David Westergaard wrote:
> >>> Hello,
> >>>
> >>> I am currently working on a data set about kiwi consumption for my
> >>> bachelors project. The data is available at
> >>> http://www.ebi.ac.uk/arrayexpress/experiments/E-MEXP-2030
> >>>
> >>> I'm abit confused as to how to interpret the output parameters,
> >>> specifically p0. I've run the following code:
> >>>
> >>> dataset<- read.table("OAS_RMA.txt",header=TRUE)
> >>> controls<-
> >>>
> cbind(dataset$CEL12.1,dataset$CEL13.1,dataset$CEL23.1,dataset$CEL25.1,dataset$CEL37.1,dataset$CEL59.1,dataset$CEL61.1,dataset$CEL78.1,dataset$CEL9.1,dataset$CEL92.1)
> >>> experiments<-
> >>>
> cbind(dataset$CEL18.1,dataset$CEL21.1,dataset$CEL3.1,dataset$CEL31.1,dataset$CEL46.1,dataset$CEL50.1,dataset$CEL56.1,dataset$CEL57.1,dataset$CEL7.1)
> >>>
> >>> library('siggenes')
> >>> datamatrix<- matrix(cbind(controls,experiments),ncol=19)
> >>> y<- rep(0,19)
> >>> y[11:19]<- 1
> >>> gene_names<- as.character(dataset$Hybridization.REF)
> >>> sam.obj = sam(datamatrix,y,gene.names=gene_names,rand=12345)
> >>>
> >>> Output:
> >>> AM Analysis for the Two-Class Unpaired Case Assuming Unequal Variances
> >>>
> >>>   s0 = 0
> >>>
> >>>   Number of permutations: 100
> >>>
> >>>   MEAN number of falsely called variables is computed.
> >>>
> >>>     Delta    p0    False Called    FDR cutlow cutup   j2    j1
> >>> 1    0.1 0.634 28335.89  37013 0.4851 -1.058 0.354 9709 27372
> >>> 2    0.5 0.634 11200.82  21273 0.3336 -2.271 0.910 2447 35850
> >>> 3    0.9 0.634   249.38   1522 0.1038 -3.374 3.088  541 53695
> >>> 4    1.3 0.634     9.67    134 0.0457 -4.402 5.577  127 54669
> >>> 5    1.7 0.634     0.69     20 0.0219 -5.596   Inf   20 54676
> >>> 6    2.1 0.634        0      1      0 -9.072   Inf    1 54676
> >>> 7    2.5 0.634        0      1      0 -9.072   Inf    1 54676
> >>> 8    2.9 0.634        0      1      0 -9.072   Inf    1 54676
> >>> 9    3.3 0.634        0      1      0 -9.072   Inf    1 54676
> >>> 10   3.7 0.634        0      0      0   -Inf   Inf    0 54676
> >>>
> >>> I'm using the rand parameter because results seems to vary a bit. p0
> >>> is in this case 0.634, and I'm not sure how to interpret this. From
> >>> literature, this is described as "Prior probability that a gene is not
> >>> differentially expressed" - What does this exactly mean? Does this
> >>> imply, that there is a ~63% percent chance, that the genes in
> >>> question, are actually NOT differentially expressed?
> >>
> >> It means that about 63% of your genes appear to be not differentially
> >> expressed. So if you choose a gene at random, there is a 63%
> probability
> >> that you will choose one that isn't differentially expressed.
> >>
> >> However, depending on the value of Delta that you choose, the
> expectation is
> >> that a far fewer percentage of the genes selected will be
> differentially
> >> expressed. In other words, you are trying to grab genes with a higher
> >> probability of differential expression, and you are then estimating
> what
> >> percentage of those genes are still likely false positives (e.g., if
> you
> >> choose a Delta of 1.3, you will get 134 significant genes, and will
> expect
> >> that about 10 of those will be false positives).
> >>
> >>
> >>> I've also found some varying sources saying that it is a good idea to
> >>> log2 transform data before inputting into SAM. Does this still apply,
> >>> and if so, why?
> >>
> >> This is because the t-test is based on means, which are not very robust
> to
> >> outliers. Gene expression data tend to have a strong right skew,
> meaning
> >> that most of the data are within a certain range, but there are some
> values
> >> much higher. If you take logs, it tends to minimize the skew, so the
> large
> >> values have less of an effect (on the linear scale, expression values
> range
> >> from 0-65,000, on log2 scale, they range from 0-16). It doesn't matter
> what
> >> base you use, but people have tended to use log base 2 because then a
> >> difference of 1 indicates a two-fold difference on the linear scale.
> >>
> >> Best,
> >>
> >> Jim
> >>
> >>
> >>> Best Regards,
> >>>
> >>> David Westergaard
> >>> Undergraduate student
> >>> Technical University of Denmark
> >>>
> >>> _______________________________________________
> >>> Bioconductor mailing list
> >>> Bioconductor at r-project.org
> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>> Search the archives:
> >>> http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor

--