[BioC] Deciding on a cut off after QC

Adaikalavan Ramasamy ramasamy at cancer.org.uk
Tue May 17 11:52:21 CEST 2005


See comments below. If anyone finds them erroneous, please correct me.

On Mon, 2005-05-16 at 20:48 -0700, Ankit Pal wrote:
> Dear Adai,
> Thank you for the detailed response.
> Correct me if I'm wrong.
> What you are saying is that after applying the "fdr"
> I give a cut off of 0.05 for the p-value and write all
> those spots into an object (positives).
> That means, I do  not consider the B values.

In some sense, yes. In paragraph 2 of section 7 of [1], Symth says that
posterior odds (B) depend on the estimate of the proportion of genes
that are differentially expressed while moderated t-test (whose p-values
are given in topTable output) does not.

The B values are useful if you want to calculate posterior probabilities
of differential expression.


[1] Smyth, G. K. (2004). Linear models and empirical Bayes methods for
assessing differential expression in microarray experiments. Statistical
Applications in Genetics and Molecular Biology 3, No. 1, Article 3.
http://www.bepress.com/sagmb/vol3/iss1/art3/

> In essence, it is just a one sample t test afetr
> adjusting the p-value and taking into consideration
> all those spots that have a p-value < 0.05?

No, this is based on moderated t-test where the variance in the
denominator is shrunk towards a global variance like what SAM does. But
unlike SAM, LIMMA has 
1) a better theoretical derivation and shrinks the variance instead of
standard deviation
2) shows that the moderated t-test follows a t-distribution with
augmented degrees of freedom and thus no need to obtain p-values by
permutation testing, hence computationally faster
3) more flexible to extend to other designs


> >From my code in a previous mail to Gordon Smyth, I
> have done a quality control (QC) using parameters and
> threshold values prescribed by genepix.
> I know from experience, a large number of spots do not
> get through the filter.
> As I understand from LIMMA, a weight of "0" is given
> to any spot that does not get through the QC filter.
> How do I exclude these spots from the final result
> summary file?
> I need to get a set of significantly differentially
> expressed genes that have got through the QC filter.
> How do I do it using LIMMA?

I assume you have read the section 5.4 of the User Guide
http://bioinf.wehi.edu.au/limma/usersguide.pdf

I am not familiar with spot filtering nor genepix, so I cannot help you
much here. 

But re-reading your previous mail, I noticed that you used something
like "% > B635+2SD". Are you sure this is not "% > B635 + 2 SD". Note
that the spaces matter ! Otherwise you can try renaming the columns to
something like "green2sd" and "red2sd" just in case R is mis-reading the
spaces somehow.

Best wishes,

Adai


> Thank you,
> -Ankit
> 
> 
> 
> 
> --- Adaikalavan Ramasamy <ramasamy at cancer.org.uk>
> wrote:
> 
> > See comments below.
> > 
> > On Mon, 2005-05-16 at 05:42 -0700, Ankit Pal wrote:
> > > Dear Dr Smyth,
> > > I guess there has been a mistake in communication.
> > > There are 39000 spots on the array but the number
> > of
> > > probes are lesser as there are replicates(not a
> > fixed
> > > number) on the array.
> > > The output from toptable() gives me a total of
> > 39000
> > > rows, one for each spot.
> > > I need to give an output of the spots most
> > significant
> > > in this list.
> > 
> > I do not understand this clearly. Do you mean you
> > have some duplicated
> > spots or is there redundancy when you convert to
> > something common like
> > unigene ID ? 
> > 
> > If the former, I believe there are some facilities
> > within LIMMA for
> > equal number duplications per spots but I am not
> > sure. For the latter,
> > you could use tapply or something to find the
> > min/max of the statistics.
> > 
> > > What is the cut off value(p or B or any other) I
> > need
> > > to use above which I can consider a set of
> > > differentially expressed genes significant enough
> > for
> > > further use (eg :function based clustering of all
> > > significantly overexpressed genes).
> > > I cannot use all the genes in the topTable()
> > output.
> > > Where do I put the cut off?
> > > Is there a way I can calculate it?
> > 
> > Selection of p-value threshold is arbitrary. However
> > it would be a good
> > idea to adjust for multiple comparison. My preferred
> > method is the FDR
> > by Benjamini and Hochberg but there are many other
> > methods too each with
> > a different meaning. I will illustrate FDR using the
> > slightly modified
> > example from help(topTable) :
> > 
> > set.seed(123)  # for reproducibility only - do not
> > use this in real life
> > 
> > M <- matrix(rnorm(10000*6,sd=0.3), 10000, 6)
> > M[1:100, 1:3] <- M[1:100, 1:3] + 2
> > design <- cbind( First3Arrays=c(1,1,1,0,0,0),   
> >                  Last3Arrays=c(0,0,0,1,1,1))
> > fit <- lmFit(M, design=design)
> > fit <- eBayes(fit)
> > 
> > tab <- topTable( fit, adjust.method="fdr",
> > sort.by="p", n=nrow(M) )
> > 
> > positives <- rownames(tab)[ which(tab[ , "P.Value"]
> > < 0.05) ]
> > 
> > The "positives" are the names of genes that we
> > declare as differentially
> > expression and it is expected that 5% of these genes
> > are false positives
> > on average. 
> > 
> > Note that in this case we can explicitly calculate
> > the FP rate as
> > follows because we have simulated the datasets.
> > 
> > mean( ifelse( as.numeric(positives) <= 100, 0, 1 ) )
> > 0.06542056
> > 
> > I am sure Gordon Symth has a far more elegant and
> > flexible version of
> > this documented somewhere in his user guide.
> > 
> > Regards, Adai
> > 
> > 
> > > sincerely,
> > > -Ankit
> > > 
> > > --- Gordon K Smyth <smyth at wehi.EDU.AU> wrote:
> > > > On Mon, May 16, 2005 9:49 pm, Ankit Pal said:
> > > > > Dear Dr Smyth,
> > > > > I'm sorry for not having specified which
> > result
> > > > file.
> > > > > It is the final result summary we get after we
> > > > give
> > > > > the command
> > > > > Resultfile <- topTable(fit,n=200,
> > adjust="fdr")
> > > > > A sample result file has been attached.
> > > > > The code I used for my analysis is
> > > > >
> > > > >> targets <- readTargets("target.txt")
> > > > >
> > > > > #The QC filter
> > > > >> myfun <- function(x,threshold=55){
> > > > > + okred <- abs(x[,"% > B635+2SD"]) > threshold
> > > > > + okgreen <- abs(x[,"% > B532+2SD"]) >
> > threshold
> > > > > + okflag <- abs(x[,"Flags"]) > 0
> > > > > + okRGN <- abs(x[,"Rgn R"]) > 0.6
> > > > > + as.numeric(okgreen || okred || okflag ||
> > okRGN)
> > > > > + }
> > > > > #end of QC filter
> > > > >
> > > > >> RG_7 <- read.maimages(targets$FileName,
> > > > > source="genepix",wt.fun=myfun)
> > > > >> RG_7$genes <- readGAL()
> > > > 
> > > > As I said last week, this command is not needed
> > with
> > > > GenePix data.  You should omit it.
> > > > 
> > > > >> RG_7$printer <- getLayout(RG_7$genes)
> > > > >> MA_7 <-
> > > > normalizeWithinArrays(RG_7,method="loess")
> > > > >> MA_7 <- normalizeBetweenArrays(MA_7)
> > > > >> fit_7 <- lmFit(MA_7, design=c(1,-1,1,-1))
> > > > >> fit_7 <- eBayes(fit_7)
> > > > >> options(digits=3)
> > > > >> Resultfile_7 <- topTable(fit_7, n=39000,
> > > > > adjust="fdr")
> > > > >> Resdat_7 <-data.frame(Resultfile_7)
> > > > 
> > > > Resultfile_7 is already a data.frame.
> > > > 
> > > > >> write.table(Resdat_,file='Result.csv',quote =
> > > > FALSE,
> > > > > sep = "\t")
> > > > >
> > > > > I understand that the spots that do not
> > qualify
> > > > the QC
> > > > > filter are given a weight of "0" by limma and
> > are
> > > > not
> > > > > considered for normalization and will not
> > affect
> > > > the
> > > > > analysis.
> > > > > The result file I get contains all the spots
> > > > (38000)
> > > > > in my case.
> > > > 
> > > > Mmm, this doesn't make sense.  You have 4 arrays
> > > > with 39000 spots on each array.  Hence you have
> > > > 4*39000 spots, not 39000.  The output from
> > > > topTable() gives a summarized log-ratio for each
> > > > probe,
> > > > not a result for each spot.
> > > > 
> > > > Gordon
> > > > 
> > > > > Didn't the spots that were bad get removed
> > from
> > > > the
> > > > > final result?
> > > > > If not what is the cut off value (B, p etc)
> > that I
> > > > > need to use to get a set of reliable spots(I
> > cant
> > > > use
> > > > > all the 38000) from my result file for my
> > > > analysis.
> > > > > Is there a fixed formula to derive the same as
> > the
> > > > > values vary with the analysis.
> > > > > Waiting for your reply,
> > > > > Thank you,
> > > > > -Ankit
> > > > 
> > > > 
> > > >
> > > 
> > > _______________________________________________
> > > Bioconductor mailing list
> > > Bioconductor at stat.math.ethz.ch
> > > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > > 
> > 
> > 
> 
> 
> __________________________________________________
> Do You Yahoo!?
> Tired of spam?  Yahoo! Mail has the best spam protection around 
> http://mail.yahoo.com 
>



More information about the Bioconductor mailing list