[BioC] Deciding on a cut off after QC

Tue May 17 05:48:15 CEST 2005

Dear Adai,
Thank you for the detailed response.
Correct me if I'm wrong.
What you are saying is that after applying the "fdr"
I give a cut off of 0.05 for the p-value and write all
those spots into an object (positives).
That means, I do  not consider the B values.

In essence, it is just a one sample t test afetr
adjusting the p-value and taking into consideration
all those spots that have a p-value < 0.05?

>From my code in a previous mail to Gordon Smyth, I
have done a quality control (QC) using parameters and
threshold values prescribed by genepix.
I know from experience, a large number of spots do not
get through the filter.
As I understand from LIMMA, a weight of "0" is given
to any spot that does not get through the QC filter.
How do I exclude these spots from the final result
summary file?
I need to get a set of significantly differentially
expressed genes that have got through the QC filter.
How do I do it using LIMMA?
Thank you,
-Ankit

--- Adaikalavan Ramasamy <ramasamy at cancer.org.uk>
wrote:

> See comments below.
> 
> On Mon, 2005-05-16 at 05:42 -0700, Ankit Pal wrote:
> > Dear Dr Smyth,
> > I guess there has been a mistake in communication.
> > There are 39000 spots on the array but the number
> of
> > probes are lesser as there are replicates(not a
> fixed
> > number) on the array.
> > The output from toptable() gives me a total of
> 39000
> > rows, one for each spot.
> > I need to give an output of the spots most
> significant
> > in this list.
> 
> I do not understand this clearly. Do you mean you
> have some duplicated
> spots or is there redundancy when you convert to
> something common like
> unigene ID ? 
> 
> If the former, I believe there are some facilities
> within LIMMA for
> equal number duplications per spots but I am not
> sure. For the latter,
> you could use tapply or something to find the
> min/max of the statistics.
> 
> > What is the cut off value(p or B or any other) I
> need
> > to use above which I can consider a set of
> > differentially expressed genes significant enough
> for
> > further use (eg :function based clustering of all
> > significantly overexpressed genes).
> > I cannot use all the genes in the topTable()
> output.
> > Where do I put the cut off?
> > Is there a way I can calculate it?
> 
> Selection of p-value threshold is arbitrary. However
> it would be a good
> idea to adjust for multiple comparison. My preferred
> method is the FDR
> by Benjamini and Hochberg but there are many other
> methods too each with
> a different meaning. I will illustrate FDR using the
> slightly modified
> example from help(topTable) :
> 
> set.seed(123)  # for reproducibility only - do not
> use this in real life
> 
> M <- matrix(rnorm(10000*6,sd=0.3), 10000, 6)
> M[1:100, 1:3] <- M[1:100, 1:3] + 2
> design <- cbind( First3Arrays=c(1,1,1,0,0,0),   
>                  Last3Arrays=c(0,0,0,1,1,1))
> fit <- lmFit(M, design=design)
> fit <- eBayes(fit)
> 
> tab <- topTable( fit, adjust.method="fdr",
> sort.by="p", n=nrow(M) )
> 
> positives <- rownames(tab)[ which(tab[ , "P.Value"]
> < 0.05) ]
> 
> The "positives" are the names of genes that we
> declare as differentially
> expression and it is expected that 5% of these genes
> are false positives
> on average. 
> 
> Note that in this case we can explicitly calculate
> the FP rate as
> follows because we have simulated the datasets.
> 
> mean( ifelse( as.numeric(positives) <= 100, 0, 1 ) )
> 0.06542056
> 
> I am sure Gordon Symth has a far more elegant and
> flexible version of
> this documented somewhere in his user guide.
> 
> Regards, Adai
> 
> 
> > sincerely,
> > -Ankit
> > 
> > --- Gordon K Smyth <smyth at wehi.EDU.AU> wrote:
> > > On Mon, May 16, 2005 9:49 pm, Ankit Pal said:
> > > > Dear Dr Smyth,
> > > > I'm sorry for not having specified which
> result
> > > file.
> > > > It is the final result summary we get after we
> > > give
> > > > the command
> > > > Resultfile <- topTable(fit,n=200,
> adjust="fdr")
> > > > A sample result file has been attached.
> > > > The code I used for my analysis is
> > > >
> > > >> targets <- readTargets("target.txt")
> > > >
> > > > #The QC filter
> > > >> myfun <- function(x,threshold=55){
> > > > + okred <- abs(x[,"% > B635+2SD"]) > threshold
> > > > + okgreen <- abs(x[,"% > B532+2SD"]) >
> threshold
> > > > + okflag <- abs(x[,"Flags"]) > 0
> > > > + okRGN <- abs(x[,"Rgn R"]) > 0.6
> > > > + as.numeric(okgreen || okred || okflag ||
> okRGN)
> > > > + }
> > > > #end of QC filter
> > > >
> > > >> RG_7 <- read.maimages(targets$FileName,
> > > > source="genepix",wt.fun=myfun)
> > > >> RG_7$genes <- readGAL()
> > > 
> > > As I said last week, this command is not needed
> with
> > > GenePix data.  You should omit it.
> > > 
> > > >> RG_7$printer <- getLayout(RG_7$genes)
> > > >> MA_7 <-
> > > normalizeWithinArrays(RG_7,method="loess")
> > > >> MA_7 <- normalizeBetweenArrays(MA_7)
> > > >> fit_7 <- lmFit(MA_7, design=c(1,-1,1,-1))
> > > >> fit_7 <- eBayes(fit_7)
> > > >> options(digits=3)
> > > >> Resultfile_7 <- topTable(fit_7, n=39000,
> > > > adjust="fdr")
> > > >> Resdat_7 <-data.frame(Resultfile_7)
> > > 
> > > Resultfile_7 is already a data.frame.
> > > 
> > > >> write.table(Resdat_,file='Result.csv',quote =
> > > FALSE,
> > > > sep = "\t")
> > > >
> > > > I understand that the spots that do not
> qualify
> > > the QC
> > > > filter are given a weight of "0" by limma and
> are
> > > not
> > > > considered for normalization and will not
> affect
> > > the
> > > > analysis.
> > > > The result file I get contains all the spots
> > > (38000)
> > > > in my case.
> > > 
> > > Mmm, this doesn't make sense.  You have 4 arrays
> > > with 39000 spots on each array.  Hence you have
> > > 4*39000 spots, not 39000.  The output from
> > > topTable() gives a summarized log-ratio for each
> > > probe,
> > > not a result for each spot.
> > > 
> > > Gordon
> > > 
> > > > Didn't the spots that were bad get removed
> from
> > > the
> > > > final result?
> > > > If not what is the cut off value (B, p etc)
> that I
> > > > need to use to get a set of reliable spots(I
> cant
> > > use
> > > > all the 38000) from my result file for my
> > > analysis.
> > > > Is there a fixed formula to derive the same as
> the
> > > > values vary with the analysis.
> > > > Waiting for your reply,
> > > > Thank you,
> > > > -Ankit
> > > 
> > > 
> > >
> > 
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at stat.math.ethz.ch
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > 
> 
>