[BioC] dataset dim for siggenes

Sat Sep 13 00:45:27 CEST 2014

Hi Jim, 
Thank you very much for your really nice explanation. 
I'm going to study your answer and, if you don't mind, I would like to come back to it later.
I thought that the Bayesian approach implemented in limma would have different assumptions from the t-test and ANOVA.
Also, in fact, the normality condition doesn't hold for all miRNAs across patients. I'll go back to the ANOVA assumptions and run additional tests. What should I do if they fail?

About sampling, we are trying to gather patients as similar as possible to those from the first experiment, using several criteria such as age, sex, weight, heart flow and other factors commonly used for phenotyping. I hope we are lucky in the sense that you pointed out.
Best, 
Fred 

----- Original Message -----

> From: "James W. MacDonald" <jmacdon at uw.edu>
> To: ferreirafm at usp.br
> Cc: "bioconductor" <bioconductor at r-project.org>
> Sent: Friday, September 12, 2014 18:11:29
> Subject: Re: [BioC] dataset dim for siggenes

> Hi Fred,

> I'll take the second question first. The methods that have been
> developed for analyzing microarray data are all just modifications
> of the existing linear modeling methods that people have used for
> years (t-test, ANOVA, linear modeling of continuous covariates,
> etc). People developed these methods because, in general, with
> microarray data you run into the problem of making tons of
> comparisons with very little replication. The problem with doing
> something like that is that a) you need to adjust the p-values to
> reflect that you are making (possibly thousands of) simultaneous
> comparisons, and b) you often have maybe 3 or 4 replicates for each
> group, so your power to detect differences is probably really low.
> So the goal was to figure out ways to improve the power for these
> comparisons in a statistically rigorous manner, and there were lots
> of ways that people developed to do that.
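
As a rough illustration of point (a) above, here is a minimal Python sketch of the Benjamini-Hochberg adjustment, one common way to correct p-values for many simultaneous comparisons. The thread does not say which adjustment was used, so this is only an example; the function name `bh_adjust` and the p-values are hypothetical. In R, the built-in p.adjust(p, method = "BH") performs the same adjustment.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Visit the p-values from largest to smallest.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = m - position  # 1-based rank when sorted in ascending order
        candidate = pvalues[idx] * m / rank
        # Taking the running minimum keeps the adjusted values monotone
        # and caps them at 1.
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values for four comparisons (not from the thread).
example = [0.01, 0.04, 0.03, 0.005]
print(bh_adjust(example))
```

With only 116 comparisons the correction is much milder than it would be for a genome-wide array, but it is still needed.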

> There was also some concern that the usual assumption of normally
> distributed data might not hold for all the genes being compared, so
> different groups developed ways to increase power and also generate
> permuted null distributions, so you wouldn't have to make an
> assumption that might not hold.
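
The idea of a permuted null distribution can be sketched in a few lines of Python. This is a simplified illustration, not the procedure siggenes or multtest actually implement: it uses a plain difference in means instead of a modified t-statistic, and the function name `permutation_p_value`, its parameters and the example values are all hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign the group labels at random
        permuted = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        # Small tolerance guards against floating-point ties.
        if permuted >= observed - 1e-12:
            hits += 1
    # The +1 terms count the observed labelling as one permutation,
    # so the p-value can never be exactly zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical expression values for one miRNA in two groups.
print(permutation_p_value([10, 11, 12, 13, 14], [1, 2, 3, 4, 5]))
```

Because the null distribution comes from reshuffling the group labels, no normality assumption is needed; the price is that small groups allow only a limited number of distinct permutations.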

> But in the end, all these methods (limma, siggenes, multtest, etc)
> are just fitting t-tests that are modified to help increase power.
> So they are all doing essentially the same thing, but in a slightly
> different manner. So if you run your samples through limma, and then
> siggenes, and then multtest, any changes in your results will simply
> reflect differences in the methods used, but won't give you any more
> information about your samples. And since you have 15 replicates for
> each group, you would probably get very similar results if you were
> to just use 'regular' methods, because you aren't measuring that
> many genes, and you have pretty good replication.
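
For reference, the "modified t-test" in limma's case is the moderated t-statistic from its empirical Bayes model. The notation below is chosen for this note and does not come from the thread: s_g^2 is the residual variance for gene g with d_g degrees of freedom, s_0^2 and d_0 are the prior variance and prior degrees of freedom estimated from all genes, and v_g is the unscaled variance factor of the coefficient.

```latex
\tilde{s}_g^2 = \frac{d_0 s_0^2 + d_g s_g^2}{d_0 + d_g},
\qquad
\tilde{t}_g = \frac{\hat{\beta}_g}{\tilde{s}_g \sqrt{v_g}}
```

Under the null hypothesis the moderated statistic is referred to a t-distribution with d_0 + d_g degrees of freedom, so it rests on the same kind of normality assumption as the ordinary t-test. With 15 replicates per group d_g is already fairly large, which is consistent with the point above that regular methods would give very similar results.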

> On the other hand, running a new set of samples will tell you a great
> deal. This has to do with the underlying hypothesis that you are
> (usually) testing. In general when you are doing a comparison, you
> are trying to estimate a population parameter using a sample from
> that population. In other words, you are trying to make a statement
> about all the members of a population, based on a sample from that
> population. There is always the possibility that you were unlucky
> and chose a set of subjects from the two populations you are
> comparing that are really different, but in truth there is no
> difference between the two populations. You then make your
> measurements, and say 'look, gene X appears to be expressed at a
> much higher level in population 1 as compared to population 2'. But
> remember, you were unlucky in your choice of subjects to represent
> the two populations, and there really aren't any differences. So
> repeating the experiment with new subjects will likely not have the
> same result, and you will be glad that you didn't try to publish
> your results.

> Or alternatively, if you re-run your analysis for the 10 top genes,
> and they are all significant in the next set of samples, then you
> have pretty good evidence that there really is a difference between
> the two populations, because you got the same results with two
> separate sets of subjects. But of course that assumes you are doing
> a reasonable job of selecting subjects in an unbiased manner, which
> is a different topic altogether...

> For the first question, there are any number of things you can and
> should test. I won't go into them here because a simple google
> search like 'R testing anova assumptions' is likely to bring up all
> the results you need to answer that question.

> Best,

> Jim

> On Fri, Sep 12, 2014 at 3:53 PM, < ferreirafm at usp.br > wrote:

> > Hi Jim,

> > Could you please tell me which tests I should perform to ensure
> > that my data fulfill the linear model assumptions?

> > Turning back to my question about "performing several different
> > tests to decide which miRNAs to take", could you explain a little
> > more why such an approach doesn't make sense?

> > Best,

> > Fred

> > > From: "James W. MacDonald" < jmacdon at uw.edu >
> > > To: ferreirafm at usp.br
> > > Cc: "bioconductor" < bioconductor at r-project.org >
> > > Sent: Friday, September 12, 2014 12:47:55
> > > Subject: Re: [BioC] dataset dim for siggenes

> > > Hi Fred,

> > > I am assuming you have 116 miRNAs, and 60 samples. In which case you
> > > could probably just use a conventional t-test or linear model,
> > > although using limma wouldn't be a controversial decision. Not too
> > > sure about siggenes though. You have to estimate the proportion of
> > > true nulls, and I don't know if 116 comparisons are enough.

> > > But the larger question is the issue of running further statistical
> > > tests for validation. I am not sure what you mean by that.
> > > Quantitative PCR is (for better or worse) assumed to be the 'gold
> > > standard' for quantification of nucleic acid sequences, so there
> > > doesn't seem to be much more to do. Certainly re-running the
> > > analyses using a slightly different method isn't useful. That's like
> > > weighing yourself on a bunch of different scales; it tells you way
> > > more about the scales than it does about your weight.

> > > I think the next step (or really, the first step if you haven't
> > > already done so) is to ensure that your data meet all the
> > > underlying assumptions for linear modelling, so that you can have
> > > confidence in the conclusions you draw from the results.

> > > Best,

> > > Jim

> > > On Fri, Sep 12, 2014 at 11:18 AM, < ferreirafm at usp.br > wrote:

> > > > Hi list,

> > > > I have a 116 x 60 qPCR data set processed with limma. Results
> > > > showed 30 DE miRNAs. My idea is to pick 10 of them for validation
> > > > by running further statistical tests and taking the most recurrent
> > > > miRNAs from all analyses (does that make sense?). I was thinking of
> > > > using siggenes; however, its authors recommend it for
> > > > high-dimensional data. Will siggenes be suitable for my data? If
> > > > not, could someone suggest other packages, and perhaps tests, more
> > > > appropriate for data of this size?

> > > > Best,

> > > > Fred

> > > --

> > > James W. MacDonald, M.S.
> > > Biostatistician
> > > University of Washington
> > > Environmental and Occupational Health Sciences
> > > 4225 Roosevelt Way NE, # 100
> > > Seattle WA 98105-6099

> --

> James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099
