[BioC] dataset dim for siggenes

James W. MacDonald jmacdon at uw.edu
Fri Sep 12 23:11:29 CEST 2014

Hi Fred,

I'll take the second question first. The methods that have been developed
for analyzing microarray data are all just modifications of the existing
linear modeling methods that people have used for years (t-test, ANOVA,
linear modeling of continuous covariates, etc). People developed these
methods because, in general, with microarray data you run into the problem
of making tons of comparisons with very little replication. The problem
with that is you a) need to adjust the p-values to reflect that you are
making (possibly thousands of) simultaneous comparisons, and b) often have
maybe 3 or 4 replicates for each group, so your power to detect differences
is probably really low. So the goal was to figure out ways to improve the
power for these comparisons in a statistically rigorous manner, and people
developed lots of ways to do that.
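
To make point a) concrete, here is a minimal sketch of one common
adjustment, the Benjamini-Hochberg false discovery rate procedure. It is an
illustration only, not the code used by limma, siggenes or multtest; the
function name bh_adjust and the example p-values are made up.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values
    # never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.001, 0.008, 0.039, 0.041, 0.60]  # made-up raw p-values
adj = bh_adjust(raw)
for p, q in zip(raw, adj):
    print(f"raw p = {p:.3f}  adjusted p = {q:.5f}")
```

Note that the two raw p-values just under 0.05 (0.039 and 0.041) end up
just above 0.05 after adjustment, which is the whole point: a cutoff that is
reasonable for one test is too generous when you make many tests at once.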

There was also some concern that the usual assumption of normally
distributed data might not hold for all the genes being compared, so
different groups developed ways to increase power and also generate
permuted null distributions, so you wouldn't have to make an assumption
that might not hold.
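
As a rough sketch of what a permuted null distribution means for a single
gene: shuffle the group labels many times, recompute the statistic each
time, and see how often the shuffled statistic is at least as extreme as
the observed one. The function name, the number of permutations and the
example values below are assumptions for illustration; the packages
mentioned here use more refined statistics than a plain difference in
means.

```python
import random
from statistics import mean


def permutation_pvalue(group1, group2, n_perm=2000, seed=1):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group1) - mean(group2))
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    hits = 0
    for _ in range(n_perm):
        # Shuffling the pooled values is the same as shuffling group labels.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n1]) - mean(pooled[n1:]))
        if diff >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly 0.
    return (hits + 1) / (n_perm + 1)


# Made-up expression values for one gene in two groups of four subjects.
control = [1.0, 1.2, 0.9, 1.1]
treated = [5.0, 5.2, 4.9, 5.1]
print(permutation_pvalue(control, treated))
```

No normality assumption is needed here, but with only four subjects per
group there are just 70 distinct ways to split the eight values, so the
p-value cannot get very small. That is one reason replication matters even
for permutation methods.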

But in the end, all these methods (limma, siggenes, multtest, etc) are just
fitting t-tests that are modified to help increase power. So they are all
doing essentially the same thing, but in a slightly different manner. So if
you run your samples through limma, and then siggenes, and then multtest,
any changes in your results will simply reflect differences in the methods
used, but won't give you any more information about your samples. And since
you have 15 replicates for each group, you would probably get very similar
results if you were to just use 'regular' methods, because you aren't
measuring that many genes, and you have pretty good replication.
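
To give a feel for what 'modified t-test' means, here is a minimal sketch
of a SAM-style statistic, where a constant s0 is added to the denominator
of the ordinary t-statistic so that genes with a tiny variance (by chance)
don't dominate the ranking. This is not the code from siggenes, and limma
does something different (it shrinks the variance itself with an empirical
Bayes model). Setting s0 to the median standard error is purely an
assumption for illustration, and the data are made up.

```python
from statistics import mean, median, variance


def standard_error(x, y):
    """Pooled-variance standard error of the difference in means."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (sp2 * (1 / nx + 1 / ny)) ** 0.5


def moderated_stats(group1, group2):
    """group1, group2: per-gene lists of replicates, in the same gene order."""
    diffs = [mean(x) - mean(y) for x, y in zip(group1, group2)]
    ses = [standard_error(x, y) for x, y in zip(group1, group2)]
    s0 = median(ses)  # assumption: illustrative choice of the constant
    ordinary = [d / s for d, s in zip(diffs, ses)]
    moderated = [d / (s + s0) for d, s in zip(diffs, ses)]
    return ordinary, moderated


# Three made-up genes, three replicates per group.
g1 = [[5.00, 5.01, 4.99], [4.0, 5.0, 6.0], [3.0, 3.5, 4.0]]
g2 = [[5.10, 5.11, 5.09], [6.0, 7.0, 8.0], [3.2, 3.6, 4.3]]
ordinary, moderated = moderated_stats(g1, g2)
for i, (t, d) in enumerate(zip(ordinary, moderated)):
    print(f"gene {i}: ordinary t = {t:.2f}, moderated = {d:.2f}")
```

The first gene has a tiny difference but an even tinier variance, so it
tops the ordinary ranking; after adding s0 it drops below the second gene,
which has a much larger difference.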

On the other hand, running a new set of samples will tell you a great deal.
This has to do with the underlying hypothesis that you are (usually)
testing. In general when you are doing a comparison, you are trying to
estimate a population parameter using a sample from that population. In
other words, you are trying to make a statement about all the members of a
population, based on a sample from that population. There is always the
possibility that you were unlucky and chose subjects from the two
populations that happen to look really different, when in truth there is no
difference between the two populations. You then make your
measurements, and say 'look, gene X appears to be expressed at a much
higher level in population 1 as compared to population 2'. But remember,
you were unlucky in your choice of subjects to represent the two
populations, and there really aren't any differences. So repeating the
experiment with new subjects will likely not have the same result, and you
will be glad that you didn't try to publish your results.
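
The 'unlucky sample' scenario can be sketched with a small simulation. The
setup is an assumption chosen to mirror the numbers in this thread: 116
genes, 15 subjects per group, and no true difference for any gene. Each
experiment draws fresh subjects and flags genes whose ordinary pooled
t-statistic passes an unadjusted two-sided 5% cutoff.

```python
import random
from statistics import mean, variance

T_CRIT = 2.048  # two-sided 5% critical value of Student's t with 28 df


def pooled_t(x, y):
    """Ordinary two-sample t-statistic with pooled variance."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / (sp2 * (1 / nx + 1 / ny)) ** 0.5


def run_experiment(rng, n_genes=116, n_per_group=15):
    """Return the set of genes flagged when both groups share one population."""
    hits = set()
    for gene in range(n_genes):
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if abs(pooled_t(a, b)) > T_CRIT:
            hits.add(gene)
    return hits


rng = random.Random(2014)
first = run_experiment(rng)   # original subjects
second = run_experiment(rng)  # new subjects from the same populations
replicated = first & second
print(len(first), "genes 'significant' in the first experiment")
print(len(replicated), "of those significant again with new subjects")
```

With no true differences you still expect roughly 5% of the genes to pass
the cutoff in any one experiment, but any given false positive has only
about a 5% chance of passing again with new subjects.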

Alternatively, if you repeat the experiment for the top 10 genes with a new
set of samples and they are all significant again, then you have pretty
good evidence that there really is a difference between the two
populations, because you got the same results with two separate sets of
subjects. But of course that assumes you are doing a reasonable job of
selecting subjects in an unbiased manner, which is a different topic.

For the first question, there are any number of things you can and should
test. I won't go into them here because a simple Google search like 'R
testing anova assumptions' is likely to bring up all the results you need
to answer that question.



On Fri, Sep 12, 2014 at 3:53 PM, <ferreirafm at usp.br> wrote:

> Hi Jim,
> Could you please tell me which tests I should perform in order to ensure
> that my data fulfill the linear model assumptions?
> Turning back to my question about "performing several different tests to
> decide which mirs to take", could you explain a little bit more why such
> an approach doesn't make sense?
> Best,
> Fred
> ------------------------------
> *From: *"James W. MacDonald" <jmacdon at uw.edu>
> *To: *ferreirafm at usp.br
> *Cc: *"bioconductor" <bioconductor at r-project.org>
> *Sent: *Friday, September 12, 2014 12:47:55
> *Subject: *Re: [BioC] dataset dim for siggenes
> Hi Fred,
> I am assuming you have 116 miRNAs, and 60 samples. In which case you could
> probably just use a conventional t-test or linear model, although using
> limma wouldn't be a controversial decision. Not too sure about siggenes
> though. You have to estimate the proportion of true nulls, and I don't know
> if 116 comparisons are enough.
> But the larger question is the issue of running further statistical tests
> for validation. I am not sure what you mean by that. Quantitative PCR is
> (for better or worse) assumed to be the 'gold standard' for quantification
> of nucleic acid sequences, so there doesn't seem to be much more to do.
> Certainly re-running the analyses using a slightly different method isn't
> useful. That's like weighing yourself on a bunch of different scales; it
> tells you way more about the scales than it does about your weight.
> I think the next step (or really, the first step if you haven't already
> done so) is to ensure that your data meet all the underlying assumptions
> for linear modelling, so that you can have confidence in the conclusions
> you draw from the results.
> Best,
> Jim
> On Fri, Sep 12, 2014 at 11:18 AM, <ferreirafm at usp.br> wrote:
>> Hi list,
>> I have a qPCR 116 x 60 data set processed with limma. Results showed 30 DE
>> miRNAs. My idea is to pick 10 of them for validation by running further
>> statistical tests and taking the most recurrent mirs from all analyses
>> (does it make sense?). Well, I was thinking of using siggenes; however,
>> its authors recommend it for high-dimensional data. Will siggenes be
>> suitable for my data? If not, could someone suggest other packages and
>> perhaps tests more appropriate for data of this size?
>> Best.
>> Fred
> --
> James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099

James W. MacDonald, M.S.
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
