[BioC] siggenes SAM FDR, # false discrepancies
Jacob Michaelson
jjmichael at comcast.net
Tue Jun 7 03:36:33 CEST 2005
Thanks for the reply, Holger.
Let me just restate what I understand from what you said, just to
make sure I'm understanding this correctly (it's probably pretty
obvious that I'm no statistician):
The column "False" in the SAM table gives a kind of "worst-case
scenario" for the number of false calls at the given number of genes
called significant. To get a more realistic expectation of how many of
those are indeed falsely called, we multiply the number in the False
column by the proportion pi hat (denoted here as p0). Is there a
reason why this "adjusted" number of false calls is not directly
reported in the output table?
Does this practical interpretation make sense?
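
To check this reading against my own table, here is a minimal Python
sketch. The function and variable names are my own and are not part of
siggenes; p0 is the rounded 0.147 printed in the table, so agreement
with the reported FDR column can only be expected up to rounding.

```python
# Rows copied from the siggenes table: (Delta, False, Called, reported FDR).
P0 = 0.147  # pi hat as printed in the table (rounded to three decimals)

ROWS = [
    (0.25, 2023.5, 4220, 0.070),
    (0.50, 1211.5, 2686, 0.066),
    (0.75, 789.0, 1617, 0.072),
    (1.00, 390.0, 866, 0.066),
    (1.25, 197.0, 496, 0.058),
    (1.50, 120.5, 337, 0.053),
    (1.75, 69.5, 228, 0.045),
    (2.00, 38.5, 150, 0.038),
    (2.25, 22.0, 94, 0.034),
    (2.50, 13.0, 65, 0.029),
    (2.75, 7.0, 30, 0.034),
    (3.00, 4.0, 16, 0.037),
    (3.25, 1.0, 7, 0.021),
    (3.50, 0.5, 4, 0.018),
    (3.75, 0.0, 0, 0.000),
]


def adjusted_false(false_calls, p0=P0):
    """Scale the 'worst-case' False column by the estimated null proportion."""
    return p0 * false_calls


def fdr(false_calls, called, p0=P0):
    """FDR = p0 * False / Called; defined as 0 when nothing is called."""
    if called == 0:
        return 0.0
    return adjusted_false(false_calls, p0) / called


for delta, false_calls, called, reported in ROWS:
    print(f"{delta:5.2f}  adjusted false = {adjusted_false(false_calls):8.2f}  "
          f"FDR = {fdr(false_calls, called):.3f}  (table: {reported})")
```

If this interpretation is right, each computed FDR should agree with the
FDR column to within rounding, and the "adjusted" count is simply
FDR * Called.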
The reason I believe that Excel is multiplying its number of false
calls by pi hat is that (at least in my data) the median number false
converges (as the number of total called decreases) to the exact
value of pi hat. The exact value of pi hat is reported as the median
number false for everything below a certain value, and the number
false never goes below pi hat. I thought for a long time about why
that would be, and it occurred to me that maybe the minimum # false
that Excel reports is 1. This minimum is then multiplied by pi hat to
give... pi hat. That's the only way I can make sense of the numbers
Excel gives.
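
To make the guess concrete, here is a small Python sketch of what I
suspect is happening. This is purely my hypothesis and not documented
Excel SAM behaviour; the function name, the floor of 1, and the use of
0.147 as pi hat (borrowed from my siggenes run for illustration) are
all assumptions.

```python
def guessed_excel_false(median_false, pi_hat, floor=1.0):
    """Hypothetical: clamp the median false count at `floor`, then scale by pi hat."""
    return pi_hat * max(median_false, floor)


PI_HAT = 0.147  # illustrative value only, taken from my siggenes table

# Once the raw median drops to 1 or below, the result stays at pi hat.
for median_false in (4, 1, 0.5, 0):
    print(median_false, guessed_excel_false(median_false, PI_HAT))
```

Under this guess, every raw median of 1 or less would be reported as
exactly pi hat, which is the pattern I see.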
Thanks again for the help.
mfg
--Jake
On Jun 6, 2005, at 2:47 PM, Holger Schwender wrote:
> Hi Jacob,
>
> the FDR is estimated in almost exactly the same way as in Excel SAM.
> One difference is that by default siggenes uses the mean number of
> falsely called genes (as described, e.g., in the Tusher paper) instead
> of the median number of falsely called genes. You can, however, set
> med=TRUE in sam and the median number will be calculated. The other
> difference is the computation of pi0. While Excel SAM uses an ad hoc
> estimate, siggenes uses by default a natural cubic spline based
> estimate proposed by John Storey, a former PhD student of Tibshirani
> and co-author of the Tusher paper. You can, however, set lambda=.5 in
> sam(...) and you will get the ad hoc estimate of Excel SAM. So
> changing these two defaults will give you exactly the estimates that
> are given to you by Excel SAM.
>
> To explain the table: False is the mean/median number of falsely
> called genes. If you divide False by Called you will get an estimate
> which corresponds to the Benjamini-Hochberg estimate of the FDR. As
> authors like John Storey, Brad Efron etc. have mentioned, this is a
> conservative estimate that assumes that all null hypotheses are true,
> i.e. none of the genes are differentially expressed. In this case,
> the number of falsely called genes is equal to the number of false
> positives. We know, however, that some of the genes are differentially
> expressed, so we estimate their proportion by 1 - pi0, and this leads
> to an estimate for the number of false positives of pi0 * number of
> falsely called genes. And the FDR is, loosely speaking, the expected
> ratio of the number of false positives to the number of rejected null
> hypotheses / genes called differentially expressed, and hence FDR =
> pi0 * False / Called.
>
> So the column False can be calculated by multiplying FDR by Called and
> dividing this by pi0, i.e. in your example: False = 0.07*4220/.147.
>
> And this is the way both Excel SAM and siggenes compute the FDR. The
> reason for this is that the implementation of siggenes follows almost
> exactly the implementation of Excel SAM (almost, since some parts of
> Excel SAM are not known, namely the exact choice of the possible
> quantiles for the fudge factor and the q-value calculation) as
> described in Tusher et al. and the Excel SAM manual.
>
> HTH,
> Holger
>
>
>>
>> I know there have been several messages comparing Excel SAM vs
>> siggenes SAM, and several others asking about how the FDR is
>> calculated in siggenes SAM, but none of these have answered a
>> question I have about what to believe in the SAM output table. My
>> table from siggenes:
>>
>>
>> Delta     p0   False  Called    FDR
>>  0.25  0.147  2023.5    4220   0.07
>>   0.5  0.147  1211.5    2686  0.066
>>  0.75  0.147     789    1617  0.072
>>     1  0.147     390     866  0.066
>>  1.25  0.147     197     496  0.058
>>   1.5  0.147   120.5     337  0.053
>>  1.75  0.147    69.5     228  0.045
>>     2  0.147    38.5     150  0.038
>>  2.25  0.147      22      94  0.034
>>   2.5  0.147      13      65  0.029
>>  2.75  0.147       7      30  0.034
>>     3  0.147       4      16  0.037
>>  3.25  0.147       1       7  0.021
>>   3.5  0.147     0.5       4  0.018
>>  3.75  0.147       0       0      0
>>
>> The FDR here is Pi hat * (false/called). I'm not sure what that is
>> supposed to mean. Which number of false am I supposed to believe?
>> The number false as calculated by multiplying the FDR by the #
>> called? (This makes sense to me, for example: 0.07*4220=297.5
>> false) Or the # false as reported in the false column? (This doesn't
>> make sense to me...what's the point of the FDR as calculated if it
>> doesn't jibe with the # false and the # called?)
>>
>> Excel SAM seems to circumvent this problem by multiplying the number
>> false by Pi hat (and reporting only that product, not the number
>> false before being multiplied), and then calculating the FDR as
>> false/called; this FDR then implicitly has Pi hat in it (or so it
>> seems to me). This way, # false, # called, and the observed FDR all
>> correspond correctly, unlike siggenes SAM.
>>
>> Thanks in advance for any help.
>>
>> --Jake
>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>
>>