[Rd] p.adjust(<NA>s), was "Re: [BioC] limma and p-values"

Sun Jan 9 03:57:10 CET 2005

Martin,

Thanks for this.  I was planning to code suggestions for p.adjust() myself 
today, but you got in before me.

I don't think that the new way of handling NAs is quite correct.  The 
trouble is that all of the adjustment methods other than "none" and 
"bonferonni" are step-down or step-up methods, meaning that you need to 
know all the p-values before you can adjust any of them.

Suppose for example that you have a single p-value 'p=0.05' and you want to 
adjust this using Holm's method with 'n=10'.  Holm's method is a step-down 
Bonferonni method so the adjusted p-value can be anywhere between 0.05 and 
0.5 depending on what the other nine p-values are.  So the correct adjusted 
p-value is indeterminate.

The value returned by 'p.adjust(0.05,method="holm",n=10)' is 0.5.  The same 
value is returned if 'p' is a vector with nine NAs:

   p <- c(0.05,rep(NA,9))
   p.adjust(p,method="holm")

In effect the nine NA p-values are treated as if they are equal to one.  So 
rather than accommodating unknown data, p.adjust() is in effect making data 
up.  I think it would be best not to do this but, if you do, there should 
at least be a warning message like

"warning: p-values indeterminate in the presence of NAs, most conservative 
possible values being returned"

I think that the correct treatment is that implemented for 'na.rm=TRUE', 
i.e., the non-NA adjusted p-values should be the same as if there were no 
NAs in 'p'.  This is because NAs in 'p' virtually always indicate, not an 
existing but unknown p-value, but a p-value which could not be computed 
because there was insufficient data to do so.  This means that there is 
zero probability of rejecting the null hypothesis for these 
cases.  Therefore it is unnecessary to adjust the other p-values for these 
cases.

The argument 'n' to 'p.adjust()' is problematic for the same reasons that 
NAs in 'p' are.  Setting 'n' to be less than than sum(!is.na(p)) should I 
think return an error.  Setting 'n' to be more than sum(!is.na(p)) makes 
all the adjusted p-values indeterminate unless method="none" or 
method="bonferroni".  And setting 'n' to be between sum(!is.na(p)) and 
length(p) when these are different is just horrible.  Would anything be 
lost if this argument was removed?

Gordon

At 07:29 AM 9/01/2005, Martin Maechler wrote:
>I've thought more and made experiements with R code versions
>and just now committed a new version of  p.adjust()  to R-devel
>--> https://svn.r-project.org/R/trunk/src/library/stats/R/p.adjust.R
>which does sensible NA handling by default and
>*additionally* has an "na.rm" argument (set to FALSE by
>default).  The extended 'Examples' secion on the help page
>     https://svn.r-project.org/R/trunk/src/library/stats/man/p.adjust.Rd
>shows how the new NA handling is typically much more sensible
>than using "na.rm = TRUE".
>
>Martin
>
>
> >>>>> "MM" == Martin Maechler <maechler at stat.math.ethz.ch>
> >>>>>     on Sat, 8 Jan 2005 17:19:23 +0100 writes:
>
> >>>>> "GS" == Gordon K Smyth <smyth at wehi.edu.au>
> >>>>>     on Sat, 8 Jan 2005 01:11:30 +1100 (EST) writes:
>
>     MM>     <.............>
>
>     GS> p.adjust() unfortunately gives incorrect results when
>     GS> 'p' includes NAs.  The results from topTable are
>     GS> correct.  topTable() takes care to remove NAs before
>     GS> passing the values to p.adjust().
>
>     MM> There's at least one bug in p.adjust(): The "hommel"
>     MM> method currently does not work at all with NAs (and I
>     MM> have an uncommitted fix ready for this bug).  OTOH, the
>     MM> current version of p.adjust() ``works'' with NA's, apart
>     MM> from Hommel's method, but by using "n = length(p)" in
>     MM> the correction formulae, i.e. *including* the NAs for
>     MM> determining sample size `n' {my fix to "hommel" would do
>     MM> this as well}.
>
>     MM> My question is what p.adjust() should do when there are
>     MM> NA's more generally, or more specifically which `n' to
>     MM> use in the correction formula. Your proposal amounts to
>     MM> ``drop NA's and forget about them till the very end''
>     MM> (where they are wanted in the result), i.e., your sample
>     MM> size `n' would be sum(!is.na(p)) instead of length(p).
>
>     MM> To me it doesn't seem obvious that this setting "n =
>     MM> #{non-NA observations}" is desirable for all P-value
>     MM> adjustment methods. One argument for keeping ``n = #{all
>     MM> observations}'' at least for some correction methods is
>     MM> the following "continuity" one:
>
>     MM> If only a few ``irrelevant'' (let's say > 0.5) P-values
>     MM> are replaced by NA, the adjusted relevant small P-values
>     MM> shouldn't change much, ideally not at all.  I'm really
>     MM> no scholar on this topic, but e.g. for "holm" I think I
>     MM> would want to keep ``full n'' because of the above
>     MM> continuity argument.  BTW, for "fdr", I don't see a
>     MM> straightforward way to achieve the desired continuity.
>     MM> 5D Of course, p.adjust() could adopt the possibility of
>     MM> chosing how NA's should be treated e.g. by another
>     MM> argument ``use.na = TRUE/FALSE'' and hence allow both
>     MM> versions.
>
>     MM> Feedback very welcome, particularly from ``P-value
>     MM> experts'' ;-)
>
>     MM> Martin Maechler, ETH Zurich