[R] Use of geometric mean .. in good data analysis

Mon Jan 22 21:47:53 CET 2024

In the spirit of Martin's comments, it is perhaps worthwhile to note one of
the pertinent quotes of John Tukey (whom I actually knew):
"The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of data."
<https://www.azquotes.com/quote/603406>

"Sunset Salvo" by John Tukey in The American Statistician, Volume 40, No. 1
(pp. 72-76), www.jstor.org. February 1986.

Cheers,
Bert

<https://www.azquotes.com/author/14847-John_Tukey>

On Mon, Jan 22, 2024 at 12:23 PM Bert Gunter <bgunter.4567 using gmail.com> wrote:

>
> Ah.... LODs, typically LLODs ("lower limits of detection").
>
> Disclaimer: I am *NOT* in any sense an expert on such matters. What
> follows are just some comments based on my personal experience. Please
> filter accordingly. Also, while I kept it on list as Martin suggested it
> might be useful to do so, most folks probably can safely ignore the rant
> that follows as off topic and not of interest. So you've been warned!!
>
> The rant:
> My experience is: data that contain a "bunch" of values that are, e.g.,
> below a LLOD, are frequently reported and/or analyzed by various ad hoc,
> and, imho, uniformly bad methods, e.g.:
>
> 1) The censored values are recorded and analyzed as at the LLOD;
> 2) The censored values are recorded and analyzed at some arbitrary value
> below the LLOD, like LLOD/2;
> 3) The censored values are "imputed" by ad hoc methods, e.g. uniform
> random values between 0 and the LLOD for left censoring.
>
> To repeat, *IMO*, all of this is junk and will produce misleading
> statistical results. Whether they mislead enough to substantively affect
> the science or regulatory decisions depends on the specifics of the
> circumstances. I accept no general claim as to their innocuousness.
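
A minimal simulation sketch of the three substitution rules listed above, applied to the geometric mean (the statistic in the subject line). Everything in it is an assumption made for illustration: the data are simulated lognormal values, the LLOD is placed at a chosen quantile of the true distribution, and the function name `substitution_demo` and its parameters are hypothetical. It is written in Python using only the standard library; it is not taken from the thread.

```python
import math
import random
import statistics


def substitution_demo(n=10_000, mu=0.0, sigma=1.0, censor_frac=0.3, seed=1):
    """Compare geometric means after ad hoc substitution for values below an LLOD.

    Everything here is simulated: lognormal data, and an LLOD placed at the
    `censor_frac` quantile of the true distribution (both are assumptions).
    """
    rng = random.Random(seed)
    true_values = [rng.lognormvariate(mu, sigma) for _ in range(n)]
    llod = math.exp(statistics.NormalDist(mu, sigma).inv_cdf(censor_frac))

    def substituted(fill):
        # Keep detected values; replace each nondetect with fill().
        return [v if v >= llod else fill() for v in true_values]

    n_below = sum(v < llod for v in true_values)
    return {
        "frac_censored": n_below / n,
        "true": statistics.geometric_mean(true_values),
        "at_llod": statistics.geometric_mean(substituted(lambda: llod)),
        "half_llod": statistics.geometric_mean(substituted(lambda: llod / 2)),
        # (1 - random()) lies in (0, 1], so the draw is never exactly zero.
        "uniform": statistics.geometric_mean(
            substituted(lambda: llod * (1.0 - rng.random()))
        ),
    }


for name, value in substitution_demo().items():
    print(f"{name:>13}: {value:.3f}")
```

Substituting the LLOD itself can only push the geometric mean up, since every nondetect is replaced by something larger than its true value. How far the other two rules land from the truth depends on the assumed distribution and the censoring fraction, which is the point about specifics made above.
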
>
> Further:
>
> a) When you have a "lot" of censored values -- 50%? 75%?, 25%? -- face facts: you
> have (practically) no useful information from the values that you do have
> to infer what the distribution of values that you don't have looks like.
> All one can sensibly do is say that x% of the values are below a LOD and
> here's the distribution of what lies above. Presumably, if you have such
> data conditional on covariates with the obvious intent to determine the
> relationship to those covariates, you could analyze the percentages of
> LLODs and known values separately. There are undoubtedly more
> sophisticated methods out there, so this is where you need to go to the
> literature to see what might suit; though I think it will still have to
> come down to looking at these separately (e.g. with extra parameters to
> account for unmeasurable values). Another way of saying this is: any
> analysis which treats all the data as arising from a single distribution
> will depend more on the assumptions you make than on the data. So good luck
> with that!
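
A small sketch of the "report them separately" idea in (a): state the fraction of results below the LLOD, and summarize only the values that were actually measured. The representation of nondetects as `None` and the function name `summarize_separately` are assumptions for illustration, not anything prescribed in the thread.

```python
import statistics


def summarize_separately(results):
    """Summarize nondetects and detected values as two separate pieces.

    `results` is a list in which a nondetect (below the LLOD) is stored as
    None and a detected value as a float (an assumed representation).
    """
    detected = sorted(v for v in results if v is not None)
    frac_below = (len(results) - len(detected)) / len(results)
    # Quartiles describe only the part of the distribution that was observed.
    q1, median, q3 = statistics.quantiles(detected, n=4)
    return {
        "frac_below_llod": frac_below,
        "n_detected": len(detected),
        "detected_quartiles": (q1, median, q3),
    }


example = [None, None, 1.0, 2.0, 3.0, 4.0, 5.0, None, 6.0, 7.0]
summary = summarize_separately(example)
print(summary)
```

With covariates, the same split would suggest modelling the proportion of nondetects and the detected values as two separate responses, under whatever models the literature supports for the data at hand.
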
>
> b) If you have a "modest" amount of (known) censoring -- 5%?, 20%? 10%? --
> methods for the analysis of censored data should be useful. My
> understanding is that MI (multiple imputation) is regarded as a generally
> useful approach, and there are many R packages that can do various flavors
> of this. Again, you should consult the literature: there are very likely
> nontechnical reviews of this topic, too, as well as online discussions and
> tutorials.
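
To make "methods for the analysis of censored data" in (b) a little more concrete, here is a hedged sketch of one such method. Note that it is not multiple imputation: it is a maximum-likelihood fit for left-censored data, computed with a simple EM iteration. It assumes the values are lognormal and that there is a single known LLOD; both are modelling assumptions, and the function name `fit_left_censored_lognormal` is hypothetical. Real analyses would normally use an established package instead.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def fit_left_censored_lognormal(observed, n_censored, llod, iterations=200):
    """EM estimate of (mu, sigma) on the log scale under left censoring.

    Assumes log-values are normal and that `n_censored` results are known
    only to lie below `llod`.
    """
    logs = [math.log(x) for x in observed]
    c = math.log(llod)
    n = len(logs) + n_censored
    # Start from the naive estimates based on detected values only.
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((z - mu) ** 2 for z in logs) / len(logs))
    for _ in range(iterations):
        # E-step: mean and variance of a normal truncated to (-inf, c).
        alpha = (c - mu) / sigma
        lam = STD_NORMAL.pdf(alpha) / STD_NORMAL.cdf(alpha)
        ez = mu - sigma * lam
        vz = sigma ** 2 * (1.0 - alpha * lam - lam ** 2)
        # M-step: update mu and sigma using the expected sufficient statistics.
        mu_new = (sum(logs) + n_censored * ez) / n
        ss = sum((z - mu_new) ** 2 for z in logs)
        ss += n_censored * (vz + (ez - mu_new) ** 2)
        mu, sigma = mu_new, math.sqrt(ss / n)
    return mu, sigma


# Simulated check: true mu = 0, sigma = 1, about 30% of values below the LLOD.
rng = random.Random(7)
values = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]
llod = math.exp(STD_NORMAL.inv_cdf(0.3))
detected = [v for v in values if v >= llod]
mu_hat, sigma_hat = fit_left_censored_lognormal(
    detected, len(values) - len(detected), llod
)
print(f"mu_hat = {mu_hat:.3f}, sigma_hat = {sigma_hat:.3f}")
```

Under this model the estimated geometric mean is `exp(mu_hat)`. The estimate is only as good as the lognormal assumption, which becomes harder to check from the data as the censored fraction grows.
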
>
> So if you are serious about dealing with this and have a lot of data with
> these issues, my advice would be to stop looking for ad hoc advice and dig
> into the literature: it's one of the many areas of "data science" where
> seemingly simple but pervasive questions require complex answers.
>
> And, again, heed my personal caveats.
>
> Thus endeth my rant.
>
> Cheers to all,
> Bert
>
>
>
> On Mon, Jan 22, 2024 at 9:29 AM Rich Shepard <rshepard using appl-ecosys.com>
> wrote:
>
>> On Mon, 22 Jan 2024, Martin Maechler wrote:
>>
>> > I think it is a good question, not really only about geo-chemistry, but
>> > about statistics in applied sciences (and engineering for that matter).
>>
>> > John W Tukey (and several other of the grands of the time) had the log
>> > transform among the "First aid transformations":
>> >
>> > If the data for a continuous variable must all be positive it is also
>> > typically the case that the distribution is considerably skewed to the
>> > right. In such a case behave as a good human who sees another human in
>> > health distress: apply First Aid -- do the things you learned to do
>> > quickly without too much thought, because things must happen fast ---to
>> > hopefully save the other's life.
>>
>> Martin,
>>
>> Thanks very much. I will look further into this because toxic metals and
>> organic compounds in geochemical collections almost always have censored
>> lab results (below method detection limits) that range from about 15% to
>> 80% or more, and there almost always are very high extreme values.
>>
>> I'll learn to understand what benefits log transforms have over
>> compositional data analyses.
>>
>> Best regards,
>>
>> Rich
>>
