[R] Use of geometric mean .. in good data analysis

Mon Jan 22 21:57:34 CET 2024

Still OT... but here is my own (I think previously mentioned here) rant on people thrashing about with log transformation and an all-too-common kludge to deal with zeros mixed among small numbers... https://gist.github.com/jdnewmil/99301a88de702ad2fcbaef33326b08b4

OP, perhaps posting a link here to your question, wherever you end up posing it, will help shorten this thread.

On January 22, 2024 12:23:20 PM PST, Bert Gunter <bgunter.4567 using gmail.com> wrote:
>Ah.... LOD's, typically LLOD's ("lower limits of detection").
>
>Disclaimer: I am *NOT* in any sense an expert on such matters. What follows
>are just some comments based on my personal experience. Please filter
>accordingly. Also, while I kept it on list as Martin suggested it might be
>useful to do so, most folks probably can safely ignore the rant that
>follows as off topic and not of interest. So you've been warned!!
>
>The rant:
>My experience is: data that contain a "bunch" of values that are, e.g.,
>below a LLOD are frequently reported and/or analyzed by various ad hoc
>and, imho, uniformly bad methods, e.g.:
>
>1) The censored values are recorded and analyzed as at the LLOD;
>2) The censored values are recorded and analyzed at some arbitrary value
>below the LLOD, like LLOD/2;
>3) The censored values are "imputed" by ad hoc methods, e.g. uniform
>random values between 0 and the LLOD for left censoring.
>
>To repeat, *IMO*, all of this is junk and will produce misleading
>statistical results. Whether they mislead enough to substantively affect
>the science or regulatory decisions depends on the specifics of the
>circumstances. I accept no general claim as to their innocuousness.
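The three substitution rules listed above can be compared in a small simulation. This is only a sketch in Python (not R), under assumptions that are not from the thread: lognormal data with log-mean 0 and log-sd 1, a hypothetical LLOD of exp(-0.5), and helper names (`geometric_mean`, `substitute`) made up for illustration. It shows only that the estimated geometric mean moves with the rule chosen.

```python
import math
import random
from statistics import fmean


def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    return math.exp(fmean([math.log(v) for v in values]))


def substitute(values, lod, rule, rng):
    """Replace values below `lod` using one of the ad hoc rules 1) to 3)."""
    out = []
    for v in values:
        if v >= lod:
            out.append(v)                 # measured value, kept as is
        elif rule == "lod":
            out.append(lod)               # rule 1: record at the LLOD
        elif rule == "half":
            out.append(lod / 2.0)         # rule 2: record at LLOD/2
        elif rule == "uniform":
            # rule 3: uniform draw on (0, LLOD]; 1 - random() avoids log(0)
            out.append(lod * (1.0 - rng.random()))
        else:
            raise ValueError(f"unknown rule: {rule}")
    return out


# Assumed setup: lognormal data and an arbitrary detection limit.
rng = random.Random(2024)
true_values = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]
lod = math.exp(-0.5)

gm_true = geometric_mean(true_values)     # only knowable in a simulation
gm_by_rule = {
    rule: geometric_mean(substitute(true_values, lod, rule, rng))
    for rule in ("lod", "half", "uniform")
}

print(f"uncensored geometric mean: {gm_true:.3f}")
for rule, gm in gm_by_rule.items():
    print(f"substitution rule {rule!r}: {gm:.3f}")
```

How far each rule lands from the uncensored value depends on the assumed distribution and on where the LLOD falls, so a rule that looks harmless in one setting says nothing about another.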
>
>Further:
>
>a) When you have a "lot" of censored values -- 50%? 75%?, 25%? -- face
>facts: you have (practically) no useful information from the values that
>you do have to infer what the distribution of values that you don't have
>looks like.
>All one can sensibly do is say that x% of the values are below a LOD and
>here's the distribution of what lies above. Presumably, if you have such
>data conditional on covariates with the obvious intent to determine the
>relationship to those covariates, you could analyze the percentages of
>LLOD's and known values separately. There are undoubtedly more
>sophisticated methods out there, so this is where you need to go to the
>literature to see what might suit; though I think it will still have to
>come down to looking at these separately (e.g. with extra parameters to
>account for unmeasurable values). Another way of saying this is: any
>analysis which treats all the data as arising from a single distribution
>will depend more on the assumptions you make than on the data. So good luck
>with that!
>
>b) If you have a "modest" amount of (known) censoring -- 5%?, 20%? 10%? --
>methods for the analysis of censored data should be useful. My
>understanding is that MI (multiple imputation) is regarded as a generally
>useful approach, and there are many R packages that can do various flavors
>of this. Again, you should consult the literature: there are very likely
>nontechnical reviews of this topic, too, as well as online discussions and
>tutorials.
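As one illustration of a method built for censored data, here is a minimal Python sketch of a censored-likelihood fit computed by EM. To be plain: this is a swapped-in technique, not the multiple imputation mentioned above, and it rests on assumptions that are not from the thread: the logs are normal, there is a single known LLOD, and about 16% of the values fall below it. The function name `fit_left_censored_normal` is hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean, pstdev


def fit_left_censored_normal(observed, n_censored, limit, iterations=200):
    """EM estimates of (mu, sigma) for normal data left-censored at `limit`.

    `observed` holds the measured values (here: logs), `n_censored` is the
    number of values known only to lie below `limit`.
    """
    n = len(observed) + n_censored
    sum1 = sum(observed)
    sum2 = sum(z * z for z in observed)
    mu, sigma = fmean(observed), pstdev(observed)   # crude starting values
    std = NormalDist()
    for _ in range(iterations):
        # E-step: first two moments of a normal truncated to z < limit
        a = (limit - mu) / sigma
        lam = std.pdf(a) / std.cdf(a)
        e1 = mu - sigma * lam
        e2 = mu * mu + sigma * sigma - sigma * lam * (limit + mu)
        # M-step: usual normal estimates from the completed moments
        mu_new = (sum1 + n_censored * e1) / n
        var_new = (sum2 + n_censored * e2) / n - mu_new * mu_new
        mu, sigma = mu_new, math.sqrt(var_new)
    return mu, sigma


# Assumed setup: lognormal(0, 1) data, LLOD = exp(-1), roughly 16% censored.
rng = random.Random(42)
values = [rng.lognormvariate(0.0, 1.0) for _ in range(2000)]
lod = math.exp(-1.0)

observed_logs = [math.log(v) for v in values if v >= lod]
n_censored = len(values) - len(observed_logs)

mu_hat, sigma_hat = fit_left_censored_normal(observed_logs, n_censored, math.log(lod))
naive_mu = fmean(observed_logs)   # what you get by dropping the censored values

print(f"censored fraction: {n_censored / len(values):.2f}")
print(f"EM estimate: mu = {mu_hat:.3f}, sigma = {sigma_hat:.3f}")
print(f"mean of observed logs only: {naive_mu:.3f}")
```

The estimate is only as good as the normality assumption on the log scale, which is the same caveat raised in a): the more censoring there is, the more the answer leans on that assumption rather than on the data.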
>
>So if you are serious about dealing with this and have a lot of data with
>these issues, my advice would be to stop looking for ad hoc advice and dig
>into the literature: it's one of the many areas of "data science" where
>seemingly simple but pervasive questions require complex answers.
>
>And, again, heed my personal caveats.
>
>Thus endeth my rant.
>
>Cheers to all,
>Bert
>
>
>
>On Mon, Jan 22, 2024 at 9:29 AM Rich Shepard <rshepard using appl-ecosys.com>
>wrote:
>
>> On Mon, 22 Jan 2024, Martin Maechler wrote:
>>
>> > I think it is a good question, not really only about geo-chemistry, but
>> > about statistics in applied sciences (and engineering for that matter).
>>
>> > John W Tukey (and several other of the grands of the time) had the log
>> > transform among the "First aid transformations":
>> >
>> > If the data for a continuous variable must all be positive it is also
>> > typically the case that the distribution is considerably skewed to the
>> > right. In such a case behave as a good human who sees another human in
>> > health distress: apply First Aid -- do the things you learned to do
>> > quickly without too much thought, because things must happen fast ---to
>> > hopefully save the other's life.
>>
>> Martin,
>>
>> Thanks very much. I will look further into this because toxic metals and
>> organic compounds in geochemical collections almost always have censored
>> lab results (below method detection limits) that range from about 15% to
>> 80% or more, and there almost always are very high extreme values.
>>
>> I'll learn to understand what benefits log transforms have over
>> compositional data analyses.
>>
>> Best regards,
>>
>> Rich
>>
>> ______________________________________________
>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
