[R] Use of geometric mean .. in good data analysis

Wed Jan 24 01:22:40 CET 2024

I've advised people consulting me that if their data is loaded with 
zeros, while they are absolutely certain that something should be where 
the zeros are, then they either need a better measuring tool, or they 
need to carefully document the detection limits and then note what 
fraction of the data is really below instrument limits.  That fraction 
is important information as it stands, but they don't want to go writing 
fairy tales based on things not seen.
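
To make the "note what fraction is below the limit" step concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the data values, the 0.5 detection limit, and the helper name summarize_nondetects are made up, and non-detects are coded as None rather than as zeros so they cannot leak into any arithmetic.

```python
import statistics

def summarize_nondetects(values, detection_limit):
    """Report the censored fraction and summarize only the measured values.

    `values` uses None for a result reported as below the detection limit
    (a hypothetical coding convention, chosen so that non-detects are never
    mistaken for real zeros).
    """
    detected = [v for v in values if v is not None]
    n_below = len(values) - len(detected)
    return {
        "detection_limit": detection_limit,
        "n_total": len(values),
        "n_below_limit": n_below,
        "fraction_below_limit": n_below / len(values),
        # These summaries describe only what was actually seen.
        "median_detected": statistics.median(detected) if detected else None,
        "max_detected": max(detected) if detected else None,
    }

# Hypothetical lab results; None marks a non-detect.
results = [None, None, 0.8, 1.2, 5.0, None, 2.0, 40.0]
summary = summarize_nondetects(results, detection_limit=0.5)
print(summary)
```

The point of the design is that the censored fraction is reported as a result in its own right, next to a summary of the detected values, instead of being folded into a single mean.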

On 1/22/24 12:57, Jeff Newmiller via R-help wrote:

> Still OT... but here is my own (I think previously mentioned here) rant on people thrashing about with log transformation and an all-too-common kludge to deal with zeros mixed among small numbers... https://gist.github.com/jdnewmil/99301a88de702ad2fcbaef33326b08b4
>
> OP perhaps posting a link here to your question posed wherever you end up with it will help shorten this thread.
>
> On January 22, 2024 12:23:20 PM PST, Bert Gunter<bgunter.4567 using gmail.com>  wrote:
>> Ah.... LOD's, typically LLOD's ("lower limits of detection").
>>
>> Disclaimer: I am *NOT* in any sense an expert on such matters. What follows
>> are just some comments based on my personal experience. Please filter
>> accordingly. Also, while I kept it on list as Martin suggested it might be
>> useful to do so, most folks probably can safely ignore the rant that
>> follows as off topic and not of interest. So you've been warned!!
>>
>> The rant:
>> My experience is: data that contain a "bunch" of values that are, e.g.
>> below a LLOD, are frequently reported and/or analyzed by various ad hoc,
>> and imho, uniformly bad methods. e.g.:
>>
>> 1) The censored values are recorded and analyzed as at the LLOD;
>> 2) The censored values are recorded and analyzed at some arbitrary value
>> below the LLOD, like LLOD/2;
>> 3) The censored values are "imputed" by ad hoc methods, e.g. uniform
>> random values between 0 and the LLOD for left censoring.
>>
>> To repeat, *IMO*, all of this is junk and will produce misleading
>> statistical results. Whether they mislead enough to substantively affect
>> the science or regulatory decisions depends on the specifics of the
>> circumstances. I accept no general claim as to their innocuousness.
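
Since the thread is about geometric means, a small Python sketch may help show how much substitution rules like 1) and 2) can drive the answer. The data, the LLOD of 0.5, and the function names are all hypothetical; this only illustrates the sensitivity and is not a recommended analysis.

```python
import math

def geometric_mean(xs):
    """Geometric mean of strictly positive numbers."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def geomean_with_substitution(detected, n_censored, fill_value):
    """Geometric mean after replacing every censored value with fill_value."""
    return geometric_mean(list(detected) + [fill_value] * n_censored)

# Hypothetical sample: 5 detected values and 5 non-detects, LLOD = 0.5.
detected = [0.8, 1.2, 2.0, 5.0, 40.0]
n_censored = 5
llod = 0.5

gm_at_llod = geomean_with_substitution(detected, n_censored, llod)
gm_at_half = geomean_with_substitution(detected, n_censored, llod / 2)
gm_at_tenth = geomean_with_substitution(detected, n_censored, llod / 10)

for label, gm in [("LLOD", gm_at_llod),
                  ("LLOD/2", gm_at_half),
                  ("LLOD/10", gm_at_tenth)]:
    print(f"censored values set to {label}: geometric mean = {gm:.3f}")
```

With half of the sample censored, halving the fill value divides the geometric mean by sqrt(2) no matter what the detected values are, so the arbitrary choice of substitute shows up directly in the result.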
>>
>> Further:
>>
>> a) When you have a "lot" of censored values -- 50%? 75%?, 25%? -- face facts: you
>> have (practically) no useful information from the values that you do have
>> to infer what the distribution of values that you don't have looks like.
>> All one can sensibly do is say that x% of the values are below a LOD and
>> here's the distribution of what lies above. Presumably, if you have such
>> data conditional on covariates with the obvious intent to determine the
>> relationship to those covariates, you could analyze the percentages of
>> LLOD's and known values separately. There are undoubtedly more
>> sophisticated methods out there, so this is where you need to go to the
>> literature to see what might suit; though I think it will still have to
>> come down to looking at these separately (e.g. with extra parameters to
>> account for unmeasurable values). Another way of saying this is: any
>> analysis which treats all the data as arising from a single distribution
>> will depend more on the assumptions you make than on the data. So good luck
>> with that!
>>
>> b) If you have a "modest" amount of (known) censoring -- 5%?, 20%? 10%? --
>> methods for the analysis of censored data should be useful. My
>> understanding is that MI (multiple imputation) is regarded as a generally
>> useful approach, and there are many R packages that can do various flavors
>> of this. Again, you should consult the literature: there are very likely
>> nontechnical reviews of this topic, too, as well as online discussions and
>> tutorials.
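
As a rough idea of what a censored-data method does, here is a Python sketch of a censored-likelihood fit. Note that this is not multiple imputation, which Bert mentions; it is a simpler relative in which each non-detect contributes the probability of falling below the LLOD instead of a made-up value. It assumes the data are lognormal, which is a strong assumption that must be checked, and it uses a crude grid search in place of a real optimizer. The data, the grid ranges, and all function names are hypothetical.

```python
import math

def log_phi(z):
    """Log of the standard normal CDF, or -inf if the CDF underflows to 0."""
    p = 0.5 * math.erfc(-z / math.sqrt(2.0))
    return math.log(p) if p > 0.0 else float("-inf")

def censored_loglik(mu, sigma, log_detected, n_censored, log_lod):
    """Log-likelihood of normal(mu, sigma) on the log scale, left-censored."""
    ll = 0.0
    for y in log_detected:
        z = (y - mu) / sigma
        # Detected values contribute the normal log-density.
        ll += -math.log(sigma) - 0.5 * math.log(2.0 * math.pi) - 0.5 * z * z
    if n_censored:
        # Each non-detect contributes log P(value < LLOD).
        ll += n_censored * log_phi((log_lod - mu) / sigma)
    return ll

def fit_censored_lognormal(detected, n_censored, lod):
    """Grid-search estimate of (mu, sigma) on the log scale (illustration only)."""
    log_detected = [math.log(x) for x in detected]
    log_lod = math.log(lod)
    best_ll, best_mu, best_sigma = float("-inf"), None, None
    for i in range(201):                # mu from -5.0 to 5.0 in steps of 0.05
        mu = -5.0 + 0.05 * i
        for j in range(2, 121):         # sigma from 0.1 to 6.0 in steps of 0.05
            sigma = 0.05 * j
            ll = censored_loglik(mu, sigma, log_detected, n_censored, log_lod)
            if ll > best_ll:
                best_ll, best_mu, best_sigma = ll, mu, sigma
    return best_mu, best_sigma

# Hypothetical sample: 5 detected values, 5 non-detects, LLOD = 0.5.
detected = [0.8, 1.2, 2.0, 5.0, 40.0]
mu_censored, sigma_censored = fit_censored_lognormal(detected, 5, 0.5)
mu_ignoring, sigma_ignoring = fit_censored_lognormal(detected, 0, 0.5)

print("log-scale mean, accounting for non-detects:", mu_censored)
print("log-scale mean, ignoring non-detects:      ", mu_ignoring)
```

The fit that accounts for the non-detects gives a lower log-scale mean than the fit that drops them, but with half the sample censored the estimate leans heavily on the lognormal assumption, which is the caveat raised in point a).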
>>
>> So if you are serious about dealing with this and have a lot of data with
>> these issues, my advice would be to stop looking for ad hoc advice and dig
>> into the literature: it's one of the many areas of "data science" where
>> seemingly simple but pervasive questions require complex answers.
>>
>> And, again, heed my personal caveats.
>>
>> Thus endeth my rant.
>>
>> Cheers to all,
>> Bert
>>
>>
>>
>> On Mon, Jan 22, 2024 at 9:29 AM Rich Shepard<rshepard using appl-ecosys.com>
>> wrote:
>>
>>> On Mon, 22 Jan 2024, Martin Maechler wrote:
>>>
>>>> I think it is a good question, not really only about geo-chemistry, but
>>>> about statistics in applied sciences (and engineering for that matter).
>>>> John W Tukey (and several other of the grands of the time) had the log
>>>> transform among the "First aid transformations":
>>>>
>>>> If the data for a continuous variable must all be positive it is also
>>>> typically the case that the distribution is considerably skewed to the
>>>> right. In such a case behave as a good human who sees another human in
>>>> health distress: apply First Aid -- do the things you learned to do
>>>> quickly without too much thought, because things must happen fast -- to
>>>> hopefully save the other's life.
>>>
>>> Martin,
>>>
>>> Thanks very much. I will look further into this because toxic metals and
>>> organic compounds in geochemical collections almost always have censored
>>> lab results (below method detection limits) that range from about 15% to
>>> 80% or more, and there almost always are very high extreme values.
>>>
>>> I'll learn to understand what benefits log transforms have over
>>> compositional data analyses.
>>>
>>> Best regards,
>>>
>>> Rich
>>>
>>> ______________________________________________
>>> R-help using r-project.org  mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>> 	[[alternative HTML version deleted]]