[Rd] 1954 from NA

Bertram, Alexander alex using bedatadriven.com
Mon May 24 14:29:44 CEST 2021


Dear Adrian,
I just wanted to chime in and underscore Tomas' point: the payload bits of
IEEE 754 floating point values are no place to store data that you care
about or need to keep. This is not only a matter of the R APIs, but also of
how processors handle floating point values and signaling and quiet
NaNs. It is very difficult to reason about when and under which
circumstances these bits are preserved. I spent a lot of time working on
Renjin's handling of these values, and I can assure you that any such scheme
will end in tears.
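
To make the fragility concrete, here is a small sketch (the helper name
payload_bytes is made up for illustration). It dumps the raw bytes of
NA_real_, of an ordinary NaN, and of values produced by mixing the two.
R's own documentation for NA says that when NaN is also involved in a
computation, either NA or NaN may result, so no output is claimed for the
mixed cases.

```r
# Hypothetical helper: the 8 raw bytes of a double, in little-endian order.
payload_bytes <- function(x) writeBin(x, raw(), endian = "little")

# NA_real_ is a NaN whose low word encodes 1954.
# On typical platforms the first two bytes are a2 07 (0x07A2 == 1954).
print(payload_bytes(NA_real_))

# An ordinary NaN carries a different payload.
print(payload_bytes(NaN))

# Mixing the two: which payload survives is platform dependent.
print(payload_bytes(NA_real_ + NaN))
print(payload_bytes(NaN + NA_real_))

# is.na() is TRUE for both NA and NaN; is.nan() is TRUE only for NaN.
# So the mixed result is always "missing", but which kind is not guaranteed.
mixed <- NA_real_ + NaN
print(is.na(mixed))
print(is.nan(mixed))
```

If the two mixed lines print different bytes on your machine, that is
exactly the kind of behaviour a payload-based scheme would have to survive.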

A far, far better option is to use R's attributes to store this kind of
metadata. This is exactly what this language feature is for. There is
already a standard 'levels' attribute that holds the labels of factors like
"Yes", "No", "Refused", "Interviewer error", etc. In the past, I've worked
on projects where we stored an additional attribute like "missingLevels"
that holds extra metadata on which levels should be used in which kind of
analysis. That way, you can preserve all the information, and then write a
utility function which automatically applies the appropriate logic to a whole
data frame just before passing the data to an analysis function. This is
also important because in surveys like this, different values should be
excluded at different times. For example, you might want to include all
responses in a data quality report, but exclude interviewer error and
refusals when conducting a PCA or fitting a model.
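
A minimal sketch of what this could look like, under some assumptions:
"missingLevels" is just a custom attribute name (not an R standard), and
the helpers drop_missing_levels() and prepare_for_analysis() are
hypothetical names invented for this example, not functions from any
package.

```r
# Survey answers as a factor; the labels are the ones mentioned above.
answers <- factor(
  c("Yes", "No", "Refused", "Yes", "Interviewer error", "No"),
  levels = c("Yes", "No", "Refused", "Interviewer error")
)

# Custom attribute (an assumption of this sketch) listing the levels
# that should be treated as missing in a substantive analysis.
attr(answers, "missingLevels") <- c("Refused", "Interviewer error")

# Hypothetical helper: turn the declared missing levels of one column into NA.
drop_missing_levels <- function(x) {
  missing <- attr(x, "missingLevels")
  if (!is.factor(x) || is.null(missing)) return(x)
  keep <- setdiff(levels(x), missing)
  factor(as.character(x), levels = keep)
}

# Hypothetical helper: apply the same logic to every column of a data frame.
prepare_for_analysis <- function(df) {
  df[] <- lapply(df, drop_missing_levels)
  df
}

survey <- data.frame(id = 1:6)
survey$q1 <- answers  # assign directly so the custom attribute is kept

# Data quality report: all four levels are still there.
print(table(survey$q1))

# Modelling: refusals and interviewer errors become NA.
print(table(prepare_for_analysis(survey)$q1, useNA = "ifany"))
```

One caveat with this design: subsetting a factor with `[` drops custom
attributes, so the metadata has to be re-attached (or carried by a small
S3 class) if the data are filtered before the helper is applied.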

Best,
Alex

On Mon, May 24, 2021 at 2:03 PM Adrian Dușa <dusa.adrian using gmail.com> wrote:

> On Mon, May 24, 2021 at 1:31 PM Tomas Kalibera <tomas.kalibera using gmail.com>
> wrote:
>
> > [...]
> >
> > For the reasons I explained, I would be against such a change. Keeping the
> > data on the side, as also recommended by others on this list, would allow
> > you for a reliable implementation. I don't want to support fragile package
> > code building on unspecified R internals, and in this case particularly
> > internals that themselves have not stood the test of time, so are at high
> > risk of change.
> >
> I understand, and it makes sense.
> We'll have to wait for the R internals to settle (this really is
> surprising, I wonder how other software have solved this). In the meantime,
> I will probably go ahead with NaNs.
>
> Thank you again,
> Adrian
>
> ______________________________________________
> R-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>


-- 
Alexander Bertram
Technical Director
*BeDataDriven BV*

Web: http://bedatadriven.com
Email: alex using bedatadriven.com
Tel. Nederlands: +31(0)647205388
Skype: akbertram
