[Rd] Wrong number of names?

Martin Maechler
Mon Nov 1 14:10:08 CET 2021


>>>>> Duncan Murdoch 
>>>>>     on Mon, 1 Nov 2021 06:36:17 -0400 writes:

    > The StackOverflow post
    > https://stackoverflow.com/a/69767361/2554330 discusses a
    > dataframe which has a named numeric column of length 1488
    > that has 744 names. I don't think this is ever legal, but
    > am I wrong about that?

    > The `dat.rds` file mentioned in the post is temporarily
    > available online in case anyone else wants to examine it.

    > Assuming that the file contains a badly formed object, I
    > wonder if readRDS() should do some sanity checks as it
    > reads.

    > Duncan Murdoch

Good question.

In the meantime, I've also added a bit to the SO page
above, e.g.,

---------------------------------------------------------------------------

d <- readRDS("<.....>dat.rds")
str(d)
## 'data.frame':	1488 obs. of  4 variables:
##  $ facet_var: chr  "AUT" "AUT" "AUT" "AUT" ...
##  $ date     : Date, format: "2020-04-26" "2020-04-27" ...
##  $ variable : Factor w/ 2 levels "arima","prophet": 1 1 1 1 1 1 1 1 1 1 ...
##  $ score    : Named num  2.74e-06 2.41e-06 2.48e-06 2.39e-06 2.79e-06 ...
##   ..- attr(*, "names")= chr [1:744] "new_confirmed10" "new_confirmed10" "new_confirmed10" "new_confirmed10" ...

ds <- d$score
c(length(ds), length(names(ds)))
## 1488   744

dput(ds) # -> 

##  *** caught segfault ***
## address (nil), cause 'memory not mapped'

---------------------------------------------------------------------------

This "proves" that dat.rds really contains an invalid object:
a simple dput(.) immediately gives a segmentation fault.

I think we are aware that using C code and, say, .Call(..) one
can create all kinds of invalid objects "easily", and I think
it's clear that it's not feasible to check the validity of such
objects "everywhere".

Your proposal to have at least the deserialization code used in
readRDS() do (at least *some*) validity checks seems good, but
maybe we should think of more cases, and/or do such validity
checks already during serialization (i.e., in saveRDS() here)?
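
To make the idea of such a validity check concrete, here is a rough
R-level sketch for the one case seen above (a "names" attribute whose
length differs from the vector's length). The function name
find_bad_names() and its `path` argument are hypothetical and not part
of R; a real check would presumably live in the C (de)serialization
code and cover more cases.

```r
## Hypothetical helper, not part of base R: report every (sub)vector whose
## "names" attribute has a different length than the vector itself.
find_bad_names <- function(x, path = "x") {
    bad <- character(0)
    if (is.atomic(x) || is.list(x)) {
        ux <- unclass(x)              # avoid length() / names() method dispatch
        nms <- names(ux)
        if (!is.null(nms) && length(nms) != length(ux))
            bad <- c(bad, sprintf("%s: length %d but %d names",
                                  path, length(ux), length(nms)))
        if (is.list(ux))              # recurse into list elements / columns
            for (i in seq_along(ux))
                bad <- c(bad, find_bad_names(ux[[i]],
                                             sprintf("%s[[%d]]", path, i)))
    }
    bad
}

## A well-formed object gives nothing to report.  Note that names<-
## pads too-short names with NA, so pure R code stays valid here:
x <- c(1, 2, 3, 4)
names(x) <- c("a", "b")
find_bad_names(x)
## character(0)
```

Applied to the `d` from dat.rds above, this would be expected to flag
the `score` column, given that length(names(ds)) could be computed
there without a crash.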

Such questions are really for those who understand more than
I do about (de)serialization in R, its performance bottlenecks, etc.
Given the speed impact, we should probably make such checks *optional*
but have them *on* by default, e.g., at least for saveRDS()?
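
As an illustration only, an optional-but-on-by-default check could be
sketched at the R level as a wrapper around saveRDS(). The names
save_rds_checked(), names_ok(), the `check` argument, and the
"rdsCheck" option are all made up for this sketch; none of them exist
in R, and the check only looks at the object and its top-level
elements.

```r
## Hypothetical: TRUE if v has no names, or as many names as elements.
names_ok <- function(v) {
    if (!(is.atomic(v) || is.list(v))) return(TRUE)  # not a vector: skip
    uv <- unclass(v)                  # avoid length() / names() dispatch
    is.null(names(uv)) || length(names(uv)) == length(uv)
}

## Hypothetical wrapper, not an existing R API: validate, then save.
## The check is on by default and can be switched off per call or via
## the (made-up) option "rdsCheck".
save_rds_checked <- function(object, file, ...,
                             check = getOption("rdsCheck", TRUE)) {
    if (isTRUE(check)) {
        elems <- if (is.list(object)) unclass(object) else list()
        ok <- names_ok(object) && all(vapply(elems, names_ok, logical(1)))
        if (!ok)
            stop("object has a 'names' attribute of the wrong length; not saving")
    }
    saveRDS(object, file = file, ...)
}

## Usage: a valid object round-trips as usual.
tmp <- tempfile(fileext = ".rds")
obj <- list(a = 1:3, b = c(x = 1, y = 2))
save_rds_checked(obj, tmp)
identical(readRDS(tmp), obj)
```

Callers for whom the extra pass is too slow could use `check = FALSE`
or set the option once per session.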

Martin


