[R] Bug in print for data frames?

Thu Nov 2 12:27:12 CET 2023

On Wed, 25 Oct 2023 09:18:26 +0300
"Christian Asseburg" <rhelp using moin.fi> wrote:

> > str(x)  
> 'data.frame':   1 obs. of  3 variables:
>  $ A: num 1
>  $ B: num 1
>  $ C:'data.frame':      1 obs. of  1 variable:
>   ..$ A: num 1
> 
> Why does the print(x) not show "C" as the name of the third element?

Interesting problem.

print.data.frame() calls format.data.frame() to prepare its argument
for printing, which in turn calls as.data.frame.list() to reconstruct a
data.frame from the formatted arguments, which in turn uses
data.frame() to actually construct the object.
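
A minimal sketch for reproducing this. The original message only shows
str(x), so the way x is built here (assigning a one-column data frame
as a column) is an assumption, chosen to match that str() output:

```r
# Rebuild an object matching the str() output quoted above.
x <- data.frame(A = 1, B = 1)
x$C <- data.frame(A = 1)   # a one-column data frame stored as a column

names(x)          # top-level names of x
names(x$C)        # name of the column inside the nested data frame
names(format(x))  # names after format.data.frame(), which print() relies on
print(x)
```

Comparing names(x) with names(format(x)) should show where the name of
the third element gets replaced.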

data.frame() is able to return combined column names, but only if the
inner data.frame has more than one column:

names(data.frame(A = 1:3, B = data.frame(C = 4:6, D = 7:9)))
# [1] "A"   "B.C" "B.D"
names(data.frame(A = 1:3, B = data.frame(C = 4:6)))
# [1] "A" "C"

This matches the behaviour documented in ?data.frame:

>> For a named or unnamed matrix/list/data frame argument that contains
>> a single column, the column name in the result is the column name in
>> the argument.
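
As a sketch of what that passage covers, the same rule can be checked
for a single-column data frame, matrix and list (the inner_* and n_*
names are only for illustration):

```r
# Single-column arguments: the result takes the argument's own column name.
inner_df  <- data.frame(C = 4:6)
inner_mat <- matrix(4:6, ncol = 1, dimnames = list(NULL, "C"))
inner_lst <- list(C = 4:6)

n_df  <- names(data.frame(A = 1:3, B = inner_df))
n_mat <- names(data.frame(A = 1:3, B = inner_mat))
n_lst <- names(data.frame(A = 1:3, B = inner_lst))

# Two inner columns: the tag "B" is kept as a prefix.
n_two <- names(data.frame(A = 1:3, B = data.frame(C = 4:6, D = 7:9)))

n_df; n_mat; n_lst; n_two
```

In the three single-column cases the tag B is dropped in favour of C;
only the two-column case keeps it as a prefix.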

Still, changing purely presentational code such as print.data.frame()
or format.data.frame() might be safe. I've tried writing a patch for
format.data.frame(), but it looks clumsy and breaks the regression
tests (which do actually check capture.output()):

--- src/library/base/R/format.R (revision 85459)
+++ src/library/base/R/format.R (working copy)
@@ -243,8 +243,16 @@
     if(!nc) return(x) # 0 columns: evade problems, notably for nrow() > 0
     nr <- .row_names_info(x, 2L)
     rval <- vector("list", nc)
-    for(i in seq_len(nc))
+    for(i in seq_len(nc)) {
        rval[[i]] <- format(x[[i]], ..., justify = justify)
+       # avoid data.frame(foo = data.frame(bar = ...)) overwriting
+       # the single column name
+       if (
+           identical(ncol(rval[[i]]), 1L) &&
+           !is.null(colnames(rval[[i]])) &&
+           colnames(rval[[i]]) != ''
+       ) colnames(rval[[i]]) <- paste(names(x)[[i]], colnames(rval[[i]]), sep = '.')
+    }
     lens <- vapply(rval, NROW, 1)
     if(any(lens != nr)) { # corrupt data frame, must have at least one column
        warning("corrupt data frame: columns will be truncated or padded with NAs")

Is it worth changing the behaviour of {print,format}.data.frame() (and
fixing the regression tests to accept the new behaviour), or would that
break too much?
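
In the meantime, the same renaming could be done at user level before
printing. A rough sketch, not a tested fix: prefix_single_columns() is
a hypothetical helper, not part of base R, and the construction of x is
assumed from the str() output quoted at the top:

```r
# Hypothetical helper: give the column of every one-column nested data
# frame an "outer.inner" name, like data.frame() does for wider ones.
prefix_single_columns <- function(x) {
  stopifnot(is.data.frame(x))
  for (i in seq_along(x)) {
    inner <- x[[i]]
    if (is.data.frame(inner) && ncol(inner) == 1L &&
        isTRUE(nzchar(names(inner)))) {
      names(inner) <- paste(names(x)[[i]], names(inner), sep = ".")
      x[[i]] <- inner
    }
  }
  x
}

x <- data.frame(A = 1, B = 1)
x$C <- data.frame(A = 1)   # assumed construction of the example object

y <- prefix_single_columns(x)
names(y$C)   # inner column name now carries the outer name as a prefix
print(y)
```

This only changes the copy that is printed; x itself keeps its original
inner column name.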

-- 
Best regards,
Ivan