[Rd] special handling of row.names

Wed Apr 2 08:52:22 CEST 2014

Hello, 

I think there is an inconsistency in the handling of the compact form of the row.names attributes. 

When n is the number of rows of a data.frame, the compact form is c(NA_integer_,-n), as in: 

> d <- data.frame(x=1:10)
> .Internal(inspect(d))
@104f174a8 19 VECSXP g0c1 [OBJ,NAM(2),ATT] (len=1, tl=0)
  @103a7dc60 13 INTSXP g0c4 [] (len=10, tl=0) 1,2,3,4,5,...
ATTRIB:
  @104959380 02 LISTSXP g0c0 []
    TAG: @100823078 01 SYMSXP g1c0 [MARK,LCK,gp=0x4000] "names" (has value)
    @104f17748 16 STRSXP g0c1 [NAM(2)] (len=1, tl=0)
      @10085c678 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "x"
    TAG: @10082d060 01 SYMSXP g1c0 [MARK,LCK,gp=0x4000] "row.names" (has value)
    @104f0e898 13 INTSXP g0c1 [] (len=2, tl=0) -2147483648,-10
    TAG: @100823548 01 SYMSXP g1c0 [MARK,LCK,gp=0x4000] "class" (has value)
    @104f0e8c8 16 STRSXP g0c1 [NAM(1)] (len=1, tl=0)
      @1008a7e60 09 CHARSXP g1c2 [MARK,gp=0x61,ATT] [ASCII] [cached] "data.frame"

But then, -10 becomes 10: 

> d2 <- structure( d, class = "data.frame" )
> .Internal(inspect(d2))
@104f08898 19 VECSXP g0c1 [OBJ,NAM(2),ATT] (len=1, tl=0)
  @103a7dc60 13 INTSXP g0c4 [NAM(2)] (len=10, tl=0) 1,2,3,4,5,...
ATTRIB:
  @104956150 02 LISTSXP g0c0 []
    TAG: @100823078 01 SYMSXP g1c0 [MARK,LCK,gp=0x4000] "names" (has value)
    @104f087a8 16 STRSXP g0c1 [] (len=1, tl=0)
      @10085c678 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "x"
    TAG: @10082d060 01 SYMSXP g1c0 [MARK,LCK,gp=0x4000] "row.names" (has value)
    @104f088c8 13 INTSXP g0c1 [] (len=2, tl=0) -2147483648,10
    TAG: @100823548 01 SYMSXP g1c0 [MARK,LCK,gp=0x4000] "class" (has value)
    @104f08838 16 STRSXP g0c1 [] (len=1, tl=0)
      @1008a7e60 09 CHARSXP g1c2 [MARK,gp=0x61,ATT] [ASCII] [cached] "data.frame"

This happens in row_names_gets (attrib.c), here: 

	if(OK_compact) {
	    /* we hide the length in an impossible integer vector */
	    PROTECT(val = allocVector(INTSXP, 2));
	    INTEGER(val)[0] = NA_INTEGER;
	    INTEGER(val)[1] = n;
	    ans =  installAttrib(vec, R_RowNamesSymbol, val);
	    UNPROTECT(1);
	    return ans;
	}

I believe it should be INTEGER(val)[1] = -n; for consistency. 

BTW, perhaps structure should be internalized to prevent special handling of row.names when it does not make sense. Here is structure: 

structure
function (.Data, ...)
{
    attrib <- list(...)
    if (length(attrib)) {
        specials <- c(".Dim", ".Dimnames", ".Names", ".Tsp",
            ".Label")
        replace <- c("dim", "dimnames", "names", "tsp", "levels")
        m <- match(names(attrib), specials)
        ok <- (!is.na(m) & m)
        names(attrib)[ok] <- replace[m[ok]]
        if ("factor" %in% attrib[["class", exact = TRUE]] &&
            typeof(.Data) == "double")
            storage.mode(.Data) <- "integer"
        attributes(.Data) <- c(attributes(.Data), attrib)
    }
    return(.Data)
}

When I do structure( d, class = "data.frame" ), eventually this line is executed: 

attributes(.Data) <- c(attributes(.Data), attrib)

So, first attributes are retrieved, and because of R special handling of row.names, it gets promoted to 1:n in getAttrib. 

Then, we want to set the row.names attribute, special handling again in setAttrib, leading to row_names_gets, R actually loops over the attribute to check if it is of the form 1:n, and if it is it brings back the compact form (making a mistake along the way). 

This looks like a waste of resources. 

Romain