[R] Corrupt data frame construction - bug?

Duncan Murdoch murdoch at stats.uwo.ca
Thu Apr 30 09:53:26 CEST 2009


On 29/04/2009 9:21 PM, Steven McKinney wrote:
> Thanks Duncan,
> 
> Comments and a proposed bug fix in-line below:

Thanks; sorry for the misinformation about the $ method.

I'm not going to have time today to look at the patch, but will check it 
out tomorrow, unless someone else gets there first.

Duncan Murdoch

>  
> 
>> -----Original Message-----
>> From: Duncan Murdoch [mailto:murdoch at stats.uwo.ca]
>> Sent: Wednesday, April 29, 2009 5:10 PM
>> To: Steven McKinney
>> Cc: R-help at r-project.org
>> Subject: Re: [R] Corrupt data frame construction - bug?
>>
>> On 29/04/2009 6:41 PM, Steven McKinney wrote:
>>> Hi useRs,
>>>
>>> A recent coding infelicity along these lines yielded a corrupt data
>>> frame.
>>>
>>> foo <- matrix(1:12, nrow = 3)
>>> bar <- data.frame(foo)
>>> bar$NewCol <- foo[foo[, 1] == 4, 4]
>>> bar
>>> lapply(bar, length)
>>>
>>>
>>>
>>>
>>>> foo <- matrix(1:12, nrow = 3)
>>>> bar <- data.frame(foo)
>>>> bar$NewCol <- foo[foo[, 1] == 4, 4]
>>>> bar
>>>   X1 X2 X3 X4 NewCol
>>> 1  1  4  7 10   <NA>
>>> 2  2  5  8 11   <NA>
>>> 3  3  6  9 12   <NA>
>>> Warning message:
>>> In format.data.frame(x, digits = digits, na.encode = FALSE) :
>>>   corrupt data frame: columns will be truncated or padded with NAs
>>>> lapply(bar, length)
>>> $X1
>>> [1] 3
>>>
>>> $X2
>>> [1] 3
>>>
>>> $X3
>>> [1] 3
>>>
>>> $X4
>>> [1] 3
>>>
>>> $NewCol
>>> [1] 0
>>>
>>>
>>> Is this a bug in the data.frame machinery?
>>> If an attempt is made to add a new column to a data frame, and the
>> new
>>> object does not have length = number of rows of data frame, or
> cannot
>>> be made to have such length via recycling, shouldn't an error be
>>> thrown?
>>>
>>> Instead in this example I end up with a "corrupt data frame" having
>>> one zero-length column.
>>>
>>>
>>> Should this be reported as a bug, or did I misinterpret the
>>> documentation?
>> I don't think "$" uses any data.frame machinery.  You are working at a
>> lower level.
>>
>> If you had added the new column using
>>
>> bar <- data.frame(bar, NewCol=foo[foo[, 1] == 4, 4])
>>
>> you would have seen the error:
>>
>> Error in data.frame(bar, NewCol = foo[foo[, 1] == 4, 4]) :
>>    arguments imply differing number of rows: 3, 0
>>
>> But since you treated it as a list, it let you go ahead and create
>> something that was labelled as a data.frame but wasn't.  This is one
> of
>> the reasons some people prefer S4 methods:  it's easier to protect
>> against people who mislabel things.
>>
> 
> I did some more digging on '$' - there is a data.frame method for it:
> 
>> getAnywhere("$<-.data.frame" )
> A single object matching '$<-.data.frame' was found
> It was found in the following places
>   package:base
>   registered S3 method for $<- from namespace base
>   namespace:base
> with value
> 
> function (x, i, value) 
> {
>     cl <- oldClass(x)
>     class(x) <- NULL
>     nrows <- .row_names_info(x, 2L)
>     if (!is.null(value)) {
>         N <- NROW(value)
>         if (N > nrows) 
>             stop(gettextf("replacement has %d rows, data has %d", 
>                 N, nrows), domain = NA)
>         if (N < nrows && N > 0L) 
>             if (nrows%%N == 0L && length(dim(value)) <= 1L) 
>                 value <- rep(value, length.out = nrows)
>             else stop(gettextf("replacement has %d rows, data has %d", 
>                 N, nrows), domain = NA)
>         if (is.atomic(value)) 
>             names(value) <- NULL
>     }
>     x[[i]] <- value
>     class(x) <- cl
>     return(x)
> }<environment: namespace:base>
> 
> 
> I placed a browser() command before return(x) and did some poking
> around.
> 
> It seems to me there's a bug in this function.  It should be able to
> detect the problem I threw at it, and throw an error as you point out is
> thrown by the other data.frame assign method.
> 
> 
> I modified the rows
>           if (N < nrows && N > 0L) 
>             if (nrows%%N == 0L && length(dim(value)) <= 1L)
> to read
>            if (N < nrows) 
>             if (N > 0L && nrows%%N == 0L && length(dim(value)) <= 1L)
> 
> as in
> 
> "$<-.data.frame" <-
> function (x, i, value) 
> {
>     cl <- oldClass(x)
>     class(x) <- NULL
>     nrows <- .row_names_info(x, 2L)
>     if (!is.null(value)) {
>         N <- NROW(value)
>         if (N > nrows) 
>             stop(gettextf("replacement has %d rows, data has %d", 
>                 N, nrows), domain = NA)
>         if (N < nrows) 
>             if (N > 0L && nrows%%N == 0L && length(dim(value)) <= 1L) 
>                 value <- rep(value, length.out = nrows)
>             else stop(gettextf("replacement has %d rows, data has %d", 
>                 N, nrows), domain = NA)
>         if (is.atomic(value)) 
>             names(value) <- NULL
>     }
>     x[[i]] <- value
>     class(x) <- cl
>     return(x)
> } 
> 
> Now it detects the problem I created, in the fashion you demonstrated
> above for the replacement using data.frame().
> 
>> foo <- matrix(1:12, nrow = 3)
>> bar <- data.frame(foo)
>> bar$NewCol <- foo[foo[, 1] == 4, 4]
> Error in `$<-.data.frame`(`*tmp*`, "NewCol", value = integer(0)) : 
>   replacement has 0 rows, data has 3
> 
> It doesn't appear to stumble on weird data frames (these from the
> ?data.frame help page)
> 
> 
>> L3 <- LETTERS[1:3]
>> (d <- data.frame(cbind(x=1, y=1:10), fac=sample(L3, 10,
> replace=TRUE)))
>> (d0  <- d[, FALSE]) # NULL data frame with 10 rows
>  
>> (d.0 <- d[FALSE, ]) # <0 rows> data frame  (3 cols)
> 
>> (d00 <- d0[FALSE,])  # NULL data frame with 0 rows
>  
>> d0$NewCol <- foo[foo[, 1] == 4, 4]
> Error in `$<-.data.frame`(`*tmp*`, "NewCol", value = integer(0)) : 
>   replacement has 0 rows, data has 10
> 
> ### Catches this problem above alright.
> 
>> d.0$NewCol <- foo[foo[, 1] == 4, 4]
>> d.0
> [1] x      y      fac    NewCol
> <0 rows> (or 0-length row.names)
> 
> ### Lets the above one through alright.
> 
>> d00$NewCol <- foo[foo[, 1] == 4, 4]
>>
>> d00
> [1] NewCol
> <0 rows> (or 0-length row.names)
> ### Lets the above one through alright.
> 
> 
> Would the above modification work to fix this problem?
> 
> 
> 
> 
> 
>> Duncan Murdoch
>>
>>>
>>>
>>>
>>>> sessionInfo()
>>> R version 2.9.0 (2009-04-17)
>>> powerpc-apple-darwin8.11.1
>>>
>>> locale:
>>> en_CA.UTF-8/en_CA.UTF-8/C/C/en_CA.UTF-8/en_CA.UTF-8
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>> [1] nlme_3.1-90
>>>
>>> loaded via a namespace (and not attached):
>>> [1] grid_2.9.0      lattice_0.17-22 tools_2.9.0
>>>
>>>
>>> Also occurs on Windows box with R 2.8.1
>>>
>>>
>>>
>>> Steven McKinney
>>>
>>> Statistician
>>> Molecular Oncology and Breast Cancer Program British Columbia Cancer
>>> Research Centre
>>>
>>> email: smckinney +at+ bccrc +dot+ ca
>>>
>>> tel: 604-675-8000 x7561
>>>
>>> BCCRC
>>> Molecular Oncology
>>> 675 West 10th Ave, Floor 4
>>> Vancouver B.C.
>>> V5Z 1L3
>>> Canada
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>




More information about the R-help mailing list