[Rd] subscripting a data.frame (without changing row order) changes internal row.names

Dr Gregory Jefferis jefferis at mrc-lmb.cam.ac.uk
Mon Nov 10 19:35:18 CET 2014


Dear R-devel,

Can anyone help me to understand this? It seems that subscripting the 
rows of a data.frame, without actually changing their order, somehow 
changes an internal representation of row.names that is revealed by e.g. 
dput/dump/serialize.

I have read the docs and inspected the (R) code for data.frame, 
rownames, row.names and dput without enlightenment.

df=data.frame(a=1:10, b=1)
dput(df)
df2=df[1:nrow(df), ]
# R thinks they are equal (so do I!)
all.equal(df, df2)
dput(df2)

Looking at the output of the two dput calls:

> dput(df)
structure(list(a = 1:10, b = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)),
  .Names = c("a", "b"), row.names = c(NA, -10L), class = "data.frame")
> dput(df2)
structure(list(a = 1:10, b = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)),
  .Names = c("a", "b"), row.names = c(NA, 10L), class = "data.frame")

We have row.names = c(NA, -10L) in the first case and row.names = c(NA, 
10L) in the second, so somehow these objects have a different internal 
representation.
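
To make the difference visible without going through dput, here is a small sketch that compares what the usual accessors report with what is actually stored. It relies on .row_names_info(), which is exported from base but is a low-level helper; the meaning of its type argument is taken from its help page (0L returns the stored attribute, 1L the row count with a negative sign for "automatic" row names, 2L the plain row count).

```r
df  <- data.frame(a = 1:10, b = 1)
df2 <- df[1:nrow(df), ]

# The ordinary accessors expand the compact form, so at this level
# the two data frames are indistinguishable.
identical(attr(df, "row.names"), attr(df2, "row.names"))
identical(row.names(df), row.names(df2))

# The stored attribute, returned without expansion.
.row_names_info(df,  type = 0L)
.row_names_info(df2, type = 0L)

# Row count, negated when the row names are flagged as "automatic".
.row_names_info(df,  type = 1L)
.row_names_info(df2, type = 1L)
```

Any difference between the two objects should show up only in the last four calls, which is consistent with all.equal() reporting no difference.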

Can anyone explain why? This has come up because:

> library(digest)
> digest(df)==digest(df2)
[1] FALSE

digest uses serialize under the hood, but serialize, dput and dump all 
show the same effect (I've pasted an example below using dump and 
md5sum, both of which ship with R).

Many thanks for any enlightenment! More generally, is there any way to 
calculate a digest of a data.frame that gets round this issue, or is 
that not possible?
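
One possible direction, offered as a sketch rather than a confirmed fix: put the row names into a single canonical form before hashing. The helper normalise_row_names() below is hypothetical (it is not part of base R or digest). It relies on the documented behaviour that assigning NULL to row.names resets them to "automatic" ones, and it only does so when the existing row names are just 1..n, so that meaningful labels are kept.

```r
# Hypothetical helper, not part of base R or the digest package.
normalise_row_names <- function(x) {
  stopifnot(is.data.frame(x))
  n <- nrow(x)
  # Only touch row names that are simply 1..n; real labels stay as they are.
  if (identical(row.names(x), as.character(seq_len(n)))) {
    row.names(x) <- NULL  # resets to "automatic" row names
  }
  x
}

df  <- data.frame(a = 1:10, b = 1)
df2 <- df[1:nrow(df), ]

# After normalisation the stored row.names attribute should match.
identical(.row_names_info(normalise_row_names(df),  type = 0L),
          .row_names_info(normalise_row_names(df2), type = 0L))
```

The normalised objects could then be passed to digest(). This only addresses the row.names representation; I have not verified that nothing else differs between the serialized forms (attribute order, for instance), so whether the digests then agree would still need checking.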

Best wishes,

Greg.


A digest using base R:

library(tools)
td=tempfile()
dir.create(td)
tempfiles=file.path(td,c("df", "df2"))
dump("df",tempfiles[1])
dump("df2",tempfiles[2])
md5sum(tempfiles)

# different md5sum

> sessionInfo() # for my laptop but also observed on R 3.1.2
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] tools     stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] nat_1.5.14      nat.utils_0.4.2 digest_0.6.4    Rvcg_0.9        devtools_1.6.1  igraph_0.7.1
[7] testthat_0.9.1  rgl_0.93.1098

loaded via a namespace (and not attached):
 [1] codetools_0.2-9   filehash_2.2-2    nabor_0.4.3       parallel_3.1.1    plyr_1.8.1
 [6] Rcpp_0.11.3       rstudio_0.98.1062 rstudioapi_0.1    XML_3.98-1.1      yaml_2.1.13

--
Gregory Jefferis, PhD
Division of Neurobiology
MRC Laboratory of Molecular Biology
Francis Crick Avenue
Cambridge Biomedical Campus
Cambridge, CB2 OQH, UK

http://www2.mrc-lmb.cam.ac.uk/group-leaders/h-to-m/g-jefferis
http://jefferislab.org
http://flybrain.stanford.edu


