[Rd] subscripting a data.frame (without changing row order) changes internal row.names

Kevin Ushey kevinushey at gmail.com
Mon Nov 10 23:21:57 CET 2014


I believe the question here is related to the sign on the compact row
names representation: why is it sometimes `c(NA, <positive>)` and
sometimes `c(NA, <negative>)` -- why the difference in sign?
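
For concreteness, the stored form can be inspected directly with
.row_names_info() instead of going through dput(). This is only a
minimal sketch, reusing the df / df2 construction from Greg's message
quoted below. Which sign the subsetted copy ends up with may depend on
the R version, so I show no output for it:

```r
df  <- data.frame(a = 1:10, b = 1)
df2 <- df[1:nrow(df), ]

# type = 0L: the internal "row.names" attribute exactly as stored
.row_names_info(df,  type = 0L)
.row_names_info(df2, type = 0L)

# type = 1L: number of rows, negated when the row names are "automatic"
.row_names_info(df,  type = 1L)
.row_names_info(df2, type = 1L)

# type = 2L: number of rows, regardless of the sign used internally
.row_names_info(df,  type = 2L)
.row_names_info(df2, type = 2L)
```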

To the best of my knowledge, older versions of R used the signed-ness
of compact row.names to differentiate between different 'types' of
data.frames, but that should no longer be necessary. Unless there is
some reason not to, I believe R should standardize on one
representation, and consider it a bug if the other is seen.
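
Until then, one possible workaround is to force a single representation
by hand before comparing or hashing. The sketch below is an assumption
on my part, not anything base R provides: normalize_row_names() is a
hypothetical helper that rewrites integer row names equal to 1..n into
the "automatic" compact form via .set_row_names(), and leaves all other
row names alone.

```r
# Hypothetical helper (not part of base R): rewrite integer row names that
# are exactly 1..n into the "automatic" compact form c(NA, -n).
normalize_row_names <- function(x) {
  stopifnot(is.data.frame(x))
  n  <- .row_names_info(x, type = 2L)  # number of rows, sign ignored
  rn <- attr(x, "row.names")           # attr() expands the compact form
  if (is.integer(rn) && identical(rn, seq_len(n))) {
    attr(x, "row.names") <- .set_row_names(n)
  }
  x  # character or non-sequential row names are returned untouched
}

df  <- data.frame(a = 1:10, b = 1)
df2 <- df[1:nrow(df), ]

# Compare the two objects after forcing the same representation
identical(normalize_row_names(df), normalize_row_names(df2))
```

I would still check that digest() agrees on the normalized objects in
your R version, since serialize() is sensitive to details (such as
attribute order) that identical() ignores by default.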

Of course, I could be wrong, so I offer my understanding only as
a way of invoking Cunningham's law...

Cheers,
Kevin

On Mon, Nov 10, 2014 at 12:05 PM, Joshua Ulrich <josh.m.ulrich at gmail.com> wrote:
> On Mon, Nov 10, 2014 at 12:35 PM, Dr Gregory Jefferis
> <jefferis at mrc-lmb.cam.ac.uk> wrote:
>> Dear R-devel,
>>
>> Can anyone help me to understand this? It seems that subscripting the rows
>> of a data.frame without actually changing their order, somehow changes an
>> internal representation of row.names that is revealed by e.g.
>> dput/dump/serialize
>>
>> I have read the docs and inspected the (R) code for data.frame, rownames,
>> row.names and dput without enlightenment.
>>
> Look at ?.row_names_info (which is mentioned in the See Also section
> of ?row.names) and its type argument.  Also see the discussion here:
> http://stackoverflow.com/q/26468746/271616
>
>> df=data.frame(a=1:10, b=1)
>> dput(df)
>> df2=df[1:nrow(df), ]
>> # R thinks they are equal (so do I!)
>> all.equal(df, df2)
>> dput(df2)
>>
>> Looking at the output of the dputs
>>
>>> dput(df)
>>
>> structure(list(a = 1:10, b = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)),
>> .Names = c("a", "b"), row.names = c(NA, -10L), class = "data.frame")
>>>
>>> dput(df2)
>>
>> structure(list(a = 1:10, b = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)),
>> .Names = c("a", "b"), row.names = c(NA, 10L), class = "data.frame")
>>
>> we have row.names = c(NA, -10L) in the first case and row.names = c(NA, 10L)
>> in the second, so somehow these objects have a different representation
>>
>> Can anyone explain why? This has come up because
>>
> The first are "automatic".  The second are a compact form of 1:10, as
> mentioned in ?row.names.  I'm not certain of the root cause/reason,
> but the second object will not have "automatic" rownames because you
> have subset it with a non-missing 'i'.
>
>>> library(digest)
>>> digest(df)==digest(df2)
>>
>> [1] FALSE
>>
>> digest uses serialize under the hood, but serialize, dput and dump all show
>> the same effect (I've pasted an example below using dump, md5sum from base
>> R).
>>
>> Many thanks for any enlightenment! More generally is there any way to
>> calculate a digest of a data.frame that could get round this issue or is
>> that not possible?
>>
>> Best wishes,
>>
>> Greg.
>>
>>
>> A digest using base R:
>>
>> library(tools)
>> td=tempfile()
>> dir.create(td)
>> tempfiles=file.path(td,c("df", "df2"))
>> dump("df",tempfiles[1])
>> dump("df2",tempfiles[2])
>> md5sum(tempfiles)
>>
>> # different md5sum
>>
>>> sessionInfo() # for my laptop but also observed on R 3.1.2
>>
>> R version 3.1.1 (2014-07-10)
>> Platform: x86_64-apple-darwin13.1.0 (64-bit)
>>
>> locale:
>> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
>>
>> attached base packages:
>> [1] tools     stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] nat_1.5.14      nat.utils_0.4.2 digest_0.6.4    Rvcg_0.9        devtools_1.6.1  igraph_0.7.1
>> [7] testthat_0.9.1  rgl_0.93.1098
>>
>> loaded via a namespace (and not attached):
>>  [1] codetools_0.2-9   filehash_2.2-2    nabor_0.4.3       parallel_3.1.1    plyr_1.8.1
>>  [6] Rcpp_0.11.3       rstudio_0.98.1062 rstudioapi_0.1    XML_3.98-1.1      yaml_2.1.13
>>
>> --
>> Gregory Jefferis, PhD
>> Division of Neurobiology
>> MRC Laboratory of Molecular Biology
>> Francis Crick Avenue
>> Cambridge Biomedical Campus
>> Cambridge, CB2 OQH, UK
>>
>> http://www2.mrc-lmb.cam.ac.uk/group-leaders/h-to-m/g-jefferis
>> http://jefferislab.org
>> http://flybrain.stanford.edu
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>
>
>
> --
> Joshua Ulrich  |  about.me/joshuaulrich
> FOSS Trading  |  www.fosstrading.com
>


