[R] Keep value labels with data frame manipulation

Thu Jul 13 15:48:55 CEST 2006

At 08:11 13.07.2006 -0500, Frank E Harrell Jr wrote:
>Heinz Tuechler wrote:
>> At 13:14 12.07.2006 -0500, Marc Schwartz (via MN) wrote:
>>> On Wed, 2006-07-12 at 17:41 +0100, Jol, Arne wrote:
>>>> Dear R,
>>>>
>>>> I import data from spss into an R data.frame. On this rawdata I do some
>>>> data processing (selection of observations, normalization, recoding of
>>>> variables etc.). The result is stored in a new data.frame, however, in
>>>> this new data.frame the value labels are lost.
>>>>
>>>> Example of what I do in code:
>>>>
>>>> # read raw data from spss
>>>> rawdata <- read.spss("./data/T50937.SAV",
>>>> 	use.value.labels=FALSE,to.data.frame=TRUE)
>>>>
>>>> # select the observations that we need
>>>> diarydata <- rawdata[rawdata$D22==2 | rawdata$D22==3 | rawdata$D22==17 |
>>>> rawdata$D22==18 | rawdata$D22==20 | rawdata$D22==22 |
>>>>  			rawdata$D22==24 | rawdata$D22==33,]
>>>>
>>>> The result is that rawdata$D22 has value labels and that diarydata$D22
>>>> is numeric without value labels.
>>>>
>>>> Question: How can I prevent this from happening?
>>>>
>>>> Thanks in advance!
>>>> Regards,
>>>> Arne
>>> Two things:
>>>
>>> 1. With respect to your subsetting, your lengthy code can be replaced
>>> with the following:
>>>
>>>  diarydata <- subset(rawdata, D22 %in% c(2, 3, 17, 18, 20, 22, 24, 33))
>>>
>>> See ?subset and ?"%in%" for more information.
>>>
>>>
>>> 2. With respect to keeping the label related attributes, the
>>> 'value.labels' attribute and the 'variable.labels' attribute will not by
>>> default survive the use of "[.data.frame" in R (see ?Extract
>>> and ?"[.data.frame").
>>>
>>> On the other hand, based upon my review of ?read.spss, the SPSS value
>>> labels should be converted to the factor levels of the respective
>>> columns when 'use.value.labels = TRUE' and these would survive a
>>> subsetting.
>>>
>>> If you want to consider a solution to the attribute subsetting issue,
>>> you might want to review the following post by Gabor Grothendieck in
>>> May, which provides a possible solution:
>>>
>>>  https://stat.ethz.ch/pipermail/r-help/2006-May/106308.html
>>>
>>> and this post by me, for an explanation of what is happening in Gabor's
>>> solution:
>>>
>>>  https://stat.ethz.ch/pipermail/r-help/2006-May/106351.html
>>>
>>> HTH,
>>>
>>> Marc Schwartz
>>>
>> Hello Marc and Arne,
>> 
>> I worked on the suggestions of Gabor and Marc and programmed some functions
>> along these lines, but they are very, very preliminary (see below).
>> In my view there is a lack of convenient possibilities in R to document
>> empirical data by variable labels, value labels, etc. I would prefer to
>> have these possibilities in the "standard" configuration.
>> So I sketched a concept, but in my view it would only be useful if there
>> was some acceptance by the core developers of R.
>> 
>> The concept would be to define a class. For now I call it "source.data".
>> To make it more flexible than the Hmisc class "labelled", I would define a
>> related option "source.data.attributes" with default c('value.labels',
>> 'variable.name', 'label'). This option contains all attributes that should
>> persist in subsetting/indexing.
>> 
>> I made only some very, very preliminary tests with these functions, mainly
>> because I am not happy with defining a new class. Instead I would prefer
>> this functionality to be integrated in the Hmisc class "labelled",
>> since this is in my view the best known starting point for data
>> documentation in R.
>> 
>> I would be happy if there were some discussion about the wishes/needs of
>> other R users concerning data documentation.
>> 
>> Greetings,
>> 
>> Heinz
>
>I feel that separating variable labels and value labels and just using 
>factors for value labels works fine, and I would urge you not to create 
>a new system that will not benefit from the many Hmisc functions that 
>use variable labels and units.  [.data.frame in Hmisc keeps all attributes.
>
>Frank
>

Frank,

of course I agree with you about the importance of Hmisc and, as I said, I
do not want to define a new class, but in my view factors are not a good
substitute for value labels.
As the language definition (version 2.3.1 (2006-06-05) Draft, page 7) says:
"Factors are currently implemented using an integer array to specify the
actual levels and a second array of names that are mapped to the integers.
Rather unfortunately users often make use of the implementation in order to
make some calculations easier." 
So, in my view, the levels represent the "values" of the factor.
This is inconvenient if you want to use value labels in different
languages. Furthermore, I do not see a simple method to label numerical
variables. I often encounter discrete but still metric data, e.g. risk
scores. Usually it would be nice to use them in their original coding,
which may include zero or decimal places, and to label them at the same time.
Personally, at the moment I try to solve this problem by following a
suggestion of Martin, Dimitis and others to use names instead. I doubt,
however, that this is a good solution, but at least it makes it possible to
have the source data numerically coded and in this sense "language free"
(see first attempts of functions below).
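
To make the idea concrete, here is a minimal sketch of names used as value
labels on a numeric vector. The scores and both label tables are made-up
illustration data, not taken from any real study, and the sketch uses only
base R rather than the functions further below.

```r
# Hypothetical risk scores in their original numeric coding (made-up values)
score <- c(0, 1.5, 3, 1.5, 0)

# Hypothetical lookup table: names are the labels, elements are the codes
labels.en <- c(low = 0, medium = 1.5, high = 3)

# Attach the labels as names by exact matching on the values
names(score) <- names(labels.en)[match(score, labels.en)]

# The data stay numeric, so arithmetic still works
mean(score)

# Names are carried along when a plain vector is indexed
sub <- score[score > 0]
names(sub)

# Switching the label language only means matching against another table
labels.de <- c(niedrig = 0, mittel = 1.5, hoch = 3)
names(score) <- names(labels.de)[match(score, labels.de)]
```

The codes themselves never change, only the names do, which is what I mean
by "language free".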

Heinz

### These are very preliminary and untested versions.
### They are intended only to demonstrate the concept, not for production
### use.

### function "value.names<-" - version 0.3.0 - 11.7.2006
### function to assign names of elements according to their value
##
##  value.names<-
##  - arguments:
##    - action 
##      - set:           delete any existing names, then set the value
##                       names
##      - add.overwrite: replace both empty and non-empty names with new ones
##      - add:           replace only empty names with new ones
##    - tolerance:       assigns names to the values within the tolerance.
##                       If a value lies within the tolerance range of
##                       several names, the closest one is chosen.
##    - round:           rounds values in value before matching
##                       This may lead to collapsing of different names in
##                       value to one name (and one value)
##    - col.str:         string used when collapsing several names
##    - others:          name for values not named by other names
##    - value:           named vector; its names are assigned to the elements
##                       of x that match its values
##
##  function description:
##  - x must be atomic, preferably numeric or character
##  - if tolerance is given, it must not be NA. tolerance <= 0 is ignored
##  - to ensure consistency, value is processed by value.names()
##  - new.names are built by matching with/without tolerance
##  - new.names are assigned to names depending on argument action
##  - if argument others is given, others-name is assigned to all valid values
##    without name
##

"value.names<-" <- function(x, action='set', tolerance=NULL, round=NULL,
                            col.str=' ', others=NULL, value)
{
  ## checking parameters
  if(!is.atomic(x)) stop('x must be an atomic object')
  if(!is.null(tolerance) &&
     is.na(tolerance)) stop('if given, tolerance must not be NA')
  ## to ensure consistency, process value by value.names
  value <- value.names(value, round=round, col.str=col.str)
  ## delete values with NA-name from value
  value <- value[!is.na(names(value))]
  old.names <- names(x) # store original names
  ## -- building names
  ##    - matching with/without tolerance
  if(!is.null(tolerance) && tolerance > 0 && is.numeric(x))
    ##      - matching with tolerance
    { dif <- abs(outer(x, value, '-'))
      dif[dif>tolerance] <- NA
      ## index of the closest value within tolerance; NA if none is in range
      min.dif <- apply(dif, 1, function(d)
                       if (all(is.na(d))) NA_integer_ else which.min(d))
      new.names <- names(value)[min.dif] }
  else
    ##      - matching without tolerance, i.e. exact matching
    new.names <- names(value)[match( x, value)]
  ##      - matching names for NA-values
  if(length(names(value[is.na(value)]))==1)
    new.names[is.na(x)] <- names(value[is.na(value)])
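  ## assign names depending on action
  ## ('set' keeps new.names as built above; the other actions need old names)
  if (!is.null(old.names)) {
    ## an old name counts as present only if it is neither NA nor ''
    has.old <- !is.na(old.names) & old.names != ''
    ## add.overwrite: keep an old name only where no new name was matched
    if (action=='add.overwrite') new.names[is.na(new.names) & has.old] <-
      old.names[is.na(new.names) & has.old]
    ## add: keep every existing old name, fill only the empty ones
    if (action=='add') new.names[has.old] <- old.names[has.old]
  }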
  ## assigning others-name to all valid values without name
  if (!is.null(others)) new.names[!is.na(x) & is.na(new.names)] <-
    as.character(others)
  names(x) <- new.names
  return(x)
}
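
The tolerance matching used above can be tried in isolation. The following
is only a sketch of that one step, with made-up values, a made-up label
table and an arbitrarily chosen tolerance of 0.05; it is not a replacement
for the function.

```r
# Made-up measurements and a hypothetical label table
x <- c(0.98, 2.01, 3.4, NA)
labels <- c(one = 1, two = 2, three = 3)
tolerance <- 0.05   # arbitrary value for illustration

# Absolute differences: one row per element of x, one column per label
dif <- abs(outer(x, labels, "-"))

# Differences outside the tolerance do not count as a match
dif[dif > tolerance] <- NA

# Per row: index of the closest label, or NA if no label is in range
nearest <- apply(dif, 1, function(d) {
  if (all(is.na(d))) NA_integer_ else which.min(d)
})

names(x) <- names(labels)[nearest]
names(x)
```

Elements without a label in range, and NA elements, simply end up with an
NA name here.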

### function value.names - version 0.3.0 - 11.7.2006
### function to return names of elements according to their value
##
##  - arguments:
##    - x         source vector with names for (some) elements
##                x must be atomic.
##                If x is a factor, value will be a factor. Consequently
##                names are only seen, if unclass() or print.default is used.
##    - col.str:         string used when collapsing several names
##                       default: " "
##    - round:           rounds values in x
##                       This may lead to collapsing of different names for
##                       one value of x to one name (and one value)
##
##  - value:
##  - vector of the same class as x with sorted unique values and their names,
##    NULL, if x is NULL
##    - NA-values in x appear at the end
##    - if there is a 1:1 relation between values and names in x, value
##      contains all unique combinations of value and name.
##    - if identical values in x have different (non-NA) names, these names
##      get collapsed to one new name, separated by the string col.str.
##      This applies also to NA-values in x with different names.
##    - NA-names get suppressed, if non-NA-names for the same x-value exist.
##    - Different values in x with identical names remain separate.
##    - values in x without name appear in value with name NA

value.names <- function(x, col.str=' ', round=NULL) {
  ## checking parameters
  if(!is.atomic(x)) stop('x must be an atomic object')
## -- define function for pasting unique non empty names
  pasteunique <- function(names.i, col.str)
    { names.i <- sort(unique(names.i))
      names.i <- names.i[!names.i=='' & !is.na(names.i)] # exclude ''
      if (length(names.i))
        names.i <- paste(names.i, sep='', collapse=col.str)
      else names.i <- NA
      invisible(names.i)
    }
  ## branching: if x is.null or has no names
  if (is.null(x)) {
    return(NULL) }
  else {
    x <- sort(x, na.last=TRUE) # sort x
    if (!is.null(round)) x <- round(x, round)
    ## vector of unique values
    values <- unique(x)
    ## names per value (pre-allocated, so empty input yields empty names)
    nam <- rep(NA_character_, length(values))
    for (i in seq(along=values)) {
      names.i <- names(x)[x==values[i]]
      if (!is.null(names.i)) nam[i] <- pasteunique(names.i, col.str)
      else nam[i] <- NA
    }
    ## names for NA
    if (length(values) && is.na(values[length(values)]))
      { names.i <- names(x)[is.na(x)]
        nam[length(values)] <- pasteunique(names.i, col.str)
      }
    names(values) <- nam
    return(values)
  }
}

### function factvn - version 0.3.0 - 11.7.2006
### function to build a factor from vector with named elements
##
##  function description:
##  - if fromvaluesnames is not given factvn calls factor
##  - if fromvaluesnames is in c('values', 'names') a factor based on
##    names(x) is constructed
##
##  - arguments:
##    - x         source vector with names for (some) elements
##                x must be numeric or character.
##    - fromvaluesnames:
##      - fromvaluesnames='values': levels are ordered according to the
##        values of x
##      - fromvaluesnames='names': levels are ordered according to the names
##        of x
##    - ordered:
##      - fromvaluesnames is not given: ordered=is.ordered(x)
##      - fromvaluesnames='values': ordered=TRUE
##      - fromvaluesnames='names': ordered=FALSE
##
##  - value:
##  - if fromvaluesnames is not given see factor
##  - if fromvaluesnames is in c('values', 'names') a factor based on
##    names(x) is constructed. All x-values without names are NA.
##    The (final) levels of value are the unique(names(x)).

factvn <- function (x = character(), levels = sort(unique.default(x),
                    na.last = TRUE), labels = levels, exclude = NA,
                    ordered = is.ordered(x), fromvaluesnames=NULL)
{
  ## set ordered depending on fromvaluesnames
  if (!missing(fromvaluesnames))
    if (missing(ordered)) {
      if (fromvaluesnames=='values') ord <- TRUE
      if (fromvaluesnames=='names') ord <- FALSE
    } else ord <- ordered
  if (!missing(fromvaluesnames)) {
    if (fromvaluesnames=='values')
      fx <- factor(names(x), levels=unique(names(value.names(x))),
                   exclude=exclude, ordered=ord)
    if (fromvaluesnames=='names')
      fx <- factor(names(x), levels=sort(unique(names(value.names(x)))),
                   exclude=exclude, ordered=ord)
  } else  fx <- factor(x, levels=levels, labels=labels, exclude=exclude,
                       ordered=ordered)
  return(fx)
}
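
The two level orderings can also be shown without the helper functions.
This is a sketch in base R only, assuming a made-up vector in which every
element has a name; it does not cover unnamed or NA elements.

```r
# Hypothetical numeric codes with their value labels stored as names
x <- c(high = 3, low = 0, medium = 1.5, low = 0)

# Levels ordered by the underlying values: sort by value, take each name once
lev.by.value <- unique(names(sort(x)))
f.values <- factor(names(x), levels = lev.by.value, ordered = TRUE)

# Levels ordered alphabetically by the names themselves
f.names <- factor(names(x), levels = sort(unique(names(x))))

levels(f.values)
levels(f.names)
```

In both cases the factor is built from names(x), so the numeric coding
stays in x and the factor is only a derived view of it.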

>> 
...snip...

>-- 
>Frank E Harrell Jr   Professor and Chair           School of Medicine
>                      Department of Biostatistics   Vanderbilt University
>