[R] Keep value labels with data frame manipulation

Thu Jul 13 18:02:01 CEST 2006

Heinz Tuechler wrote:
> At 08:11 13.07.2006 -0500, Frank E Harrell Jr wrote:
>> Heinz Tuechler wrote:
>>> At 13:14 12.07.2006 -0500, Marc Schwartz (via MN) wrote:
>>>> On Wed, 2006-07-12 at 17:41 +0100, Jol, Arne wrote:
>>>>> Dear R,
>>>>>
>>>>> I import data from SPSS into an R data.frame. On this raw data I do some
>>>>> data processing (selection of observations, normalization, recoding of
>>>>> variables, etc.). The result is stored in a new data.frame; however, in
>>>>> this new data.frame the value labels are lost.
>>>>>
>>>>> Example of what I do in code:
>>>>>
>>>>> # read raw data from spss
>>>>> rawdata <- read.spss("./data/T50937.SAV",
>>>>> 	use.value.labels=FALSE,to.data.frame=TRUE)
>>>>>
>>>>> # select the observations that we need
>>>>> diarydata <- rawdata[rawdata$D22==2 | rawdata$D22==3 | rawdata$D22==17 |
>>>>> rawdata$D22==18 | rawdata$D22==20 | rawdata$D22==22 |
>>>>>  			rawdata$D22==24 | rawdata$D22==33,]
>>>>>
>>>>> The result is that rawdata$D22 has value labels and that diarydata$D22
>>>>> is numeric without value labels.
>>>>>
>>>>> Question: How can I prevent this from happening?
>>>>>
>>>>> Thanks in advance!
>>>>> Groeten,
>>>>> Arne
>>>> Two things:
>>>>
>>>> 1. With respect to your subsetting, your lengthy code can be replaced
>>>> with the following:
>>>>
>>>>  diarydata <- subset(rawdata, D22 %in% c(2, 3, 17, 18, 20, 22, 24, 33))
>>>>
>>>> See ?subset and ?"%in%" for more information.
>>>>
>>>>
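A small sketch of the point above, using made-up data (only the column name D22 comes from the original question). It also shows one practical difference from a chain of == comparisons: %in% returns FALSE rather than NA for missing values, so no all-NA rows end up in the result.

```r
# Toy stand-in for the imported SPSS data; the values are invented.
rawdata <- data.frame(id = 1:8, D22 = c(2, 5, 17, 33, 1, 20, NA, 24))

wanted <- c(2, 3, 17, 18, 20, 22, 24, 33)

# Compact form with subset() and %in%.
diary_subset <- subset(rawdata, D22 %in% wanted)

# The same selection written with "[".
diary_index <- rawdata[rawdata$D22 %in% wanted, ]

# A chain of == comparisons yields NA for the missing D22 value,
# and indexing with NA produces an extra row filled with NA.
diary_chain <- rawdata[rawdata$D22 == 2 | rawdata$D22 == 17 |
                       rawdata$D22 == 20 | rawdata$D22 == 24 |
                       rawdata$D22 == 33, ]
```

Here diary_subset and diary_index select the same five rows, while diary_chain carries one additional NA row.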
>>>> 2. With respect to keeping the label related attributes, the
>>>> 'value.labels' attribute and the 'variable.labels' attribute will not by
>>>> default survive the use of "[.data.frame" in R (see ?Extract
>>>> and ?"[.data.frame").
>>>>
>>>> On the other hand, based upon my review of ?read.spss, the SPSS value
>>>> labels should be converted to the factor levels of the respective
>>>> columns when 'use.value.labels = TRUE' and these would survive a
>>>> subsetting.
>>>>
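The difference between the two storage forms can be seen in a minimal sketch. The codes and label texts below are invented for illustration; only the attribute name 'value.labels' is taken from the discussion above.

```r
codes  <- c(2, 3, 17, 2)
labels <- c(A = 2, B = 3, C = 17)   # hypothetical value labels

# Variant 1: numeric column with the labels kept in an attribute.
num <- codes
attr(num, "value.labels") <- labels

# Variant 2: factor whose levels carry the labels.
fac <- factor(codes, levels = labels, labels = names(labels))

df <- data.frame(id = 1:4)
df$num <- num
df$fac <- fac

# Row subsetting goes through "[" on each column.
sub <- df[df$num %in% c(2, 3), ]

attr(df$num, "value.labels")    # still present on the full data
attr(sub$num, "value.labels")   # NULL: the attribute did not survive
levels(sub$fac)                 # factor levels did survive
```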
>>>> If you want to consider a solution to the attribute subsetting issue,
>>>> you might want to review the following post by Gabor Grothendieck in
>>>> May, which provides a possible solution:
>>>>
>>>>  https://stat.ethz.ch/pipermail/r-help/2006-May/106308.html
>>>>
>>>> and this post by me, for an explanation of what is happening in Gabor's
>>>> solution:
>>>>
>>>>  https://stat.ethz.ch/pipermail/r-help/2006-May/106351.html
>>>>
>>>> HTH,
>>>>
>>>> Marc Schwartz
>>>>
>>> Hello Marc and Arne,
>>>
>>> I worked on the suggestions of Gabor and Marc and programmed some functions
>>> in this way, but they are very, very preliminary (see below).
>>> In my view there is a lack of convenient possibilities in R to document
>>> empirical data by variable labels, value labels, etc. I would prefer to
>>> have these possibilities in the "standard" configuration.
>>> So I sketched a concept, but in my view it would only be useful, if there
>>> was some acceptance by the core developers of R.
>>>
>>> The concept would be to define a class. For now I call it "source.data".
>>> To make it more flexible than the Hmisc class "labelled" I would define a
>>> related option "source.data.attributes" with default c('value.labels',
>>> 'variable.name', 'label'). This option contains all attributes that should
>>> persist in subsetting/indexing.
>>>
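One way such a class could behave is sketched below. This is only an illustration of the concept, not the functions posted in this thread: the class name and the option name follow the proposal above, while the helper persistent_attrs() and the example data are assumptions made for the sketch.

```r
# Attributes that should survive indexing. The option name and its default
# follow the proposal above; persistent_attrs() is a hypothetical helper.
persistent_attrs <- function() {
  getOption("source.data.attributes",
            c("value.labels", "variable.name", "label"))
}

# "[" method: index the bare vector, then copy the listed attributes
# and the class back onto the result.
"[.source.data" <- function(x, ...) {
  out <- unclass(x)[...]
  for (a in persistent_attrs()) attr(out, a) <- attr(x, a)
  class(out) <- class(x)
  out
}

# Invented example data.
score <- structure(c(0, 1, 2, 1),
                   value.labels = c(none = 0, mild = 1, severe = 2),
                   label = "Risk score",
                   class = "source.data")

part <- score[2:3]
attr(part, "value.labels")   # the labels are still attached
```

Changing the option, for example with options(source.data.attributes = "label"), would restrict which attributes are carried along.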
>>> I made only some very, very preliminary tests with these functions, mainly
>>> because I am not happy with defining a new class. Instead I would prefer,
>>> if this functionality could be integrated in the Hmisc class "labelled",
>>> since this is in my view the best known starting point for data
>>> documentation in R.
>>>
>>> I would be happy, if there were some discussion about the wishes/needs of
>>> other Rusers concerning data documentation.
>>>
>>> Greetings,
>>>
>>> Heinz
>> I feel that separating variable labels and value labels and just using 
>> factors for value labels works fine, and I would urge you not to create 
>> a new system that will not benefit from the many Hmisc functions that 
>> use variable labels and units.  [.data.frame in Hmisc keeps all attributes.
>>
>> Frank
>>
> 
> Frank,
> 
> of course I agree with you about the importance of Hmisc and as I said, I
> do not want to define a new class, but in my view factors are no good
> substitute for value labels.
> As the language definition (version 2.3.1 (2006-06-05) Draft, page 7) says:
> "Factors are currently implemented using an integer array to specify the
> actual levels and a second array of names that are mapped to the integers.
> Rather unfortunately users often make use of the implementation in order to
> make some calculations easier." 
> So, in my view, the levels represent the "values" of the factor.
> This has drawbacks if you want to use value labels in different
> languages. Further I do not see a simple method to label numerical
> variables. I often encounter discrete, but still metric data, as e.g. risk
> scores. Usually it would be nice to use them in their original coding,
> which may include zero or decimal places and to label them at the same time.
> Personally at the moment I try to solve this problem by following a
> suggestion of Martin, Dimitis and others to use names instead. I doubt,
> however, that this is a good solution, but at least it makes it possible to
> have the source data numerically coded and in this sense "language free"
> (see first attempts of functions below).
> 
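The idea of using names as value labels can be sketched in a few lines. The scores and label texts are invented; the point is only that the data stay numeric, including zero and decimal codes, while each element carries its label as a name.

```r
# Hypothetical label table: names are the labels, values are the codes.
value_labels <- c(none = 0, borderline = 0.5, high = 1)

risk <- c(0, 1, 0.5, 0, 1)

# Attach the label of each value as the element name (exact matching).
names(risk) <- names(value_labels)[match(risk, value_labels)]

# The vector is still numeric, so arithmetic works on the original coding.
mean_risk <- mean(risk)

# Names survive ordinary subsetting, unlike arbitrary attributes.
high_only <- risk[risk == 1]
names(high_only)
```

Swapping in a label table in another language only changes the names, not the stored values.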
> Heinz
> 
Those are excellent points Heinz.  I addressed that problem partially in 
sas.get - see the sascodes attribute.

Frank

> 
> ### These are very preliminary and untested versions.
> ### They are intended only to demonstrate the concept, not for production
> ### use.
> 
> ### function "value.names<-" - version 0.3.0 - 11.7.2006
> ### function to assign names of elements according to their value
> ##
> ##  value.names<-
> ##  - arguments:
> ##    - action 
> ##      - set:           delete any existing names, then set the value
> ##                       names
> ##      - add.overwrite: replace both empty and non-empty names with new ones
> ##      - add:           replace only empty names with new ones
> ##    - tolerance:       assigns names to the values that lie within the
> ##                       tolerance. If a value lies within the tolerance
> ##                       range of several names, the closest one is chosen.
> ##    - round:           rounds values in value before matching
> ##                       This may lead to collapsing of different names in
> ##                       value to one name (and one value)
> ##    - col.str:         string used when collapsing several names
> ##    - others:          name for values not named by other names
> ##    - value:
> ##      
> ##
> ##  function description:
> ##  - x must be atomic, preferably numeric or character
> ##  - if tolerance is given, it must not be NA. tolerance < 0 is ignored
> ##  - to ensure consistency, value is processed by value.names()
> ##  - new.names are built by matching with/without tolerance
> ##  - new.names are assigned to names depending on argument action
> ##  - if argument others is given, others-name is assigned to all valid values
> ##    without name
> ##
> 
> "value.names<-" <- function(x, action='set', tolerance=NULL, round=NULL,
>                             col.str=' ', others=NULL, value)
> {
>   ## checking parameters
>   if(!is.atomic(x)) stop('x must be an atomic object')
>   if(!is.null(tolerance) &&
>      is.na(tolerance)) stop('if given, tolerance must not be NA')
>   ## to ensure consistency, process value by value.names
>   value <- value.names(value, round=round, col.str=col.str)
>   ## delete values with NA-name from value
>   value <- value[!is.na(names(value))]
>   old.names <- names(x) # store original names
>   ## -- building names
>   ##    - matching with/without tolerance
>   if(!is.null(tolerance) && tolerance > 0 && is.numeric(x))
>     ##      - matching with tolerance
>     { dif <- abs(outer(x, value, '-'))
>       dif[dif>tolerance] <- NA
>       within.tolerance <- apply(dif, 1, function(x) sum(!is.na(x)))
>       ## rows without any candidate within tolerance give NA; min() warns
>       ## for those rows, so the warnings are suppressed locally
>       min.dif <- suppressWarnings(
>         apply(dif, 1, function(x) which(x==min(x, na.rm=TRUE))[1]))
>       new.names <- names(value)[min.dif] }
>   else
>     ##      - matching without tolerance, i.e. exact matching
>     new.names <- names(value)[match( x, value)]
>   ##      - matching names for NA-values
>   if(length(names(value[is.na(value)]))==1)
>     new.names[is.na(x)] <- names(value[is.na(value)])
>   ## assign names depending on action
>   if (action=='set') new.names <- new.names
>   if (action=='add.overwrite') new.names[is.na(new.names)] <-
>     old.names[is.na(new.names)]
>   if (action=='add') new.names[!is.na(old.names)] <-
>     old.names[!is.na(old.names)]
>   ## assigning others-name to all valid values without name
>   if (!is.null(others)) new.names[!is.na(x) & is.na(new.names)] <-
>     as.character(others)
>   names(x) <- new.names
>   return(x)
> }
> 
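The tolerance matching used in "value.names<-" can be shown in isolation. The following is a simplified, standalone sketch of that one step only; nearest_label() is a hypothetical helper, not part of the posted code, and the scores and labels are invented.

```r
# Hypothetical helper: for each element of x, return the name of the
# closest label value, or NA if no label lies within the tolerance.
nearest_label <- function(x, labels, tolerance) {
  # |x_i - label_j| for every pair, one row per element of x
  dif <- abs(outer(x, labels, "-"))
  # discard candidates that are too far away
  dif[which(dif > tolerance)] <- NA
  idx <- vapply(seq_len(nrow(dif)), function(i) {
    d <- dif[i, ]
    if (all(is.na(d))) NA_integer_ else unname(which.min(d))
  }, integer(1))
  names(labels)[idx]
}

value_labels <- c(low = 0, medium = 0.5, high = 1)
scores <- c(0.02, 0.48, 0.97, 0.25)

matched <- nearest_label(scores, value_labels, tolerance = 0.05)
```

With these inputs the first three scores pick up "low", "medium" and "high", and 0.25 stays unlabelled because no label value lies within 0.05 of it.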
> 
> ### function value.names - version 0.3.0 - 11.7.2006
> ### function to return names of elements according to their value
> ##
> ##  - arguments:
> ##    - x         source vector with names for (some) elements
> ##                x must be atomic.
> ##                If x is a factor, value will be a factor. Consequently
> ##                names are only seen, if unclass() or print.default is used.
> ##    - col.str:         string used when collapsing several names
> ##                       default: " " (a single space)
> ##    - round:           rounds values in x
> ##                       This may lead to collapsing of different names for
> ##                       one value of x to one name (and one value)
> ##
> ##  - value:
> ##  - vector of the same class as x with sorted unique values and their names,
> ##    NULL, if x is NULL
> ##    - NA-values in x appear at the end
> ##    - if there is a 1:1 relation between values and names in x, value
> ##      contains all unique combinations of value and name.
> ##    - if identical values in x have different (non-NA) names, these names
> ##      get collapsed to one new name, separated by the string col.str.
> ##      This also applies to NA-values in x with different names.
> ##    - NA-names get suppressed, if non-NA-names for the same x-value exist.
> ##    - Different values in x with identical names remain separate.
> ##    - values in x without name appear in value with name NA
> 
> value.names <- function(x, col.str=' ', round=NULL) {
>   ## checking parameters
>   if(!is.atomic(x)) stop('x must be an atomic object')
> ## -- define function for pasting unique non empty names
>   pasteunique <- function(names.i, col.str)
>     { names.i <- sort(unique(names.i))
>       names.i <- names.i[!names.i=='' & !is.na(names.i)] # exclude ''
>       if (length(names.i))
>         names.i <- paste(names.i, sep='', collapse=col.str)
>       else names.i <- NA
>       invisible(names.i)
>     }
>   ## branching: if x is.null or has no names
>   if (is.null(x)) {
>     return(NULL) }
>   else {
>     x <- sort(x, na.last=TRUE) # sort x
>     if (!is.null(round)) x <- round(x, round)
>     ## vector of unique values (x is already sorted with NA last;
>     ## unique() has no na.last argument)
>     values <- unique(x)
>     ## names per value
>     nam <- NA
>     for (i in seq(along=values)) {
>       names.i <- names(x)[x==values[i]]
>       if (!is.null(names.i)) nam[i] <- pasteunique(names.i, col.str)
>       else nam[i] <- NA
>     }
>     ## names for NA
>     if (is.na(values[length(values)]))
>       { names.i <- names(x)[is.na(x)]
>         nam[length(values)] <- pasteunique(names.i, col.str)
>       }
>     names(values) <- nam
>     return(values)
>   }
> }
> 
> 
> ### function factvn - version 0.3.0 - 11.7.2006
> ### function to build a factor from vector with named elements
> ##
> ##  function description:
> ##  - if fromvaluesnames is not given factvn calls factor
> ##  - if fromvaluesnames is in c('values', 'names') a factor based on
> ##    names(x) is constructed
> ##
> ##  - arguments:
> ##    - x         source vector with names for (some) elements
> ##                x must be numeric or character.
> ##    - fromvaluesnames:
> ##      - fromvaluesnames='values': levels are ordered according to the
> ##        values of x
> ##      - fromvaluesnames='names': levels are ordered according to the names
> ##        of x
> ##    - ordered:
> ##      - fromvaluesnames is not given: ordered=is.ordered(x)
> ##      - fromvaluesnames='values': ordered=TRUE
> ##      - fromvaluesnames='names': ordered=FALSE
> ##
> ##  - value:
> ##  - if fromvaluesnames is not given see factor
> ##  - if fromvaluesnames is in c('values', 'names') a factor based on
> ##    names(x) is constructed. All x-values without names are NA.
> ##    The (final) levels of value are the unique(names(x)).
> 
> factvn <- function (x = character(), levels = sort(unique.default(x),
>                     na.last = TRUE), labels = levels, exclude = NA,
>                     ordered = is.ordered(x), fromvaluesnames=NULL)
> {
>   ## set ordered depending on fromvaluesnames
>   if (!missing(fromvaluesnames))
>     if (missing(ordered)) {
>       if (fromvaluesnames=='values') ord <- TRUE
>       if (fromvaluesnames=='names') ord <- FALSE
>     } else ord <- ordered
>   if (!missing(fromvaluesnames)) {
>     if (fromvaluesnames=='values')
>       fx <- factor(names(x), levels=unique(names(value.names(x))),
>                    exclude=exclude, ordered=ord)
>     if (fromvaluesnames=='names')
>       fx <- factor(names(x), levels=sort(unique(names(value.names(x)))),
>                    exclude=exclude, ordered=ord)
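>     ## exclude and ordered are passed by name: the fourth positional
>     ## argument of factor() is exclude, not ordered
>   } else  fx <- factor(x, levels, labels, exclude=exclude,
>                        ordered=ordered)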
>   return(fx)
> }
> 
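The core of what factvn does with fromvaluesnames can be reduced to a few lines. This is a standalone sketch with invented data, not a replacement for the function above: the names of a vector become the factor, and the level order is taken either from the underlying values or from the names themselves.

```r
# Invented example: numeric codes with their labels as names.
x <- c(high = 3, low = 1, mid = 2, low = 1)

# Levels ordered by the underlying values (compare fromvaluesnames='values').
by_values <- factor(names(x),
                    levels = unique(names(sort(x))),
                    ordered = TRUE)

# Levels ordered alphabetically by name (compare fromvaluesnames='names').
by_names <- factor(names(x), levels = sort(unique(names(x))))

levels(by_values)   # "low" "mid" "high"
levels(by_names)    # "high" "low" "mid"
```

Ordering by value keeps the numeric meaning of the codes visible in the level order, which an alphabetical ordering of the labels would lose.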
> 
> ...snip...
> 
>