[R] Keep value labels with data frame manipulation

Heinz Tuechler tuechler at gmx.at
Fri Jul 14 21:20:55 CEST 2006


At 11:02 13.07.2006 -0500, Frank E Harrell Jr wrote:
>Heinz Tuechler wrote:
>> At 08:11 13.07.2006 -0500, Frank E Harrell Jr wrote:
>>> Heinz Tuechler wrote:
>>>> At 13:14 12.07.2006 -0500, Marc Schwartz (via MN) wrote:
>>>>> On Wed, 2006-07-12 at 17:41 +0100, Jol, Arne wrote:
>>>>>> Dear R,
>>>>>>
>>>>>> I import data from SPSS into an R data.frame. On this raw data I do some
>>>>>> data processing (selection of observations, normalization, recoding of
>>>>>> variables etc.). The result is stored in a new data.frame; however, in
>>>>>> this new data.frame the value labels are lost.
>>>>>>
>>>>>> Example of what I do in code:
>>>>>>
>>>>>> # read raw data from spss
>>>>>> rawdata <- read.spss("./data/T50937.SAV",
>>>>>> 	use.value.labels=FALSE,to.data.frame=TRUE)
>>>>>>
>>>>>> # select the observations that we need
>>>>>> diarydata <- rawdata[rawdata$D22==2 | rawdata$D22==3 | rawdata$D22==17 |
>>>>>> rawdata$D22==18 | rawdata$D22==20 | rawdata$D22==22 |
>>>>>>  			rawdata$D22==24 | rawdata$D22==33,]
>>>>>>
>>>>>> The result is that rawdata$D22 has value labels and that diarydata$D22
>>>>>> is numeric without value labels.
>>>>>>
>>>>>> Question: How can I prevent this from happening?
>>>>>>
>>>>>> Thanks in advance!
>>>>>> Groeten,
>>>>>> Arne
>>>>> Two things:
>>>>>
>>>>> 1. With respect to your subsetting, your lengthy code can be replaced
>>>>> with the following:
>>>>>
>>>>>  diarydata <- subset(rawdata, D22 %in% c(2, 3, 17, 18, 20, 22, 24, 33))
>>>>>
>>>>> See ?subset and ?"%in%" for more information.
>>>>>
>>>>>
>>>>> 2. With respect to keeping the label related attributes, the
>>>>> 'value.labels' attribute and the 'variable.labels' attribute will not by
>>>>> default survive the use of "[.data.frame" in R (see ?Extract
>>>>> and ?"[.data.frame").
>>>>>
>>>>> On the other hand, based upon my review of ?read.spss, the SPSS value
>>>>> labels should be converted to the factor levels of the respective
>>>>> columns when 'use.value.labels = TRUE' and these would survive a
>>>>> subsetting.
>>>>>
>>>>> If you want to consider a solution to the attribute subsetting issue,
>>>>> you might want to review the following post by Gabor Grothendieck in
>>>>> May, which provides a possible solution:
>>>>>
>>>>>  https://stat.ethz.ch/pipermail/r-help/2006-May/106308.html
>>>>>
>>>>> and this post by me, for an explanation of what is happening in Gabor's
>>>>> solution:
>>>>>
>>>>>  https://stat.ethz.ch/pipermail/r-help/2006-May/106351.html
>>>>>
>>>>> HTH,
>>>>>
>>>>> Marc Schwartz
>>>>>
>>>> Hello Marc and Arne,
>>>>
>>>> I worked on the suggestions of Gabor and Marc and programmed some functions
>>>> in this way, but they are very, very preliminary (see below).
>>>> In my view there is a lack of convenient possibilities in R to document
>>>> empirical data by variable labels, value labels, etc. I would prefer to
>>>> have these possibilities in the "standard" configuration.
>>>> So I sketched a concept, but in my view it would only be useful, if there
>>>> was some acceptance by the core developers of R.
>>>>
>>>> The concept would be to define a class. For now I call it "source.data".
>>>> To make it more flexible than the Hmisc class "labelled", I would define a
>>>> related option "source.data.attributes" with default c('value.labels',
>>>> 'variable.name', 'label'). This option contains all attributes that should
>>>> persist in subsetting/indexing.
>>>>
>>>> I made only some very, very preliminary tests with these functions, mainly
>>>> because I am not happy with defining a new class. Instead I would prefer,
>>>> if this functionality could be integrated in the Hmisc class "labelled",
>>>> since this is in my view the best known starting point for data
>>>> documentation in R.
>>>>
>>>> I would be happy if there were some discussion about the wishes/needs of
>>>> other R users concerning data documentation.
>>>>
>>>> Greetings,
>>>>
>>>> Heinz
>>> I feel that separating variable labels and value labels and just using 
>>> factors for value labels works fine, and I would urge you not to create 
>>> a new system that will not benefit from the many Hmisc functions that 
>>> use variable labels and units.  [.data.frame in Hmisc keeps all attributes.
>>>
>>> Frank
>>>
>> 
>> Frank,
>> 
>> of course I agree with you about the importance of Hmisc and as I said, I
>> do not want to define a new class, but in my view factors are no good
>> substitute for value labels.
>> As the language definition (version 2.3.1 (2006-06-05) Draft, page 7) says:
>> "Factors are currently implemented using an integer array to specify the
>> actual levels and a second array of names that are mapped to the integers.
>> Rather unfortunately users often make use of the implementation in order to
>> make some calculations easier." 
>> So, in my view, the levels represent the "values" of the factor.
>> This has inconveniences if you want to use value labels in different
>> languages. Further, I do not see a simple method to label numerical
>> variables. I often encounter discrete, but still metric data, e.g. risk
>> scores. Usually it would be nice to use them in their original coding,
>> which may include zero or decimal places, and to label them at the same time.
>> Personally, at the moment I try to solve this problem by following a
>> suggestion of Martin, Dimitris and others to use names instead. I doubt,
>> however, that this is a good solution, but at least it makes it possible to
>> have the source data numerically coded and in this sense "language free"
>> (see first attempts of functions below).
>> 
>> Heinz
>> 
>Those are excellent points Heinz.  I addressed that problem partially in 
>sas.get - see the sascodes attribute.
>
>Frank
>

Frank, I looked at your function sas.get. You solved the problem with a lot
of effort. Don't you think that it would be easier to create just one new
class, say "documented", which offers the possibility to represent the
original data as it is and to add all the useful descriptions like variable
labels, value labels, units, special missing values, and maybe others?
If I remember correctly, SAS, SPSS and BMDP have offered these possibilities
for many years, and in my view for good reason. I have been thinking about
these questions since I started using R about two years ago, and I wonder why
there seems to be so little interest in them.
In my work good documentation of the _unchanged_ data is very important,
also because it eases checking the data for errors.
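
To make the idea concrete, here is a minimal sketch of how such a class
might behave, assuming a plain S3 implementation. Only the class name
"documented" comes from the text above; the constructor, the attribute names
"label", "value.labels" and "units", and the example data are illustrative
assumptions, not an existing API. The only point is that a "[" method can
carry the descriptive attributes through indexing, so that row selection on
a data frame keeps them while the data stay in their original numeric coding.

```r
# Hypothetical constructor: attaches descriptive attributes to a vector
# without changing the stored values.
documented <- function(x, label = NULL, value.labels = NULL, units = NULL) {
  attr(x, "label") <- label
  attr(x, "value.labels") <- value.labels
  attr(x, "units") <- units
  oldClass(x) <- c("documented", oldClass(x))
  x
}

# Subsetting method: let the default "[" do the indexing, then copy the
# descriptive attributes and the class back onto the result.
"[.documented" <- function(x, ...) {
  y <- NextMethod("[")
  for (a in c("label", "value.labels", "units")) {
    attr(y, a) <- attr(x, a, exact = TRUE)
  }
  oldClass(y) <- oldClass(x)
  y
}

# A discrete but metric risk score, kept in its original coding.
score <- documented(c(0, 1.5, 3, 1.5),
                    label = "Risk score",
                    value.labels = c(low = 0, medium = 1.5, high = 3))

df <- data.frame(id = 1:4)
df$score <- score

# Row selection calls "[" on each column, so the method above is used.
sub <- df[as.numeric(df$score) >= 1.5, ]
attr(sub$score, "value.labels")

# For comparison: the same attribute on a plain vector is dropped by "[".
plain <- c(0, 1.5, 3, 1.5)
attr(plain, "value.labels") <- c(low = 0, medium = 1.5, high = 3)
is.null(attr(plain[2:3], "value.labels"))
```

This is the same mechanism that lets factors and the Hmisc class "labelled"
survive row selection; the sketch does not cover printing, combining with
c(), or special missing values.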

Heinz


>> ...snip...


