[R] variable labels to accompany data.frame

David Winsemius dwinsemius at comcast.net
Wed Oct 28 19:12:59 CET 2009


I find that Harrell's describe() (Hmisc) provides some of that desired
functionality. When I am creating a paper codebook I will print the
results of the describe function for a dataframe to create an overview
snapshot and will post a copy of str(dfname) on the wall.

As his help page says:
"describe is especially useful for describing data frames created by  
*.get, as labels, formats, value labels, and (in the case of sas.get)  
frequencies of special missing values are printed."

I believe that Frank has developed some functions to replicate SAS's  
subtyping of NA values, although I have not explored such facilities.   
I also find that summary(dfname) provides some useful information that  
describe does not.
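
As a rough illustration of the three views mentioned above, the sketch
below builds a small made-up data frame and calls str() and summary()
from base R. The names dfname, bmi2 and sex and all the values are
invented placeholders. The Hmisc part is guarded because it assumes that
package is installed; label() and describe() are the Hmisc functions
referred to above.

```r
# Toy data frame; all names and values here are invented for illustration.
dfname <- data.frame(
  bmi2 = c(22.5, 31.0, NA, 27.4),
  sex  = factor(c("F", "M", "F", "M"))
)

# Base R: compact structure listing and per-column summaries.
str(dfname)
print(summary(dfname))

# Hmisc (only if installed): attach a variable label, then describe().
if (requireNamespace("Hmisc", quietly = TRUE)) {
  Hmisc::label(dfname$bmi2) <- "BMI hand-input by clinic personnel, must be checked"
  print(Hmisc::describe(dfname))
}
```
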
-- 
David.


On Oct 28, 2009, at 1:27 PM, Jacob Wegelin wrote:

>
> Often it is useful to keep a "codebook" to document the contents of  
> a dataset. (By "dataset" I mean
> a rectangular structure such as a dataframe.)
>
> The codebook has as many rows as the dataset has columns (variables,  
> fields).  The columns (fields)
> of the codebook may include:
>
> 	•       variable name
>
> 	•       type (character, factor, integer, etc)
>
> 	•       variable label (e.g., a variable called "bmi2" might be labeled
> 	"BMI hand-input by clinic personnel, must be checked")
>
> 	•       permissible values
>
> 	•       which values indicate missing (and potentially different  
> kinds of missing)
>
> Some statistics software (e.g., SPSS and Stata) provides at least a  
> subset of this kind of
> information automatically in a convenient form. For instance, in  
> Stata one can define a "label" for
> a variable and it is thenceforth linked to the variable. In output  
> from certain modeling and
> graphics functions, Stata by default uses the label rather than the  
> variable name.
>
> Furthermore: In Stata, if "myvariable" is labeled numeric (in R
> lingo, a factor), and I type
>
> codebook myvariable
>
> then Stata tells me, among other things, the "levels" of myvariable.
>
> Does a tool of this sort exist in R?
>
> The prompt() function is related to this, but prompt(someDataFrame)  
> creates a text file on disk. The
> text file is associated with, but not unambiguously linked to,  
> someDataFrame.
>
> The epicalc function codebook() provides a summary of a dataframe  
> similar to that created by
> summary() but easier to read. But this is not a way to define and  
> keep track of labels that are
> linked to variables.
>
> To link a dataframe to its codebook, one could do the following "by  
> hand": Create a list, say,
> "somedata", where somedata$DATA is a dataframe that contains the  
> data, and somedata$VARIABLE is also
> a dataframe, but serves as the codebook. For instance, the following  
> function creates a template
> that one could subsequently edit to insert variable labels and
> turn into somedata$VARIABLE.
>
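> fnJunk <- function(THESEDATA) {
>   # From a dataframe, make the start of a codebook (one row per column).
>   if (!is.data.frame(THESEDATA)) stop("THESEDATA must be a data frame")
>   data.frame(
>      Variable = names(THESEDATA)
>      # collapse multi-element classes (e.g. ordered factors) to one string
>      , class = sapply(THESEDATA, function(x) paste(class(x), collapse = "/"))
>      , type = sapply(THESEDATA, typeof)
>      , label = ""
>      , comment = ""
>      # keep label and comment as character so they can be edited later
>      , stringsAsFactors = FALSE
>      )
> }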
>
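
The by-hand arrangement described above could look roughly like this.
make_codebook is a hypothetical stand-in for the template function,
redefined here only so the example runs on its own; iris is a dataset
that ships with R.

```r
# Hypothetical stand-in for the template function: one codebook row per column.
make_codebook <- function(d) {
  stopifnot(is.data.frame(d))
  data.frame(
    Variable = names(d),
    class    = vapply(d, function(x) paste(class(x), collapse = "/"), character(1)),
    type     = vapply(d, typeof, character(1)),
    label    = "",
    comment  = "",
    row.names = NULL,
    stringsAsFactors = FALSE   # keep label/comment editable as plain text
  )
}

# Bundle data and codebook in one list.
somedata <- list(DATA = iris, VARIABLE = make_codebook(iris))

# Fill in a label by hand.
idx <- somedata$VARIABLE$Variable == "Sepal.Length"
somedata$VARIABLE$label[idx] <- "Sepal length in cm"

print(somedata$VARIABLE)

# Nothing keeps the two parts in sync: dropping a column from DATA
# leaves its row behind in VARIABLE.
somedata$DATA$Species <- NULL
print(setdiff(somedata$VARIABLE$Variable, names(somedata$DATA)))
```

The last two lines show the weakness of the plain list: the codebook
still lists a variable that the data no longer contain.
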
>
> But the following automatic behavior would be nice:
>
> 	•       We should be able to treat somedata exactly as we treat a  
> dataframe, so that the
> 	fact that it possesses a "codebook" is merely an added benefit, not  
> an interference with the
> 	usual tasks.
>
> 	•       If we delete a column of somedata$DATA, the associated row  
> of somedata$VARIABLE
> 	should be automatically deleted.
>
> 	•       If we add a column to somedata$DATA, the associated row
> 	should be inserted in somedata$VARIABLE, with some fields, such as
> 	variable name and type, populated automatically.
> 	It could get fancier. For instance:
>
> 	•       If we try to add a value to a field in somedata$DATA which  
> is not permitted by the
> 	"permissible values" listed for this field in somedata$VARIABLE, we  
> get an error.
>
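
One way to get part of this behaviour in base R is to store each label
as an attribute on the column itself and rebuild the codebook on demand,
so that dropped or added columns are reflected without extra
bookkeeping. The sketch below assumes that approach; set_label, codebook
and check_permissible are hypothetical helpers, not functions from any
package. Attributes on plain vectors are generally lost when rows are
subset, so this is a starting point rather than a full solution.

```r
# Hypothetical helper: attach a label to one column as an attribute.
set_label <- function(df, var, label) {
  stopifnot(is.data.frame(df), var %in% names(df))
  attr(df[[var]], "label") <- label
  df
}

# Hypothetical helper: build the codebook from the data frame as it is now.
codebook <- function(df) {
  get_label <- function(x) {
    lab <- attr(x, "label", exact = TRUE)
    if (is.null(lab)) "" else as.character(lab)
  }
  data.frame(
    variable = names(df),
    class    = vapply(df, function(x) paste(class(x), collapse = "/"), character(1)),
    label    = vapply(df, get_label, character(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: stop if a column holds values outside its allowed set.
# 'allowed' is a named list with one vector of permitted values per column.
check_permissible <- function(df, allowed) {
  for (v in intersect(names(allowed), names(df))) {
    observed <- df[[v]][!is.na(df[[v]])]
    bad <- setdiff(unique(observed), allowed[[v]])
    if (length(bad) > 0) {
      stop("Column '", v, "' has values not permitted: ",
           paste(bad, collapse = ", "))
    }
  }
  invisible(TRUE)
}

d <- data.frame(bmi2 = c(22.5, 31.0), smoker = c("yes", "no"),
                stringsAsFactors = FALSE)
d <- set_label(d, "bmi2", "BMI hand-input by clinic personnel, must be checked")

d$smoker <- NULL          # the codebook row disappears with the column
d$visit  <- c(1L, 2L)     # a new column appears with an empty label
print(codebook(d))

check_permissible(d, list(visit = 1:3))   # all values allowed, no error
# check_permissible(d, list(visit = 1L)) would stop with an error
```
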
> Has anyone already thought this through, maybe defined a class and  
> associated methods?
>
> Thanks
>
> Jacob A. Wegelin
> Assistant Professor
> Department of Biostatistics
> Virginia Commonwealth University
> 730 East Broad Street Room 3006
> P. O. Box 980032
> Richmond VA 23298-0032
> U.S.A. E-mail: jwegelin at vcu.edu URL: http://www.people.vcu.edu/~jwegelin

David Winsemius, MD
Heritage Laboratories
West Hartford, CT



