[R] Reshaping dataframes

Thu Aug 23 18:09:24 CEST 2012

On Aug 23, 2012, at 2:02 AM, Ingmar Schuster wrote:

> Thanks Rui!
>
> Anybody with ideas regarding filling _while_ binding data frames
> instead of afterwards?

Not sure what you mean by "_while_ binding dataframes", but the
original question seems answered by this sentence from the help file  
for factor:

"For a numeric x, set exclude=NULL to make NA an extra level (prints  
as <NA>); by default, this is the last level."

fac <- factor(fac, exclude=NULL)  # would skip all that `is.na()`,
                                  # `level=` gymnastics
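
As a small illustration of that help-file sentence (a sketch only: the toy vector `f` is made up for demonstration, and the label "NOT_REALIZED" is borrowed from the original post), NA becomes a real level and can then be renamed if a printable label is wanted:

```r
# Toy factor with one missing value (illustrative data, not from the thread)
f <- factor(c("used", NA, "used"))
levels(f)                      # NA is not a level by default

# exclude = NULL keeps NA as an extra (last) level
f <- factor(f, exclude = NULL)
levels(f)

# Optionally give the NA level a printable name
levels(f)[is.na(levels(f))] <- "NOT_REALIZED"
as.character(f)
```

Once NA is a level, is.na(f) no longer flags those entries, which is presumably what a package that rejects NA values needs.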

If you want to loop over factor dataframe columns:

facidx <-  sapply(d, is.factor)
# d[facidx] stays a data frame even if only one column is a factor
d[facidx] <- lapply(d[facidx], factor, exclude=NULL)
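
A self-contained sketch of that column loop, assuming a small made-up data frame modelled on the one in the original post (the numeric column `n` is hypothetical, added only to show that non-factor columns are left alone). It uses list-style indexing, `d[facidx]`, which keeps a data frame even when only one column is a factor:

```r
# Illustrative data: two factor columns and one numeric column
d <- data.frame(
  trg_type     = factor(c("Scientists", "of")),
  child_type_1 = factor(c(NA, "used")),
  n            = c(1, NA)
)

# Logical index of the factor columns
facidx <- vapply(d, is.factor, logical(1))

# Rebuild each factor column so that NA becomes an explicit level
d[facidx] <- lapply(d[facidx], factor, exclude = NULL)

levels(d$child_type_1)   # now includes NA as a level
levels(d$trg_type)       # unchanged: this column had no missing values
str(d)
```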

I see no parameters to data.frame or read.table that would allow
specifying anything other than the default behavior of factor().

-- 
David

>
> Ingmar
>
> 2012/8/22 Rui Barradas <ruipbarradas at sapo.pt>
>
>> Hello,
>>
>> Your function doesn't seem to be very difficult to generalize.
>>
>> d <- read.table(text="
>>
>>   trg_type child_type_1
>> 1 Scientists NA
>> 2        of         used
>> ", header=TRUE, stringsAsFactors=TRUE)  # explicit: default is FALSE since R 4.0.0
>> str(d)
>>
>> subs_na <- function(tok, na_factor_level = "NOT_REALIZED", na_num = 99999)
>> {
>>    ifac <- which(sapply(tok, is.factor))
>>    inum <- which(sapply(tok, is.numeric))
>>    for(i in ifac) {
>>        levels(tok[, i]) <- c(levels(tok[, i]), na_factor_level)
>>        tok[is.na(tok[, i]), i] <- as.factor(na_factor_level)
>>    }
>>    for(i in inum)
>>        tok[is.na(tok[, i]), i] <- na_num
>>    tok
>> }
>>
>> r1 <- substitute_na(d)
>> r2 <- subs_na(d)
>> str(r1)
>> str(r2)
>> identical(r1, r2)  # TRUE
>>
>> You could use the same coding for characters, Dates, etc.
>>
>> Hope this helps,
>>
>> Rui Barradas
>>
>> Em 22-08-2012 20:16, Ingmar Schuster escreveu:
>>
>>> Hi,
>>>
>>> I have a data set with variables that are _not_ missing at random. Now I
>>> use a package for learning a Bayesian Network which won't accept NA as a
>>> value. From a database I query data.frames with k, k+n, k+2n, ... variables
>>> (there are always at least k variables as leftmost columns). Using
>>> rbind.fill from the reshape package on two data frames I would get a data
>>> frame like
>>>
>>>    trg_type child_type_1
>>> 1 Scientists NA
>>> 2        of         used
>>>
>>> Now to get rid of NA values I use the following function, which works for
>>> data frames with only factor values:
>>>
>>>   substitute_na <- function(tok, na_factor_level = "NOT_REALIZED") {
>>>     for (i in seq_along(tok)) {
>>>       levels(tok[,i]) <- c(levels(tok[,i]), na_factor_level)
>>>     }
>>>     tok[is.na(tok)] <- as.factor(na_factor_level)
>>>     return(tok)
>>>   }
>>>
>>> Is there a better/faster way to do it? It would also be great to be able
>>> to distinguish factor columns from numeric columns and use a special
>>> numeric value there. The current version of rbind.fill makes no direct
>>> reference to the fill value so that I could change its implementation for
>>> my purpose.
>>>
>>>
>>> Thanks!
>>>
>>> Ingmar
>>>
>>>
>>
>
>
> -- 
> Ingmar Schuster
> Natural Language Processing Group
> Department of Computer Science
> University of Leipzig
> Johannisgasse 26
> 04103 Leipzig, Germany
>
> Tel. +49 341 9732205
>
> http://asv.informatik.uni-leipzig.de/en/staff/Ingmar_Schuster

David Winsemius, MD
Alameda, CA, USA