[R] how to replace NA with a specific score that is dependant on another indicator variable

David Winsemius dwinsemius at comcast.net
Wed Sep 1 16:30:12 CEST 2010


On Sep 1, 2010, at 10:19 AM, David Winsemius wrote:

>
> On Sep 1, 2010, at 9:55 AM, David Winsemius wrote:
>
>>
>> On Sep 1, 2010, at 9:20 AM, Chris Howden wrote:
>>
>>> Hi everyone,
>>>
>>>
>>>
>>> I’m looking for a clever bit of code to replace NA’s with a  
>>> specific score
>>> depending on an indicator variable.
>>>
>>> I can see how to do it using lots of if statements but I’m sure
>>> there must be a neater, better way of doing it.
>>>
>>> Any ideas at all will be much appreciated, I’m dreading coding up  
>>> all those
>>> if statements!!!!!
>>>
>>> My problem is as follows:
>>>
>>> I have a data set with lots of missing data:
>>>
>>> EG Raw Data Set
>>>
>>> Category  variable1  variable2  variable3
>>>    1          5         NA         NA
>>>    1         NA          3          4
>>>    2         NA          7         NA
>>
>> This does not do its work by category (since I got tired of fixing  
>> mangled htmlized datasets) but it seems to me that a tapply "wrap"  
>> could do either of these operations within categories:
>
> Why not try out Hadley's plyr package?
>
> require(plyr)
>  ddply(egraw2, .(category), .fun=function(df) {
>                sapply(df[-1],
#Take out the [-1]
>                     function(x) {mnx <- mean(x, na.rm=TRUE);
>                                  sapply(x, function(z) if (is.na(z)) {mnx} else {z})
>                                 }
>                        )                      }          )
>
> Tested on
> egraw2 <- data.frame(category=rep(1:4, 4),
>                    var1=sample(c(1:3, NA,NA), 16, replace =TRUE),
>                    var2=sample(c(5:10, NA,NA), 16, replace =TRUE),
>                   var3=sample(c(15:20, NA,NA), 16, replace =TRUE) )

That version did not throw an error, and only after I sorted both that
data frame and the first ddply result did I see that some sort of
misregistration had occurred. Better, with the [-1] removed:

res <- ddply(egraw2, .(category), .fun = function(df) {
    sapply(df, function(x) {
        mnx <- mean(x, na.rm = TRUE)
        sapply(x, function(z) if (is.na(z)) mnx else z)
    })
})
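
For anyone who would rather not depend on plyr, here is a minimal base R
sketch of the same by-category imputation using ave(). It is not the code
from this thread: the helper name impute_group_mean and the set.seed() call
are assumptions added for illustration and repeatability, and the test data
only mimics the shape of egraw2. Because ave() returns values in the
original row order, the final check confirms that every observed value
stays where it was, which is the kind of misregistration worth testing for.

```r
# Test data in the same shape as egraw2; the seed is an assumption,
# added only so the example is repeatable.
set.seed(42)
egraw2 <- data.frame(category = rep(1:4, 4),
                     var1 = sample(c(1:3, NA, NA), 16, replace = TRUE),
                     var2 = sample(c(5:10, NA, NA), 16, replace = TRUE),
                     var3 = sample(c(15:20, NA, NA), 16, replace = TRUE))

# Hypothetical helper: replace each NA with the mean of the non-missing
# values in the same group. A group with no observed values yields NaN.
impute_group_mean <- function(x, g) {
  ave(x, g, FUN = function(v) replace(v, is.na(v), mean(v, na.rm = TRUE)))
}

# Apply the helper to every column except the grouping column.
res2 <- egraw2
vars <- setdiff(names(egraw2), "category")
res2[vars] <- lapply(egraw2[vars], impute_group_mean, g = egraw2$category)

# Registration check: every originally observed value is unchanged
# and still sits in its original row.
obs <- !is.na(egraw2[vars])
stopifnot(all(as.matrix(res2[vars])[obs] == as.matrix(egraw2[vars])[obs]))
```

Unlike the ddply result, res2 keeps the rows in their original order, so
it can be compared against egraw2 without sorting either one first.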

>
> -- 
> David.
>>
>>
>> > egraw
>> Category variable1 variable2 variable3
>> 1        1         5        NA        NA
>> 2        1        NA         3         4
>> 3        2        NA         7        NA
>>
>> > lapply(egraw, function(x) {mnx <- mean(x, na.rm=TRUE)
>>                            sapply(x, function(z) if (is.na(z)) {mnx} else {z})
>>                           }
>>        )
>> $Category
>> [1] 1 1 2
>>
>> $variable1
>> [1] 5 5 5
>>
>> $variable2
>> [1] 5 3 7
>>
>> $variable3
>> [1] 4 4 4
>>
>> > sapply(egraw, function(x) {mnx <- mean(x, na.rm=TRUE)
>>                            sapply(x, function(z) if (is.na(z)) {mnx} else {z})
>>                           }
>>             )
>>    Category variable1 variable2 variable3
>> [1,]        1         5         5         4
>> [2,]        1         5         3         4
>> [3,]        2         5         7         4
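
The whole-column replacement above can also be written without the inner
sapply(), since replace() and is.na() are vectorized. This is only a
sketch: the egraw data frame is rebuilt here from the three rows printed
above, and fill_col_mean is a hypothetical helper name, not something from
the original post.

```r
# The three example rows from the original question.
egraw <- data.frame(Category  = c(1, 1, 2),
                    variable1 = c(5, NA, NA),
                    variable2 = c(NA, 3, 7),
                    variable3 = c(NA, 4, NA))

# Hypothetical helper: replace NA with the mean of the whole column
# (not by category).
fill_col_mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# lapply() over the columns, then back to a data frame.
filled <- as.data.frame(lapply(egraw, fill_col_mean))
print(filled)
# variable1 becomes 5 5 5, variable2 becomes 5 3 7, variable3 becomes 4 4 4
```

Wrapping the result in as.data.frame() keeps the column types, whereas
sapply() collapses everything to a numeric matrix.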
>>
>>>
>>>  etc
>>>
>>> Now I want to replace the NA’s with the average for each category,  
>>> so if
>>> these averages were:
>>>
>>> EG Averages
>>>
>>> Category  variable1  variable2  variable3
>>>    1         4.5        3.2        2.5
>>>    2         3.5        7.4        5.9
>>>
>>>
>>>
>>> So I’d like my data set to look like the following once I’ve  
>>> replaced the
>>> NA’s with the appropriate category average:
>>>
>>> EG Imputed Data Set
>>>
>>> Category  variable1  variable2  variable3
>>>    1          5         3.2        2.5
>>>    1         4.5         3          4
>>>    2         3.5         7         5.9
>>>
>>>  etc
>>>
>>> Any ideas would be very much appreciated!!!!!
>>
>> You might add reading the Posting Guide and setting up your reader
>> to post in plain text to your TODO list.
>>>
>>> thankyou
>>>
>>> Chris Howden
>>
>>> .
>>
>> David Winsemius, MD
>> West Hartford, CT
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
> David Winsemius, MD
> West Hartford, CT
>

David Winsemius, MD
West Hartford, CT


