[R] How to ignore data

Bert Gunter gunter.berton at gene.com
Mon Dec 13 18:09:55 CET 2010


>>
>> Values to be ignored
>>
>> 0 - zero and 1 this is in addition to NA (null)
>>
>> The reason is that I need to use the log10 of the values when performing
>> the calculation.
>>
>> Currently I hand massage the data set, about a 100 values, of which less
>> than 5 to 10 are in this category.
>>

This is probably a bad idea, perhaps even a VERY bad idea, though
without knowing the details of what you are doing, one cannot be sure.
The reason is that by removing these values you may be biasing the
analysis. For example, if they are values that fall below some
threshold LOD (limit of detection), they are censored, and this
censoring needs to be explicitly accounted for (e.g. with the survival
package). If they represent in some sense "unusual" values (some call
these "outliers", a pejorative label that I believe should be avoided
given all the scientific and statistical BS associated with the term),
one is then bound to ask, "How unusual? Why unusual? What do they tell
us about the scientific questions of concern?" If they are just
"errors" of some sort (like negative incomes or volumes), well then,
you're probably OK removing them.

The reason I mention this is that I have too often seen scientists use
poor strategies for analyzing censored data, and this can end up
producing baloney conclusions that don't replicate. It's a somewhat
subtle but surprisingly common issue (due to measurement limitations)
that most scientists are trained neither to recognize nor to deal with.
So their problematic approaches are understandable, but unfortunate.
Therefore take care ... and, if necessary, consult your local
statistician for help.
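
To make the censoring point concrete, here is a minimal sketch of how
dropping values below a detection limit can bias a mean on the log
scale, and how a left-censored fit with the survival package accounts
for them. Everything in it is an assumption for illustration: the data
are simulated, the LOD of 0.5 is made up, and the lognormal model is
just one plausible choice, not a recommendation for your data.

```r
library(survival)

set.seed(1)
lod  <- 0.5                                # hypothetical limit of detection
true <- rlnorm(100, meanlog = 0, sdlog = 1)  # simulated "true" values

observed <- true >= lod                    # FALSE means below the LOD
y <- ifelse(observed, true, lod)           # censored values recorded at the LOD

# Naive approach: drop the censored values, then average the logs.
naive_mean <- mean(log(y[observed]))

# Censoring-aware approach: tell survreg() which values are left-censored.
# With type = "left", event = TRUE marks an observed value and
# event = FALSE marks a value known only to lie below y.
fit <- survreg(Surv(y, observed, type = "left") ~ 1, dist = "lognormal")

naive_mean   # mean of log(y) after discarding censored values
coef(fit)    # estimated mean of log(y), natural log scale
```

The intercept is on the natural log scale; dividing it by log(10) gives
the corresponding log10 value. In this simulation the true mean of
log(y) is 0, so the two numbers can be compared directly against it.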

-- Bert

>> The NA values are NOT the problem
>>
>> What I was hoping was that I did not have to use a series of if and
>> ifelse statements. Perhaps there is a more elegant solution.
>
>
>  It would help to have a more precise/reproducible example, but if
> your data set (a data frame) is d, and you want to ignore cases where
> the response variable x is either 0 or 1, you could say
>
>  ds <- subset(d,!x %in% c(0,1))
>
> Some modeling functions (such as lm()), but not all of them, have
> a 'subset' argument so you can provide this criterion on the fly:
>
>  lm(...,subset=(!x %in% c(0,1)))
>
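
A small self-contained sketch of the two suggestions quoted above,
assuming a made-up data frame d with a response x and a hypothetical
predictor g (neither the values nor g come from the original post). It
also shows a third option, recoding 0 and 1 to NA, which avoids
if/ifelse chains because NA is already handled.

```r
# Hypothetical data: x contains 0, 1 and NA, which should all be ignored.
d <- data.frame(
  x = c(0, 1, NA, 10, 100, 1000, 50, 5),
  g = c(1, 2, 3, 1, 2, 3, 2.5, 0.5)
)

# Option 1: build a reduced data frame without the 0 and 1 rows.
# Note that NA %in% c(0, 1) is FALSE, so the NA row is kept here.
ds <- subset(d, !x %in% c(0, 1))

# Option 2: pass the same condition to lm() through its 'subset' argument.
# The remaining NA row is dropped by lm()'s default NA handling.
fit <- lm(log10(x) ~ g, data = d, subset = !x %in% c(0, 1))

# Option 3: recode 0 and 1 to NA, then rely on na.rm = TRUE.
d$x_clean <- replace(d$x, d$x %in% c(0, 1), NA)
mean(log10(d$x_clean), na.rm = TRUE)
```

Whether any of these is appropriate still depends on why the 0 and 1
values are there, as discussed in the reply above.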



-- 
Bert Gunter
Genentech Nonclinical Biostatistics
467-7374
http://devo.gene.com/groups/devo/depts/ncb/home.shtml


