[R] Handle missing values

Mon Jun 23 11:59:00 CEST 2008

On 23-Jun-08 09:35:10, Francisco Pastor wrote:
> Hi everyone
> I am new to R and have a question about missing values. I am
> trying to do a cluster analysis of monthly temperatures and
> my data are 14 columns with spatial coordinates (lat,lon)
> and 12 monthly values:
> 
>   lat - lon - temp1 - temp2 - temp3 - .... - temp12
> 
> If I omit missing values (my missing values are 99.00) with
> 
>   mydata <- na.omit(mydata)
> 
> every row with a missing value (i.e. eleven good temperature values
> and one month missing) is deleted. I would like to retain all valid
> values for the k-means analysis, excluding only the missing ones.
> 
> I've been trying and searching about na.omit, na.action, na.exclude
> but can't find the right point.
> 
> Any help would be appreciated.

As ?na.omit states, "incomplete cases" (any row in which one or
more values are missing) are removed by na.omit(), so you are
getting what you ask for.
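
To make this concrete, here is a minimal sketch with made-up data
(the data frame "toy" and all its values are hypothetical, not your
data). It assumes that the 99.00 codes first have to be turned into
NA, since na.omit() only acts on NA and knows nothing about 99.00:

```r
# Hypothetical toy data: 4 stations, 3 "monthly" temperatures,
# with 99.00 used as the missing-value code
toy <- data.frame(lat   = c(39.5, 39.6, 39.7, 39.8),
                  lon   = c(-0.4, -0.5, -0.6, -0.7),
                  temp1 = c(10.1, 99.00, 11.3, 10.8),
                  temp2 = c(12.0, 12.4, 99.00, 12.9),
                  temp3 = c(15.2, 15.9, 16.1, 15.5))

# Recode 99.00 as NA in the temperature columns only
# (so that a coordinate can never be mistaken for a missing code)
temps <- toy[, 3:5]
temps[temps == 99] <- NA
toy[, 3:5] <- temps

# na.omit() drops every row that contains at least one NA
complete <- na.omit(toy)
nrow(toy)       # all rows
nrow(complete)  # only the rows with no NA at all
```

With your own data the column range would be 3:14 rather than 3:5.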

Also, many functions "silently" do the same thing by default. For
example, fitting a linear model with lm() will also remove
incomplete cases unless you change its na.action setting.
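
A small sketch of this behaviour, using simulated data (the data
frame "d" is hypothetical, and I am assuming the default setting
options(na.action = "na.omit") has not been changed):

```r
# Hypothetical data: 10 cases, two of which have a missing response
set.seed(1)
d <- data.frame(x = 1:10, y = 2 * (1:10) + rnorm(10))
d$y[c(3, 7)] <- NA

# No warning is given: the incomplete cases are simply dropped
fit <- lm(y ~ x, data = d)

nobs(fit)              # number of cases actually used in the fit
length(fit$na.action)  # number of cases that were dropped
```

The component fit$na.action records which rows were left out, so you
can at least check afterwards what was removed.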

What happens when you apply a function for clustering would
depend on how the function is written to deal with incomplete
cases. I'm no expert on the various clustering functions in R,
so hope others can give specific advice.

Often, however, to do what you want will require code to be
written specially. For example, if you have 14 columns as in
your example with columns 3-14 temperatures, and you wanted
to compute means, variances and covariances of the temperatures,
then for the means you could simply take the temperatures one
by one, and compute the mean over the non-missing values.
Similarly for the variances. For the covariances you could
take the "pairwise complete" cases: for each pair of temperatures
(say col 3 and col 7) you would use the cases

  mydata[(!is.na(mydata[,3]))&(!is.na(mydata[,7])),c(3,7)]

And so on. However, you could end up with inconsistencies
between variances and covariances with such code (e.g. the
variance-covariance matrix might not be positive semi-definite);
this would not happen if you confined yourself to complete cases.
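
As a rough sketch of the above (the matrix "temps" is made up for
illustration, with NA already in place of the missing values), base R
can do most of this without special code: colMeans() and var() accept
na.rm=TRUE, and cov() has a use="pairwise.complete.obs" option. The
by-hand version for one pair of columns is included for comparison:

```r
# Hypothetical temperatures: 5 cases, 3 months, some values missing
temps <- cbind(t1 = c(10, 11, NA, 13, 12),
               t2 = c(12, NA, 14, 15, 13),
               t3 = c(15, 16, 17, NA, 16))

# Means and variances, column by column, over the non-missing values
colMeans(temps, na.rm = TRUE)
apply(temps, 2, var, na.rm = TRUE)

# Covariance of columns 1 and 2 by hand, using only the cases
# where both values are present
ok <- !is.na(temps[, 1]) & !is.na(temps[, 2])
cov12 <- cov(temps[ok, 1], temps[ok, 2])

# The same thing for every pair of columns at once
V <- cov(temps, use = "pairwise.complete.obs")
all.equal(V[1, 2], cov12)

# A negative eigenvalue here would show the inconsistency
# mentioned above
eigen(V, symmetric = TRUE)$values
```

Checking the eigenvalues is a cheap way to see whether a pairwise
matrix is still usable for whatever you intend to do next.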

So it all depends on what you want to do with the data, and
on how the R functions which address your objectives behave
when faced with incomplete data.

Hoping this helps,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 23-Jun-08                                       Time: 10:58:57
------------------------------ XFMail ------------------------------