[R] Eliminate cases in a subset of a dataframe

Mon Sep 14 18:55:22 CEST 2009

Hi Holger,

On Sep 14, 2009, at 10:57 AM, Hollix wrote:

>
> Hi folks,
>
> I created a subset of a dataframe (i.e., selected only men):
>
> subdata <- subset(data,data$gender==1)
>
> After a residual diagnostic of a regression analysis, I detected three
> outliers:
>
> linmod <- lm(y ~ x, data=subdata)
> plot(linmod)
>
> Say, the cases 11,22, and 33 were outliers.
>
> Here comes the problem: When I want to exclude these three cases in a
> further regression analysis,
> - for instance with linmod2 <- lm(y[-c(11,22,33)] ~ x[-c(11,22,33)],
> data=subdata) - it does not work.

I suspect that your x is actually a 2d matrix rather than a vector, so
you might need to do:

R> lm(y[-c(11,22,33)] ~ x[-c(11,22,33),])

Note the trailing comma after the -c() vector when indexing into x!
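
To see what the trailing comma changes, here is a minimal sketch on
made-up data. The matrix x, the vector y, and the name outliers are
assumptions for illustration, not your actual objects:

```r
# Hypothetical toy data: a 40 x 2 predictor matrix and a response vector.
set.seed(1)
x <- matrix(rnorm(80), nrow = 40, ncol = 2)
y <- rnorm(40)
outliers <- c(11, 22, 33)

# With the trailing comma, whole rows are dropped and x stays a matrix.
x_rows <- x[-outliers, ]
dim(x_rows)       # 37 2

# Without the comma, the matrix is indexed as one long vector, so only
# three single elements are dropped and the result no longer lines up with y.
x_flat <- x[-outliers]
length(x_flat)    # 77

# Response and predictors now have the same number of rows (37).
fit <- lm(y[-outliers] ~ x_rows)
```

If x is a plain vector, or a column of the data frame, the comma is not
needed.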

Alternatively, you can just remove those rows from your data and keep
your formula "clean", like so:

R> linmod2 <- lm(y ~ x, data=subdata[-c(11,22,33),])
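
A self-contained sketch of that approach, on a simulated data frame
standing in for yours (the column values and the name outliers are
assumptions). It also shows the subset argument of lm(), which should
give the same fit without creating a second data frame:

```r
# Hypothetical data frame standing in for the poster's 'data'.
set.seed(42)
data <- data.frame(gender = rep(c(1, 2), times = 50), x = rnorm(100))
data$y <- 2 * data$x + rnorm(100)

subdata <- subset(data, gender == 1)   # 50 rows
outliers <- c(11, 22, 33)              # positions within subdata

# Option 1: remove the rows before fitting.
linmod2 <- lm(y ~ x, data = subdata[-outliers, ])

# Option 2: keep subdata intact and let lm() drop the rows via 'subset'.
linmod3 <- lm(y ~ x, data = subdata, subset = -outliers)

# Both fits use the same 47 rows, so the coefficients agree.
all.equal(coef(linmod2), coef(linmod3))
```

Note that in both options the numbers are positions within subdata, not
within the original data.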

> I guess this has something to do with this strange "row.names"-vector
> which has been added to the dataframe when creating the subset. I find it
> very strange why R gives the case numbers in the diagnostics but then
> doesn't allow me to use these numbers for further exclusion.

Hmm, I'm not sure what you mean, but the row names won't get in your way
as long as you use integers to index into your data.frame by position.

> Can anybody tell me:
> 1. what this row.names vector is
> 2. How I can refer to cases after creating a subset (e.g., in order to
> exclude them).

Refer to them by their position in the data.frame, just as you would if
you hadn't created a subset.
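
One caveat worth checking on your side: as far as I know, subset() keeps
the row names of the original data frame, and plot() on an lm fit labels
the extreme points with those row names by default. If so, a point
labelled 11 in the diagnostics need not be the 11th row of subdata. A
rough sketch with made-up data (the data frame and the labels in flagged
are hypothetical):

```r
# Hypothetical data: men (gender == 1) sit in the odd-numbered rows.
data <- data.frame(gender = rep(c(1, 2), times = 20),
                   x = 1:40, y = (1:40)^2)
subdata <- subset(data, gender == 1)

# 1. row.names: subset() keeps the labels from the original data frame.
head(rownames(subdata))                    # "1" "3" "5" "7" "9" "11"

# The row labelled "11" is the 6th row of subdata, not the 11th.
pos11 <- which(rownames(subdata) == "11")  # 6

# 2a. Exclude cases by label (row names are character strings).
flagged <- c("11", "23", "33")   # hypothetical labels read off the plot
cleaned <- subdata[!(rownames(subdata) %in% flagged), ]

# 2b. Or renumber the rows before fitting, so labels and positions agree.
rownames(subdata) <- NULL
```

With 2b done right after subset(), the numbers shown in the diagnostic
plots should match the positions used in subdata[-c(...), ].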

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
   |  Memorial Sloan-Kettering Cancer Center
   |  Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact