[R] Problem with subset() function?

Steven McKinney smckinney at bccrc.ca
Wed Jan 21 01:00:43 CET 2009


D'oh!  My apologies for the noise.

I thought I had verified the class
from the str() output the user was 
showing me.  

> class(subset(mydf, ht >= 150.0 & wt <= 150.0, select = c(age)))
[1] "data.frame"
> class(subset(mydf, ht >= 150.0 & wt <= 150.0, select = c(age), drop = TRUE))
[1] "integer"
> class(mydf[mydf$ht >= 150.0 & mydf$wt <= 150.0, "age"])
[1] "integer"
> density(subset(mydf, ht >= 150.0 & wt <= 150.0, select = c(age), drop = TRUE))

Call:
	density.default(x = subset(mydf, ht >= 150 & wt <= 150, select = c(age),     drop = TRUE))

Data: subset(mydf, ht >= 150 & wt <= 150, select = c(age), drop = TRUE) (29 obs.);	Bandwidth 'bw' = 5.816

       x                y            
 Min.   : 4.553   Min.   :3.781e-05  
 1st Qu.:22.776   1st Qu.:3.108e-03  
 Median :41.000   Median :1.775e-02  
 Mean   :41.000   Mean   :1.370e-02  
 3rd Qu.:59.224   3rd Qu.:2.128e-02  
 Max.   :77.447   Max.   :2.665e-02  
> 



It's the default for the "drop" argument that differs between
 density(subset(mydf, ht >= 150.0 & wt <= 150.0, select = c(age)))
and
 density(mydf[mydf$ht >= 150.0 & mydf$wt <= 150.0, "age"])

so it is
 subset(mydf, ht >= 150.0 & wt <= 150.0, select = c(age), drop = TRUE)
that is equivalent to
 mydf[mydf$ht >= 150.0 & mydf$wt <= 150.0, "age"]
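
For anyone who wants to check the equivalence directly, here is a minimal
sketch. It rebuilds mydf as in the original post; the names keep,
via_subset_df, via_subset_vec and via_bracket are just illustrative and
not from the thread. The data contain no NAs, so the "modulo exclusion
of NAs" caveat does not come into play here.

```r
# Rebuild the example data frame from the original post.
set.seed(123)
mydf <- data.frame(ht = 150 + 10 * rnorm(100),
                   wt = 150 + 10 * rnorm(100),
                   age = sample(20:60, size = 100, replace = TRUE))

# Logical row filter used by the bracket extractor (illustrative name).
keep <- mydf$ht >= 150 & mydf$wt <= 150

# subset() keeps a one-column data frame unless drop = TRUE is given.
via_subset_df  <- subset(mydf, ht >= 150 & wt <= 150, select = age)
via_subset_vec <- subset(mydf, ht >= 150 & wt <= 150, select = age, drop = TRUE)

# "[" drops to a plain vector when a single column is selected.
via_bracket <- mydf[keep, "age"]

class(via_subset_df)
class(via_subset_vec)
identical(via_subset_vec, via_bracket)

# density() rejects the data frame but accepts either vector.
rejected <- inherits(try(density(via_subset_df), silent = TRUE), "try-error")
dens <- density(via_subset_vec)
```

The identical() call is the relevant check: with drop = TRUE the subset()
result and the bracket result should be the same integer vector.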


Apologies and thanks for setting me straight.


Best

Steven McKinney

Statistician
Molecular Oncology and Breast Cancer Program
British Columbia Cancer Research Centre

email: smckinney +at+ bccrc +dot+ ca

tel: 604-675-8000 x7561

BCCRC
Molecular Oncology
675 West 10th Ave, Floor 4
Vancouver B.C. 
V5Z 1L3
Canada




-----Original Message-----
From: Marc Schwartz [mailto:marc_schwartz at comcast.net]
Sent: Tue 1/20/2009 3:20 PM
To: Steven McKinney
Cc: R-help at r-project.org
Subject: Re: [R] Problem with subset() function?
 
on 01/20/2009 05:02 PM Steven McKinney wrote:
> Hi all,
> 
> Can anyone explain why the following use of
> the subset() function produces a different
> outcome than the use of the "[" extractor?
> 
> The subset() function as used in
> 
>  density(subset(mydf, ht >= 150.0 & wt <= 150.0, select = c(age)))

Here you are asking density to be run on a data frame, which is what
subset returns, even when you select a single column. Thus, you get an
error since density() expects a numeric vector.

No bug in either subset() or the documentation.

You could do this:

  density(subset(mydf, ht >= 150.0 & wt <= 150.0, select = age)[[1]])


> appears to me from documentation to be equivalent to
> 
>  density(mydf[mydf$ht >= 150.0 & mydf$wt <= 150.0, "age"])

Here you are running density on a vector, so it works. This is because
"[.data.frame" defaults to 'drop = TRUE' when a single column is selected,
which means that the returned result is coerced to the lowest possible
dimension. Thus, rather than a single data frame column, a vector is returned.

The result from subset() would be equivalent to using 'drop = FALSE'.
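
As a small illustration of the 'drop' behavior described above, the
following sketch uses a tiny made-up data frame (df and the other names
here are hypothetical, not from the post) to compare the two bracket
forms with subset():

```r
# Tiny hypothetical data frame, just to show the shapes involved.
df <- data.frame(age = c(25L, 40L, 33L), ht = c(160, 148, 155))
rows <- df$ht >= 150

# Single column selected: "[" drops to a plain vector by default.
as_vector <- df[rows, "age"]

# Explicit drop = FALSE keeps a one-column data frame.
as_frame <- df[rows, "age", drop = FALSE]

# subset() also returns a one-column data frame by default.
from_subset <- subset(df, ht >= 150, select = age)

is.data.frame(as_frame)
is.data.frame(from_subset)

# Pulling the column out with [[1]] recovers the plain vector.
identical(from_subset[[1]], as_vector)
```

Either [[1]] on the subset() result or 'drop = TRUE' in the subset() call
yields something density() can use.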

HTH,

Marc Schwartz


> (modulo exclusion of NAs) but use of the former yields an 
> error from density.default() (shown below).
> 
> 
> Is this a bug in the subset() machinery?  Or is it
> a documentation issue for the subset() function
> documentation or density() documentation?
> 
> I'm seeing issues such as this with newcomers to R
> who initially seem to prefer using subset() instead
> of the bracket extractor.  At this point these functions
> are clearly not exchangeable.  Should code be patched
> so that they are, or documentation amended to show
> when use of subset() is not appropriate?
> 
>> ### Bug in subset()?
> 
>> set.seed(123)
>> mydf <- data.frame(ht = 150 + 10 * rnorm(100),
> +                    wt = 150 + 10 * rnorm(100),
> +                    age = sample(20:60, size = 100, replace = TRUE)
> +                    )
> 
> 
>> density(subset(mydf, ht >= 150.0 & wt <= 150.0, select = c(age)))
> Error in density.default(subset(mydf, ht >= 150 & wt <= 150, select = c(age))) : 
>   argument 'x' must be numeric
> 
> 
>> density(mydf[mydf$ht >= 150.0 & mydf$wt <= 150.0, "age"])
> 
> Call:
> 	density.default(x = mydf[mydf$ht >= 150 & mydf$wt <= 150, "age"])
> 
> Data: mydf[mydf$ht >= 150 & mydf$wt <= 150, "age"] (29 obs.);	Bandwidth 'bw' = 5.816
> 
>        x                y            
>  Min.   : 4.553   Min.   :3.781e-05  
>  1st Qu.:22.776   1st Qu.:3.108e-03  
>  Median :41.000   Median :1.775e-02  
>  Mean   :41.000   Mean   :1.370e-02  
>  3rd Qu.:59.224   3rd Qu.:2.128e-02  
>  Max.   :77.447   Max.   :2.665e-02  
> 



