[R] a question about subsetting

arun smartpink111 at yahoo.com
Sun Jun 3 20:51:58 CEST 2012


Hi,

I am not sure whether your use of subset is correct.  The documentation (http://stat.ethz.ch/R-manual/R-devel/library/base/html/subset.html) shows the subset function being called as subset(data, condition), not as subset=data==condition.  Also, the approach I describe here uses a different data layout.  In your data, Group1 and Group2 are separate columns, each holding the same values for the independent variables.  Normally, different groups (or the levels of a factor) go in a single column, like this:
> dat2
   ID Group Mem   Gen Chance MSELGM MSELVR MSELFM MSELRL MSELEL ADOS Age
1   1     1  75  50.0     50     53     52     62     57     56    3  25
2   2     1  75  12.5     50     46     48     47     52     55    2  30
3   3     1  25  37.5     50     48     43     52     63     63    3  24
4   4     1  25  37.5     50     51     62     52     59     54    0  31
5   5     1  50  87.5     50     45     58     42     46     43    6  31
6   6     1 100 100.0     50     45     80     49     69     63    1  31
7   7     2  75  50.0     50     53     52     62     57     56    3  25
8   8     2  75  12.5     50     46     48     47     52     55    2  30
9   9     2  25  37.5     50     48     43     52     63     63    3  24
10 10     2  25  37.5     50     51     62     52     59     54    0  31
11 11     2  50  87.5     50     45     58     42     46     43    6  31
12 12     2 100 100.0     50     45     80     49     69     63    1  31
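
In case it helps to reproduce this, here is a minimal sketch of how a data frame like dat2 could be built. It is only a reconstruction from the rows printed above; the helper name `base` is my own and is not part of your data.

```r
# Hypothetical reconstruction of dat2 from the printed rows.
# `base` is an illustrative name holding the six rows of measurements.
base <- data.frame(
  Mem    = c(75, 75, 25, 25, 50, 100),
  Gen    = c(50, 12.5, 37.5, 37.5, 87.5, 100),
  Chance = 50,
  MSELGM = c(53, 46, 48, 51, 45, 45),
  MSELVR = c(52, 48, 43, 62, 58, 80),
  MSELFM = c(62, 47, 52, 52, 42, 49),
  MSELRL = c(57, 52, 63, 59, 46, 69),
  MSELEL = c(56, 55, 63, 54, 43, 63),
  ADOS   = c(3, 2, 3, 0, 6, 1),
  Age    = c(25, 30, 24, 31, 31, 31)
)

# Stack the same six rows once per group, so that Group is a single column.
dat2 <- data.frame(ID = 1:12, Group = rep(1:2, each = 6), rbind(base, base))
```

With the groups in one column, any function that splits or subsets on Group can be used without repeating the code per column.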


> dat3 <- subset(dat2, Group == 1)
> dat4 <- subset(dat2, Group == 2)
> dat4
   ID Group Mem   Gen Chance MSELGM MSELVR MSELFM MSELRL MSELEL ADOS Age
7   7     2  75  50.0     50     53     52     62     57     56    3  25
8   8     2  75  12.5     50     46     48     47     52     55    2  30
9   9     2  25  37.5     50     48     43     52     63     63    3  24
10 10     2  25  37.5     50     51     62     52     59     54    0  31
11 11     2  50  87.5     50     45     58     42     46     43    6  31
12 12     2 100 100.0     50     45     80     49     69     63    1  31


> fit1 <- lm(Gen ~ MSELEL, data = dat3)
> fit2 <- lm(Gen ~ MSELEL, data = dat4)

> cor.test(dat3$Gen, dat3$MSELEL, method = "pearson")
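
On why subset= seemed to do nothing in cor.test: as far as I know, the default method, cor.test(x, y, ...), has no subset argument, so an extra subset= is passed into "..." and ignored, whereas the formula method does accept subset. The sketch below compares the two on a cut-down copy of the data with only the three columns needed; please treat it as an illustration rather than the definitive way to do it.

```r
# Cut-down copy of the example data: only the columns used here.
dat2 <- data.frame(
  Group  = rep(1:2, each = 6),
  Gen    = rep(c(50, 12.5, 37.5, 37.5, 87.5, 100), times = 2),
  MSELEL = rep(c(56, 55, 63, 54, 43, 63), times = 2)
)

# Option 1: subset the data first, then use the default (x, y) method.
dat3 <- subset(dat2, Group == 1)
r_default <- cor.test(dat3$Gen, dat3$MSELEL, method = "pearson")

# Option 2: the formula method, which does take a subset argument.
r_formula <- cor.test(~ Gen + MSELEL, data = dat2,
                      subset = Group == 1, method = "pearson")

# Option 3: one test per group, referring to columns by name.
by_group <- lapply(split(dat2, dat2$Group), function(d) {
  cor.test(d$Gen, d$MSELEL, method = "pearson")
})
```

Options 1 and 2 should give the same estimate for group 1; option 3 returns a list with one result per level of Group.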

In the sample dataset that you showed here, you will get the same correlation and regression results for both groups, because the values of the dependent and independent variables are identical in the two groups.
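
A quick way to check this, sketched on a cut-down copy of the example data (only Group, Gen and MSELEL), is to fit the model per group and compare the coefficients. The last line also fits group 1 through lm's own subset argument, which lm does support.

```r
# Cut-down copy of the example data: identical values in both groups.
dat2 <- data.frame(
  Group  = rep(1:2, each = 6),
  Gen    = rep(c(50, 12.5, 37.5, 37.5, 87.5, 100), times = 2),
  MSELEL = rep(c(56, 55, 63, 54, 43, 63), times = 2)
)

# Fit the same regression separately within each group.
fit1 <- lm(Gen ~ MSELEL, data = subset(dat2, Group == 1))
fit2 <- lm(Gen ~ MSELEL, data = subset(dat2, Group == 2))

# TRUE when both groups yield the same intercept and slope.
same_coef <- isTRUE(all.equal(coef(fit1), coef(fit2)))

# Equivalent fit for group 1 using lm's subset argument.
fit1_alt <- lm(Gen ~ MSELEL, data = dat2, subset = Group == 1)
```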

I hope this helps.



A.K.

  



----- Original Message -----
From: jacaranda tree <myjacaranda at yahoo.com>
To: "R-help at r-project.org" <R-help at r-project.org>
Cc: 
Sent: Sunday, June 3, 2012 11:51 AM
Subject: [R] a question about subsetting

Hi all,
I started using R about 3 weeks ago, and I have now pretty much figured out how to do the types of statistical modeling, graphs, tables etc. that I frequently use (with zero background in computer languages or in statistical packages similar to R, like S or SAS!). So it has been quite a rewarding process so far, and I thank all of you R gurus for your generous help!
That being said, my question is about applying a model or an analysis to different groups based on a grouping variable. Below are the first six rows of my data:

  ID Group1 Group2 Mem   Gen Chance MSELGM MSELVR MSELFM MSELRL MSELEL ADOS Age
1  1      1      1  75  50.0     50     53     52     62     57     56    3  25
2  2      1      1  75  12.5     50     46     48     47     52     55    2  30
3  3      1      1  25  37.5     50     48     43     52     63     63    3  24
4  4      1      1  25  37.5     50     51     62     52     59     54    0  31
5  5      1      1  50  87.5     50     45     58     42     46     43    6  31
6  6      1      1 100 100.0     50     45     80     49     69     63    1  31

Group1: First grouping variable
Group2: Second grouping variable
Mem: Memory trial
Gen: Generalization trial
MSEL: Mullen Scales of Early Learning (a scale measuring various skills in young children). GM: Gross Motor, VR: Visual Reception, FM: Fine Motor, RL: Receptive Language, EL: Expressive Language.
ADOS: An autism-specific measure.

First I wanted to correlate Generalization (variable Gen) with expressive language (MSELEL) within each level of Group1. For this, I used the lapply or by functions, which work just fine. Here is the code with lapply:
lapply(split(mydata, mydata$Group1), function(x){cor.test(x[,5], x[,11], method = "pearson")})

Then I did regression. My DV is the variable Gen, and the IV is MSELEL. And again I wanted to do this for each group. Here is the code I came up with for each group:
fit1<-lm(Gen~ MSELEL, data=mydata, subset=mydata$Group1==1)

fit2<-lm(Gen~MSELEL, data=mydata, subset=mydata$Group1==2)

This works fine for regression, but when I used the "subset" argument with the correlation (e.g. cor.test(mydata$Gen, mydata$MSELEL, method="pearson", subset=mydata$Group1==1)), it did not work. It just computed the correlation for the entire dataset and returned that same result for both groups. I was just curious as to why subset works with regression, but not with correlation. Any thoughts?
Thanks,


______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



