[R] correlation with missing values.. different answers

peter dalgaard pdalgd at gmail.com
Mon Apr 14 15:45:57 CEST 2014


On 14 Apr 2014, at 05:02 , Paul Tanger <paul.tanger at colostate.edu> wrote:

> Thanks, I did not realize it was deleting rows!  I was afraid to try
> "pairwise.complete.obs" because it said something about resulting in a
> matrix which is not "positive semi-definite" (and googling that term
> just confused me more).  

It means that the result need not be a valid covariance (or correlation) matrix. I.e., it may imply that some linear combination of your variables has negative variance. It may turn out not to be a problem in practice, but that sort of thing tends to worry theoreticians.
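
A small made-up illustration of how that can happen (the vectors x, y and z are invented for the purpose, not taken from your data): if each pair of variables is observed together on a different set of rows, the pairwise correlations can contradict one another, and the resulting matrix has a negative eigenvalue.

```r
# Toy data, invented for illustration: each pair of variables is
# observed together on a different block of three rows.
x <- c(1, 2, 3, NA, NA, NA, 1, 2, 3)
y <- c(1, 2, 3, 1, 2, 3, NA, NA, NA)
z <- c(NA, NA, NA, 1, 2, 3, 3, 2, 1)

# Pairwise deletion: x-y and y-z correlate at +1, yet x-z at -1
R <- cor(cbind(x, y, z), use = "pairwise.complete.obs")
R

# A valid correlation matrix has no negative eigenvalues
ev <- eigen(R, symmetric = TRUE)$values
min(ev) < 0

# Implied "variance" of the combination x - y + z (standardized variables)
w <- c(1, -1, 1)
neg_var <- drop(t(w) %*% R %*% w)
neg_var < 0
```

With real data the inconsistency is usually far milder than in this extreme case, which is why it often does not matter in practice.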

> But I ran the dataset through JMP and got the
> same answers so I think that "pairwise.complete.obs" works for what I
> want to do.
> 

Actually, JMP 10 claims to be using the REML method, which is different from pairwise correlations (you can get both, so it is easy to check that they differ). I'm not sure we have the REML method coded up anywhere; the ML counterpart is in package mvnmle, and one might hope that REML isn't all that much harder.
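
Independent of what JMP does, the difference between the two `use` settings discussed in this thread is easy to see on the R side with the help-page data quoted below. A rough sketch; `ok`, `obs` and `n_pair` are just names made up here:

```r
# Rebuild the example data from the cor() help page
swM <- swiss
colnames(swM) <- abbreviate(colnames(swiss), minlength = 6)
swM[1, 2] <- swM[7, 3] <- swM[25, 5] <- NA

# "na.or.complete" drops every row that has an NA in ANY column
ok <- complete.cases(swM)
sum(ok)                       # rows left after listwise deletion
r_listwise <- cor(swM, use = "na.or.complete")
r_sub <- cor(swM[ok, ])       # same thing done by hand
all.equal(r_listwise, r_sub)

# "pairwise.complete.obs" drops rows separately for each pair of columns
r_pair <- cor(swM, use = "pairwise.complete.obs")
obs <- 1 - is.na(as.matrix(swM))   # 1 where observed, 0 where missing
n_pair <- crossprod(obs)           # number of rows used for each pair
n_pair["Frtlty", "Agrclt"]

# The two methods give different answers for the same pair
c(listwise = r_listwise["Frtlty", "Agrclt"],
  pairwise = r_pair["Frtlty", "Agrclt"])
```

Subsetting to two columns before calling cor() changes which rows count as complete, which is why the original two calls disagreed.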

> On Sun, Apr 13, 2014 at 7:36 PM, arun <smartpink111 at yahoo.com> wrote:
>> 
>> 
>> 
>> Hi,
>> 
>> I think in this case, when you use "na.or.complete", every row that contains an NA in any column is removed from the full dataset before the correlations are computed.
>> cor(swM[-1,1:2])
>> #          Frtlty    Agrclt
>> #Frtlty 1.0000000 0.3920289
>> #Agrclt 0.3920289 1.0000000
>> 
>> cor(swM[-1,])[1:2,1:2]
>> #          Frtlty    Agrclt
>> #Frtlty 1.0000000 0.3920289
>> #Agrclt 0.3920289 1.0000000
>> 
>> Maybe you can try with "pairwise.complete.obs"
>> cor(swM, use = "pairwise.complete.obs")
>> #           Frtlty      Agrclt     Exmntn      Eductn     Cathlc      Infn.M
>> #Frtlty  1.0000000  0.39202893 -0.6531492 -0.66378886  0.4723129  0.41655603
>> #Agrclt  0.3920289  1.00000000 -0.7150561 -0.65221506  0.4152007 -0.03648427
>> #Exmntn -0.6531492 -0.71505612  1.0000000  0.69921153 -0.6003402 -0.11433546
>> #Eductn -0.6637889 -0.65221506  0.6992115  1.00000000 -0.1791334 -0.09932185
>> #Cathlc  0.4723129  0.41520069 -0.6003402 -0.17913339  1.0000000  0.18503786
>> #Infn.M  0.4165560 -0.03648427 -0.1143355 -0.09932185  0.1850379  1.00000000
>> cor(swM[,1:2],use="pairwise.complete.obs")
>> #          Frtlty    Agrclt
>> #Frtlty 1.0000000 0.3920289
>> #Agrclt 0.3920289 1.0000000
>> 
>> A.K.
>> 
>> On Sunday, April 13, 2014 9:11 PM, Paul Tanger <paul.tanger at colostate.edu> wrote:
>> Hi,
>> I can't seem to figure out why this gives me different answers.  Probably
>> something obvious, but I thought they would be the same.
>> This is a minimal example from the help page of cor() :
>> 
>> > ## swM := "swiss" with  3 "missing"s :
>> > swM <- swiss
>> > colnames(swM) <- abbreviate(colnames(swiss), min=6)
>> > swM[1,2] <- swM[7,3] <- swM[25,5] <- NA # create 3 "missing"
>> > cor(swM, use = "na.or.complete")
>>           Frtlty      Agrclt     Exmntn      Eductn     Cathlc      Infn.M
>> Frtlty  1.0000000  0.37821953 -0.6548306 -0.67421581  0.4772298  0.38781500
>> Agrclt  0.3782195  1.00000000 -0.7127078 -0.64337782  0.4014837 -0.07168223
>> Exmntn -0.6548306 -0.71270778  1.0000000  0.69776906 -0.6079436 -0.10710047
>> Eductn -0.6742158 -0.64337782  0.6977691  1.00000000 -0.1701445 -0.08343279
>> Cathlc  0.4772298  0.40148365 -0.6079436 -0.17014449  1.0000000  0.17221594
>> Infn.M  0.3878150 -0.07168223 -0.1071005 -0.08343279  0.1722159  1.00000000
>> > # why isn't this the same?
>> > cor(swM[,c(1:2)], use = "na.or.complete")
>>          Frtlty    Agrclt
>> Frtlty 1.0000000 0.3920289
>> Agrclt 0.3920289 1.0000000
>> 
>> 
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
> 

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com



