[R] Correlation question

Stephane Vaucher vauchers at iro.umontreal.ca
Thu Sep 9 16:50:03 CEST 2010


Thank you Dennis,

You identified a factor (text column) that I was concerned with.
I simplified my example to try to rule out other possible causes. I
eliminated the columns of recurring values (which were not the columns
that caused problems) and produced three examples with simple data sets.

1. Correct output, 2 columns only:

> test.notext = read.csv('test-notext.csv')
> cor(test.notext, method='spearman')
                P3     HP_tot
P3      1.0000000 -0.2182876
HP_tot -0.2182876  1.0000000
> dput(test.notext)
structure(list(P3 = c(2L, 2L, 2L, 4L, 2L, 3L, 2L, 1L, 3L, 2L,
2L, 2L, 3L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L),
     HP_tot = c(10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 136L,
     136L, 136L, 136L, 136L, 136L, 136L, 136L, 136L, 136L, 15L,
     15L, 15L, 15L, 15L, 15L, 15L)), .Names = c("P3", "HP_tot"
), class = "data.frame", row.names = c(NA, -25L))

2. Incorrect output after I introduced my P7 column, a text column that
contains only the character 'a':

> test = read.csv('test.csv')
> cor(test, method='spearman')
                P3 P7     HP_tot
P3      1.0000000 NA -0.2502878
P7             NA  1         NA
HP_tot -0.2502878 NA  1.0000000
Warning message:
In cor(test, method = "spearman") : the standard deviation is zero
> dput(test)
structure(list(P3 = c(2L, 2L, 2L, 4L, 2L, 3L, 2L, 1L, 3L, 2L,
2L, 2L, 3L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L),
     P7 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
     1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
     ), .Label = "a", class = "factor"), HP_tot = c(10L, 10L,
     10L, 10L, 10L, 10L, 10L, 10L, 136L, 136L, 136L, 136L, 136L,
     136L, 136L, 136L, 136L, 136L, 15L, 15L, 15L, 15L, 15L, 15L,
     15L)), .Names = c("P3", "P7", "HP_tot"), class = "data.frame",
     row.names = c(NA, -25L))
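
The NA entries and the warning in example 2 are what a constant column
produces on its own: every value of P7 gets the same rank, so its
standard deviation is zero and the correlation is undefined. A minimal
sketch with made-up vectors (x and constant are hypothetical names, not
columns from the data above):

```r
# Toy vectors for illustration only (not taken from test.csv)
x <- c(2, 4, 1, 3)
constant <- rep(1L, 4)

# A constant vector has zero standard deviation ...
sd(constant)

# ... so its correlation with anything is undefined; cor() returns NA
# and warns "the standard deviation is zero" (suppressed here)
r <- suppressWarnings(cor(x, constant, method = "spearman"))
is.na(r)
```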

3. Incorrect output with P7 containing a variety of alphanumeric
(ASCII) characters, to rule out the constant-column issue. Notice that
the text column is now treated as if it held numeric values.

> test.number = read.csv('test-alpha.csv')
> cor(test.number, method='spearman')
                P3         P7     HP_tot
P3      1.0000000  0.4093108 -0.2502878
P7      0.4093108  1.0000000 -0.3807193
HP_tot -0.2502878 -0.3807193  1.0000000
> dput(test.number)
structure(list(P3 = c(2L, 2L, 2L, 4L, 2L, 3L, 2L, 1L, 3L, 2L,
2L, 2L, 3L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L),
     P7 = structure(c(11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L,
     19L, 20L, 21L, 22L, 23L, 24L, 25L, 1L, 2L, 3L, 4L, 5L, 6L,
     7L, 8L, 9L, 10L), .Label = c("0", "1", "2", "3", "4", "5",
     "6", "7", "8", "9", "a", "b", "c", "d", "e", "f", "g", "h",
     "i", "j", "k", "l", "m", "n", "o"), class = "factor"), HP_tot = c(10L,
     10L, 10L, 10L, 10L, 10L, 10L, 10L, 136L, 136L, 136L, 136L,
     136L, 136L, 136L, 136L, 136L, 136L, 15L, 15L, 15L, 15L, 15L,
     15L, 15L)), .Names = c("P3", "P7", "HP_tot"), class = "data.frame",
     row.names = c(NA, -25L))
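
The numbers that stand in for a text column can be inspected directly.
A factor stores one integer code per level, and as.integer() or
data.matrix() expose those codes. The sketch below uses a made-up
three-row data frame (codes, df and m are illustrative names); it shows
what the codes are, not how cor() converted the data frame internally:

```r
# Hypothetical factor with explicitly ordered levels
codes <- factor(c("a", "b", "0"), levels = c("0", "a", "b"))

# The integer codes are positions in the level set, not the text itself
as.integer(codes)

# data.matrix() replaces factor columns by these internal codes
df <- data.frame(n = c(10L, 136L, 15L), f = codes)
m <- data.matrix(df)
m
```

Any correlation computed on such codes depends on the ordering of the
levels, so it says nothing meaningful about unordered text.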

The correct value is obtained by computing the correlation on the two
columns directly instead of on the whole data frame:
> cor(test.number$P3, test.number$HP_tot, method='spearman')
[1] -0.2182876

It seems that a text column corrupts my correlation calculation, but
only when the correlation is computed on the whole data frame. I had
assumed that text columns would not influence the results for the
other columns.
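
One observation that may help narrow this down: the value -0.2502878 is
exactly what comes out when HP_tot is ranked as character strings rather
than as numbers ("10" < "136" < "15" as text, whereas 10 < 15 < 136
numerically). The sketch below reproduces both figures from the P3 and
HP_tot values listed above. It is only a consistency check on the
numbers; whether cor() really turns the data frame into a character
matrix in this R version is an assumption that would need to be checked
against its source.

```r
# P3 and HP_tot as given in the dput() output above
P3 <- c(2, 2, 2, 4, 2, 3, 2, 1, 3, 2, 2, 2, 3, 1, 2, 1, 1, 1, 2, 1,
        2, 2, 2, 1, 2)
HP_tot <- rep(c(10, 136, 15), times = c(8, 10, 7))

# Spearman correlation = Pearson correlation of the ranks.
# Ranks taken on the numeric values:
r_numeric <- cor(rank(P3), rank(HP_tot))

# Ranks taken after converting the same values to text:
r_text <- cor(rank(P3), rank(as.character(HP_tot)))

# Compare with the -0.2182876 and -0.2502878 reported above
round(c(numeric = r_numeric, text = r_text), 7)
```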

Is this the correct behaviour? If not, should I submit a bug report? If
it is, is there a known workaround?
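
One possible workaround, sketched under the assumption that only the
numeric columns are of interest, is to drop the non-numeric columns
before calling cor(). The data frame is rebuilt here from the dput()
output of example 3 so that the snippet is self-contained; lv,
numeric_only, m and pairwise are illustrative names.

```r
# Rebuild example 3: P7 is a factor with levels "0".."9", "a".."o"
lv <- c(as.character(0:9), letters[1:15])
test.number <- data.frame(
  P3 = c(2L, 2L, 2L, 4L, 2L, 3L, 2L, 1L, 3L, 2L, 2L, 2L, 3L, 1L, 2L,
         1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L),
  P7 = factor(lv[c(11:25, 1:10)], levels = lv),
  HP_tot = rep(c(10L, 136L, 15L), times = c(8, 10, 7))
)

# Keep only the columns that are genuinely numeric
numeric_only <- test.number[sapply(test.number, is.numeric)]

# Correlation matrix on the numeric columns only
m <- cor(numeric_only, method = "spearman")

# The matrix entry now agrees with the direct pairwise computation
pairwise <- cor(test.number$P3, test.number$HP_tot, method = "spearman")
all.equal(m["P3", "HP_tot"], pairwise)
```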

cheers,
Stephane Vaucher

On Thu, 9 Sep 2010, Dennis Murphy wrote:

> Did you try taking out P7, which is text? Moreover, if you get a message
> saying ' the standard deviation is zero', it means that the entire column is
> constant. By definition, the covariance of a constant with a random variable
> is 0, but your data consists of values, so cor() understandably throws a
> warning that one or more of your columns are constant. Applying the
> following to your data (which I named expd instead),  we get
>
> sapply(expd[, -12], var)
>           P1           P2           P3           P4           P5           P6
> 5.433333e-01 1.083333e+00 5.766667e-01 1.083333e+00 6.433333e-01 5.566667e-01
>           P8           P9          P10          P11          P12         SITE
> 5.733333e-01 3.193333e+00 5.066667e-01 2.500000e-01 5.500000e+00 2.493333e+00
>       Errors     warnings       Manual        Total        H_tot        HP1.1
> 9.072840e+03 2.081334e+04 7.433333e-01 3.823500e+04 3.880250e+03 2.676667e+00
>        HP1.2        HP1.3        HP1.4       HP_tot        HO1.1        HO1.2
> 0.000000e+00 2.008440e+03 3.057067e+02 3.827250e+03 8.400000e-01 0.000000e+00
>        HO1.3        HO1.4       HO_tot        HU1.1        HU1.2        HU1.3
> 0.000000e+00 0.000000e+00 8.400000e-01 0.000000e+00 2.100000e-01 2.266667e-01
>       HU_tot           HR        L_tot        LP1.1        LP1.2        LP1.3
> 6.233333e-01 7.433333e-01 3.754610e+03 3.209333e+01 0.000000e+00 2.065010e+03
>        LP1.4       LP_tot        LO1.1        LO1.2        LO1.3        LO1.4
> 2.246233e+02 3.590040e+03 3.684000e+01 0.000000e+00 0.000000e+00 2.840000e+00
>       LO_tot        LU1.1        LU1.2        LU1.3       LU_tot       LR_tot
> 6.000000e+01 0.000000e+00 1.440000e+00 3.626667e+00 8.373333e+00 4.943333e+00
>       SP_tot        SP1.1        SP1.2        SP1.3        SP1.4     SP_tot.1
> 6.911067e+02 4.225000e+01 0.000000e+00 1.009600e+02 4.161600e+02 3.071600e+02
>        SO1.1        SO1.2        SO1.3        SO1.4       SO_tot        SU1.1
> 4.543333e+00 2.500000e-01 0.000000e+00 2.100000e-01 5.250000e+00 0.000000e+00
>        SU1.2        SU1.3       SU_tot           SR
> 1.556667e+00 4.225000e+01 3.504000e+01 4.225000e+01
>
> Which columns are constant?
> which(sapply(expd[, -12], var) < .Machine$double.eps)
> HP1.2 HO1.2 HO1.3 HO1.4 HU1.1 LP1.2 LO1.2 LO1.3 LU1.1 SP1.2 SO1.3 SU1.1
>   19    24    25    26    28    35    40    41    44    51    57    60
>
> I suspect that in your real data set, there aren't so many constant columns,
> but this is one way to check.
>
> HTH,
> Dennis
>
> On Wed, Sep 8, 2010 at 12:35 PM, Stephane Vaucher
> <vauchers at iro.umontreal.ca> wrote:
>
>> Hi everyone,
>>
>> I'm observing what I believe is weird behaviour when attempting to do
>> something very simple. I want a correlation matrix, but my matrix seems to
>> contain correlation values that are not found when executed on pairs:
>>
>> > test2$P2
>>  [1] 2 2 4 4 1 3 2 4 3 3 2 3 4 1 2 2 4 3 4 1 2 3 2 1 3
>> > test2$HP_tot
>>  [1]  10  10  10  10  10  10  10  10 136 136 136 136 136 136 136 136 136 136  15
>> [20]  15  15  15  15  15  15
>> > c=cor(test2$P3,test2$HP_tot,method='spearman')
>> > c
>> [1] -0.2182876
>> > c=cor(test2,method='spearman')
>> Warning message:
>> In cor(test2, method = "spearman") : the standard deviation is zero
>> > write(c,file='out.csv')
>>
>> from my spreadsheet
>> -0.25028783918741
>>
>> Most cells are correct, but not that one.
>>
>> If this is expected behaviour, I apologise for bothering you, I read the
>> documentation, but I do not know if the calculation of matrices and pairs is
>> done using the same function (eg, with respect to equal value observations).
>>
>> If this is not a desired behaviour, I noticed that it only occurs with a
>> relatively large matrix (I couldn't reproduce on a simple 2 column data
>> set). There might be a naming error.
>>
>>  names(test2)
>>>
>>  [1] "ID"                   "NOMBRE"               "MAIL"
>>  [4] "Age"                  "SEXO"                 "Studies"
>>  [7] "Hours_Internet"       "Vision.Disabilities"  "Other.disabilities"
>> [10] "Technology_Knowledge" "Start_Time"           "End_Time"
>> [13] "Duration"             "P1"                   "P1Book"
>> [16] "P1DVD"                "P2"                   "P3"
>> [19] "P4"                   "P5"                   "P6"
>> [22] "P8"                   "P9"                   "P10"
>> [25] "P11"                  "P12"                  "P7"
>> [28] "SITE"                 "Errors"               "warnings"
>> [31] "Manual"               "Total"                "H_tot"
>> [34] "HP1.1"                "HP1.2"                "HP1.3"
>> [37] "HP1.4"                "HP_tot"               "HO1.1"
>> [40] "HO1.2"                "HO1.3"                "HO1.4"
>> [43] "HO_tot"               "HU1.1"                "HU1.2"
>> [46] "HU1.3"                "HU_tot"               "HR"
>> [49] "L_tot"                "LP1.1"                "LP1.2"
>> [52] "LP1.3"                "LP1.4"                "LP_tot"
>> [55] "LO1.1"                "LO1.2"                "LO1.3"
>> [58] "LO1.4"                "LO_tot"               "LU1.1"
>> [61] "LU1.2"                "LU1.3"                "LU_tot"
>> [64] "LR_tot"               "SP_tot"               "SP1.1"
>> [67] "SP1.2"                "SP1.3"                "SP1.4"
>> [70] "SP_tot.1"             "SO1.1"                "SO1.2"
>> [73] "SO1.3"                "SO1.4"                "SO_tot"
>> [76] "SU1.1"                "SU1.2"                "SU1.3"
>> [79] "SU_tot"               "SR"
>>
>> Thank you in advance,
>> Stephane Vaucher
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>


