[R] Correlation question

Joshua Wiley jwiley.psych at gmail.com
Thu Sep 9 20:59:33 CEST 2010


Hi Stephane,

According to the NEWS file, as of 2.11.0: "cor() and cov() now test
for misuse with non-numeric arguments, such as the non-bug report
PR#14207" so there is no need for a new bug report.
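For example, on 2.11.0 or later, handing cor() a data frame with a factor column stops with an error instead of silently producing NAs (a quick sketch; the exact error text may vary by version):

```r
# a data frame mixing numeric and factor columns
dat.mixed <- data.frame(x = 1:5, f = factor(letters[1:5]))

# on R >= 2.11.0 this is an immediate error, not a matrix of NAs;
# try() lets the script continue past it
try(cor(dat.mixed))
```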

Here is a simple way to select only numeric columns:

# Sample data
dat <- data.frame(a = 1:10L, b = runif(10), c = paste(1:10),
                  d = rep(TRUE, 10), e = factor(rep("a", 10)),
                  stringsAsFactors = FALSE)

# (this includes numeric and integer, btw)
dat[, sapply(dat, is.numeric)]
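An equivalent selection, if you prefer it (a sketch using the sample data `dat` above; Filter() treats a data frame as the list of its columns, so the result is still a data frame):

```r
# keep only the columns for which is.numeric() returns TRUE
Filter(is.numeric, dat)
```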

# if you wanted to include logicals (which cor() will work with)

class.test <- function(x) {
  # TRUE for numeric (including integer) or logical columns
  is.numeric(x) || is.logical(x)
}

# Columns that are numeric or logical
dat[, sapply(dat, class.test)]
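Putting it together on the sample data (continuing from above; note that cor() will still warn here, because the constant logical column d has zero standard deviation):

```r
# Spearman correlation matrix on just the numeric/logical columns
num.dat <- dat[, sapply(dat, class.test)]
cor(num.dat, method = "spearman")
```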

HTH,


Josh

On Thu, Sep 9, 2010 at 10:53 AM, Stephane Vaucher
<vauchers at iro.umontreal.ca> wrote:
> Hi Josh,
>
> Initially, I was expecting R to simply ignore non-numeric data. I guess I
> was wrong... I copy-pasted what I observe, and I do not get an error when
> calculating correlations with text data. I can also do cor(test.n$P3,
> test$P7) without an error.
>
> If you have a function to select only numeric columns that you can share
> with me (and the list), that would be great. Of course, I'm wondering why
> your version of R produces different results from mine. I don't know if I
> should open a bug report. It would be good if someone (other than me)
> observed this problem in their environment.
>
> Here is what I am currently using:
>
> R version 2.10.1 (2009-12-14)
> x86_64-pc-linux-gnu
>
> locale:
>  [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8
>  [5] LC_MONETARY=C              LC_MESSAGES=en_CA.UTF-8
>  [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> The behaviour has been observed on:
>> sessionInfo()
> Version 2.3.1 (2006-06-01)
> x86_64-redhat-linux-gnu
>
> attached base packages:
> [1] "methods"   "stats"     "graphics"  "grDevices" "utils"     "datasets"
> [7] "base"
>
> As well as on a 32-bit Linux machine running 2.9.0.
>
> Sincere regards,
> sv
>
> On Thu, 9 Sep 2010, Joshua Wiley wrote:
>
>> Hi Stephane,
>>
>> When I use your sample data (e.g., test, test.number), cor() throws an
>> error that x must be numeric (because of the factor or character
>> data).  Are you not getting any errors when trying to calculate the
>> correlation on these data?  If you are not, I wonder what version of R
>> are you using?  The quickest way to find out is sessionInfo().
>>
>> As for a workaround, it would be relatively simple to find out which
>> columns of your data frame were not numeric or integer and exclude
>> them (I'm happy to provide that code if you want).
>>
>> Best regards,
>>
>> Josh
>>
>> On Thu, Sep 9, 2010 at 7:50 AM, Stephane Vaucher
>> <vauchers at iro.umontreal.ca> wrote:
>>>
>>> Thank you Dennis,
>>>
>>> You identified a factor (text column) that I was concerned about. I
>>> simplified my example to try to isolate possible causes. I eliminated
>>> the recurring values in columns (which were not the columns that caused
>>> problems). I produced three examples with simple data sets.
>>>
>>> 1. Correct output, 2 columns only:
>>>
>>>> test.notext = read.csv('test-notext.csv')
>>>> cor(test.notext, method='spearman')
>>>
>>>               P3     HP_tot
>>> P3      1.0000000 -0.2182876
>>> HP_tot -0.2182876  1.0000000
>>>>
>>>> dput(test.notext)
>>>
>>> structure(list(P3 = c(2L, 2L, 2L, 4L, 2L, 3L, 2L, 1L, 3L, 2L,
>>> 2L, 2L, 3L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L),
>>>    HP_tot = c(10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 136L,
>>>    136L, 136L, 136L, 136L, 136L, 136L, 136L, 136L, 136L, 15L,
>>>    15L, 15L, 15L, 15L, 15L, 15L)), .Names = c("P3", "HP_tot"
>>> ), class = "data.frame", row.names = c(NA, -25L))
>>>
>>> 2. Incorrect output, where I introduced my P7 column containing only
>>> the character 'a':
>>>
>>>> test = read.csv('test.csv')
>>>> cor(test, method='spearman')
>>>
>>>               P3 P7     HP_tot
>>> P3      1.0000000 NA -0.2502878
>>> P7             NA  1         NA
>>> HP_tot -0.2502878 NA  1.0000000
>>> Warning message:
>>> In cor(test, method = "spearman") : the standard deviation is zero
>>>>
>>>> dput(test)
>>>
>>> structure(list(P3 = c(2L, 2L, 2L, 4L, 2L, 3L, 2L, 1L, 3L, 2L,
>>> 2L, 2L, 3L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L),
>>>    P7 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
>>>    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
>>>    ), .Label = "a", class = "factor"), HP_tot = c(10L, 10L,
>>>    10L, 10L, 10L, 10L, 10L, 10L, 136L, 136L, 136L, 136L, 136L,
>>>    136L, 136L, 136L, 136L, 136L, 15L, 15L, 15L, 15L, 15L, 15L,
>>>    15L)), .Names = c("P3", "P7", "HP_tot"), class = "data.frame",
>>> row.names = c(NA, -25L))
>>>
>>> 3. Incorrect output with P7 containing a variety of alphanumeric
>>> (ASCII) characters, to rule out the equal-valued-column issue. Notice
>>> that the text column is interpreted as a numeric value.
>>>
>>>> test.number = read.csv('test-alpha.csv')
>>>> cor(test.number, method='spearman')
>>>
>>>               P3         P7     HP_tot
>>> P3      1.0000000  0.4093108 -0.2502878
>>> P7      0.4093108  1.0000000 -0.3807193
>>> HP_tot -0.2502878 -0.3807193  1.0000000
>>>>
>>>> dput(test.number)
>>>
>>> structure(list(P3 = c(2L, 2L, 2L, 4L, 2L, 3L, 2L, 1L, 3L, 2L,
>>> 2L, 2L, 3L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L),
>>>    P7 = structure(c(11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L,
>>>    19L, 20L, 21L, 22L, 23L, 24L, 25L, 1L, 2L, 3L, 4L, 5L, 6L,
>>>    7L, 8L, 9L, 10L), .Label = c("0", "1", "2", "3", "4", "5",
>>>    "6", "7", "8", "9", "a", "b", "c", "d", "e", "f", "g", "h",
>>>    "i", "j", "k", "l", "m", "n", "o"), class = "factor"), HP_tot = c(10L,
>>>    10L, 10L, 10L, 10L, 10L, 10L, 10L, 136L, 136L, 136L, 136L,
>>>    136L, 136L, 136L, 136L, 136L, 136L, 15L, 15L, 15L, 15L, 15L,
>>>    15L, 15L)), .Names = c("P3", "P7", "HP_tot"), class = "data.frame",
>>> row.names = c(NA, -25L))
>>>
>>> Correct output is obtained by avoiding matrix computation of correlation:
>>>>
>>>> cor(test.number$P3, test.number$HP_tot, method='spearman')
>>>
>>> [1] -0.2182876
>>>
>>> It seems that a text column corrupts my correlation calculation (only
>>> in a matrix calculation). I assumed that text columns would not
>>> influence the result of the calculations.
>>>
>>> Is this correct behaviour? If not, should I submit a bug report? If it
>>> is, is there a known workaround?
>>>
>>> cheers,
>>> Stephane Vaucher
>>>
>>> On Thu, 9 Sep 2010, Dennis Murphy wrote:
>>>
>>>> Did you try taking out P7, which is text? Moreover, if you get a
>>>> message saying 'the standard deviation is zero', it means that an
>>>> entire column is constant. By definition, the covariance of a constant
>>>> with a random variable is 0, but correlation divides by the standard
>>>> deviations, which are zero for a constant column, so cor()
>>>> understandably warns that one or more of your columns are constant.
>>>> Applying the following to your data (which I named expd instead), we get
>>>>
>>>> sapply(expd[, -12], var)
>>>>           P1           P2           P3           P4           P5           P6
>>>> 5.433333e-01 1.083333e+00 5.766667e-01 1.083333e+00 6.433333e-01 5.566667e-01
>>>>           P8           P9          P10          P11          P12         SITE
>>>> 5.733333e-01 3.193333e+00 5.066667e-01 2.500000e-01 5.500000e+00 2.493333e+00
>>>>       Errors     warnings       Manual        Total        H_tot        HP1.1
>>>> 9.072840e+03 2.081334e+04 7.433333e-01 3.823500e+04 3.880250e+03 2.676667e+00
>>>>        HP1.2        HP1.3        HP1.4       HP_tot        HO1.1        HO1.2
>>>> 0.000000e+00 2.008440e+03 3.057067e+02 3.827250e+03 8.400000e-01 0.000000e+00
>>>>        HO1.3        HO1.4       HO_tot        HU1.1        HU1.2        HU1.3
>>>> 0.000000e+00 0.000000e+00 8.400000e-01 0.000000e+00 2.100000e-01 2.266667e-01
>>>>       HU_tot           HR        L_tot        LP1.1        LP1.2        LP1.3
>>>> 6.233333e-01 7.433333e-01 3.754610e+03 3.209333e+01 0.000000e+00 2.065010e+03
>>>>        LP1.4       LP_tot        LO1.1        LO1.2        LO1.3        LO1.4
>>>> 2.246233e+02 3.590040e+03 3.684000e+01 0.000000e+00 0.000000e+00 2.840000e+00
>>>>       LO_tot        LU1.1        LU1.2        LU1.3       LU_tot       LR_tot
>>>> 6.000000e+01 0.000000e+00 1.440000e+00 3.626667e+00 8.373333e+00 4.943333e+00
>>>>       SP_tot        SP1.1        SP1.2        SP1.3        SP1.4     SP_tot.1
>>>> 6.911067e+02 4.225000e+01 0.000000e+00 1.009600e+02 4.161600e+02 3.071600e+02
>>>>        SO1.1        SO1.2        SO1.3        SO1.4       SO_tot        SU1.1
>>>> 4.543333e+00 2.500000e-01 0.000000e+00 2.100000e-01 5.250000e+00 0.000000e+00
>>>>        SU1.2        SU1.3       SU_tot           SR
>>>> 1.556667e+00 4.225000e+01 3.504000e+01 4.225000e+01
>>>>
>>>> Which columns are constant?
>>>> which(sapply(expd[, -12], var) < .Machine$double.eps)
>>>> HP1.2 HO1.2 HO1.3 HO1.4 HU1.1 LP1.2 LO1.2 LO1.3 LU1.1 SP1.2 SO1.3 SU1.1
>>>>  19    24    25    26    28    35    40    41    44    51    57    60
>>>>
>>>> I suspect that in your real data set there aren't so many constant
>>>> columns, but this is one way to check.
>>>>
>>>> HTH,
>>>> Dennis
>>>>
>>>> On Wed, Sep 8, 2010 at 12:35 PM, Stephane Vaucher
>>>> <vauchers at iro.umontreal.ca> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> I'm observing what I believe is weird behaviour when attempting to
>>>>> do something very simple. I want a correlation matrix, but my matrix
>>>>> seems to contain correlation values that are not found when executed
>>>>> on pairs:
>>>>>
>>>>>> test2$P2
>>>>>  [1] 2 2 4 4 1 3 2 4 3 3 2 3 4 1 2 2 4 3 4 1 2 3 2 1 3
>>>>>> test2$HP_tot
>>>>>  [1]  10  10  10  10  10  10  10  10 136 136 136 136 136 136 136 136 136 136  15
>>>>> [20]  15  15  15  15  15  15
>>>>>> c=cor(test2$P3,test2$HP_tot,method='spearman')
>>>>>> c
>>>>> [1] -0.2182876
>>>>>> c=cor(test2,method='spearman')
>>>>> Warning message:
>>>>> In cor(test2, method = "spearman") : the standard deviation is zero
>>>>>> write(c,file='out.csv')
>>>>>
>>>>> From my spreadsheet:
>>>>> -0.25028783918741
>>>>>
>>>>> Most cells are correct, but not that one.
>>>>>
>>>>> If this is expected behaviour, I apologise for bothering you; I read
>>>>> the documentation, but I do not know whether the matrix and pairwise
>>>>> calculations are done using the same function (e.g., with respect to
>>>>> equal-valued observations).
>>>>>
>>>>> If this is not a desired behaviour, I noticed that it only occurs
>>>>> with a relatively large matrix (I couldn't reproduce it on a simple
>>>>> 2-column data set). There might be a naming error.
>>>>>
>>>>>> names(test2)
>>>>>  [1] "ID"                   "NOMBRE"               "MAIL"
>>>>>  [4] "Age"                  "SEXO"                 "Studies"
>>>>>  [7] "Hours_Internet"       "Vision.Disabilities"  "Other.disabilities"
>>>>> [10] "Technology_Knowledge" "Start_Time"           "End_Time"
>>>>> [13] "Duration"             "P1"                   "P1Book"
>>>>> [16] "P1DVD"                "P2"                   "P3"
>>>>> [19] "P4"                   "P5"                   "P6"
>>>>> [22] "P8"                   "P9"                   "P10"
>>>>> [25] "P11"                  "P12"                  "P7"
>>>>> [28] "SITE"                 "Errors"               "warnings"
>>>>> [31] "Manual"               "Total"                "H_tot"
>>>>> [34] "HP1.1"                "HP1.2"                "HP1.3"
>>>>> [37] "HP1.4"                "HP_tot"               "HO1.1"
>>>>> [40] "HO1.2"                "HO1.3"                "HO1.4"
>>>>> [43] "HO_tot"               "HU1.1"                "HU1.2"
>>>>> [46] "HU1.3"                "HU_tot"               "HR"
>>>>> [49] "L_tot"                "LP1.1"                "LP1.2"
>>>>> [52] "LP1.3"                "LP1.4"                "LP_tot"
>>>>> [55] "LO1.1"                "LO1.2"                "LO1.3"
>>>>> [58] "LO1.4"                "LO_tot"               "LU1.1"
>>>>> [61] "LU1.2"                "LU1.3"                "LU_tot"
>>>>> [64] "LR_tot"               "SP_tot"               "SP1.1"
>>>>> [67] "SP1.2"                "SP1.3"                "SP1.4"
>>>>> [70] "SP_tot.1"             "SO1.1"                "SO1.2"
>>>>> [73] "SO1.3"                "SO1.4"                "SO_tot"
>>>>> [76] "SU1.1"                "SU1.2"                "SU1.3"
>>>>> [79] "SU_tot"               "SR"
>>>>>
>>>>> Thank you in advance,
>>>>> Stephane Vaucher
>>>>>
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>>
>>>



-- 
Joshua Wiley
Ph.D. Student, Health Psychology
University of California, Los Angeles
http://www.joshuawiley.com/


