[R] biserial correlation with pkg polycor

John Fox jfox at mcmaster.ca
Mon Oct 29 16:57:52 CET 2007


Dear Tom,

I'm afraid that you're confusing the point-biserial correlation with the
biserial correlation. It's the latter that polyserial() in the polycor
package computes; the biserial correlation is a special case when the
(ordered) categorical variable has just two categories. 

Generally, biserial correlations will be larger than point-serial
correlations, since the former estimates the correlation in an underlying
bivariate normal distribution that has been binned. I haven't looked closely
at your example, but suspect that the difference in sign is just due to the
arbitrary specification of which category is higher than the other. As for a
reference, ?polyserial gives one.

I hope this helps,
 John

--------------------------------
John Fox, Professor
Department of Sociology
McMaster University
Hamilton, Ontario
Canada L8S 4M4
905-525-9140x23604
http://socserv.mcmaster.ca/jfox 
-------------------------------- 

> -----Original Message-----
> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On Behalf Of Tom Willems
> Sent: Monday, October 29, 2007 11:01 AM
> To: r-help at r-project.org
> Subject: [R] biserial correlation with pkg polycor
> 
> Dear R-ussers,
> 
> While looking for a way to calculate the association between 
> a countinuous and a binary variable, i found a procedure 
> called point biserial corralation.
> Me, not being a mathematicion, i did my very best to 
> understand what it was all about, and then i found a easily 
> understandable paper (by steve
> simon) on ow to calculate this. ref ##
> http://www.childrens-mercy.org/stats/definitions/biserial.htm 
> (this page has the same example) Further i discovered the 
> polycor package in R.
> Now i'm having troubles with the fact that the polycor pkg 
> never gives me the same output as the manuals aplication of 
> the formula.
> 
> In the example below found, manualy  r(biserial) = 0.49 
> between fb an age, and ussing function polyserial (polycor 
> pkg)  r(biserial) =-0.8591.
> This is a rather big difference, no due to abriviation or 
> flootingpoints.
> 
> Is there someone whom is familiar with biserial correlation, 
> and the appropriate way to calculate it?
> 
> Kind regards,
> Tom.
> 
> here is the example, at the end is the R file. 
> 
> 1e I create the input
> 
> > library(abind)
> > library (polycor)
> > ### data input
> > no <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8) 
> fb <- c(19, 
> > 30, 20, 19, 29, 25, 21, 24, 50, 25, 21, 17, 15, 14, 14, 22,
> 17) 
> > ss <- c(14, 41, 18, 11, 16, 24, 18, 21, 37, 17, 10, 16, 22, 12, 14, 
> > 12,
> 18)
> > age <-c('elderly', 'elderly', 'elderly', 'elderly', 'elderly',
> 'elderly', 'elderly', 'elderly', 'elderly', 'young', 'young', 
> 'young', 'young', 'young', 'young', 'young', 'young') 
> > 
> > dataset <- data.frame(no,fb,ss,age)
> > dataset <- subset(dataset,select=c(fb:age))
> > nrow(dataset)
> [1] 17
> > data_eld <- subset(dataset,age=='elderly',select=fb)
> > data_young <- subset(dataset,age=='young',select=fb)
> > 
> 
> here i calculate the R_bis (biserial corelation) manualy
> 
> > ### point biserial correlation
> > 
> >  fb <- subset(dataset,select=fb)
> >  fb0 <- subset(dataset,age!='elderly',select=fb)
> >  fb1 <- subset(dataset,age=='elderly',select=fb)
> >  meanfb0 <- mean(fb0,na.rm=T)
> >  meanfb1 <- mean(fb1,na.rm=T)
> >  sdfb<- sd(dataset$fb,na.rm=T)
> > 
> >  ss <- subset(dataset, select=ss)
> >  ss0 <- subset(dataset,age!='elderly',select=ss)
> >  ss1 <- subset(dataset,age=='elderly',select=ss)
> >  meanss0 <- mean(ss0,na.rm=T)
> >  meanss1 <- mean(ss1,na.rm=T)
> >  sdss<- sd(dataset$ss,na.rm=T)
> >  age <- subset(dataset,select=age) 
> >   n <- nrow(dataset)
> > 
> this is the formula from  ref ##
> http://www.childrens-mercy.org/stats/definitions/biserial.htm
> 
> > > R_bis <- function(x,x1,x0,n){  p <- (nrow(x1)/n)
> >+        (((mean(x1,na.rm=T)-mean(x0,na.rm=T))/sd(x,na.rm=T))
> *sqrt(p*(1-p)))  } 
> 
> this is the corrected formula from  ref ## 
> http://en.wikipedia.org/wiki/Point-biserial_correlation_coefficient
> >> R_bis2 <- function(x,x1,x0,n){ 
> +        ((mean(x1,na.rm=T)-mean(x0,na.rm=T))/sd(x,na.rm=T))  
> * (sqrt( 
> (nrow(x1)*nrow(x0))/(n*(n-1))))}
> > 
> >>  R_bis(fb,fb1,fb0,n) 
> >       fb 
> >0.4798873 
> 
> result in paper was 0.49 
>  
> >>   R_bis2(fb,fb1,fb0,n)
> >       fb 
> >0.4946565 
> 
> equals result in paper  0.49 
> 
> Then i use the polycor package,
> function hetcor will give all the different correlation ressults
> 
>  
> >> hetcor(dataset$fb,dataset$ss,dataset$age ,ML=TRUE)
> >
> >Maximum-Likelihood Estimates
> >
> >Correlations/Type of Correlation:
>  >           dataset$fb dataset.ss dataset.age
> >dataset$fb           1    Pearson  Polyserial
> >dataset.ss       0.703          1  Polyserial
> >dataset.age    -0.8591    -0.6685           1
> >
> >Standard Errors:
> >           dataset$fb dataset.ss
> >dataset$fb 
> >dataset.ss      0.1215 
> >dataset.age     0.1106     0.2497
> >
> >n = 17 
> >
> >P-values for Tests of Bivariate Normality:
> >            dataset$fb dataset.ss
> >dataset$fb 
> >dataset.ss      0.1782 
> >dataset.age     0.4269     0.4034
> > hetcor(dataset,ML=TRUE)
> >
> >Maximum-Likelihood Estimates
> >
> >Correlations/Type of Correlation:
>  >        fb      ss        age
> >fb        1 Pearson Polyserial
> >ss    0.703       1 Polyserial
> >age -0.8591 -0.6685          1
> >
> >Standard Errors:
>  >       fb     ss
> >fb 
> >ss  0.1215 
> >age 0.1106 0.2497
> >
> >n = 17 
> >
> >P-values for Tests of Bivariate Normality:
> >       fb     ss
> >fb 
> >ss  0.1782 
> >age 0.4269 0.4034
> 
> here a quick two step method is ussed to calculate the polyserial 
> correlation
> 
> > >polyserial(dataset$fb,dataset$age)
> >[1] -0.6205737
> >> polyserial(dataset$fb,dataset$age, ML=TRUE, std.err=TRUE) 
> 
> same method  as in hetcor, only for indecated variables
> 
> >Polyserial Correlation, ML est. = -0.8591 (0.1106)
> >Test of bivariate normality: Chisquare = 4.91, df = 5, p = 0.4269
> >
> >               1
> >Threshold 0.1811
> >Std.Err.  0.1849
> > 
> 
> 
> > ### for side to side (ss)   incase no 9 is an outlier in 
> fb, this will 
> not be the case in ss 
> > 
> >> R_bis(ss,ss1,ss0,n)
> >       ss 
> >0.4153681 
> 
> result in paper was 0.43 
> 
> >> R_bis2(ss,ss1,ss0,n)
> >       ss 
> >0.4281516 
> 
> equals result in paper  0.43
> 
> >> polyserial(dataset$ss,dataset$age) 
> >[1] -0.5371397
> 
> >> polyserial(dataset$ss,dataset$age, ML=TRUE, std.err=TRUE)
> >
> >Polyserial Correlation, ML est. = -0.6685 (0.2497)
> >Test of bivariate normality: Chisquare = 5.103, df = 5, p = 0.4034
> >
> >               1
> >Threshold 0.1504
> >Std.Err.  0.2583
> ##############################################################
> ###################
> 
> ### R-file
> ##############################################################
> ###################
> 
> library(abind) 
> library (polycor)
> ### data input
> no <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8)
> fb <- c(19, 30, 20, 19, 29, 25, 21, 24, 50, 25, 21, 17, 15, 
> 14, 14, 22, 
> 17) 
> ss <- c(14, 41, 18, 11, 16, 24, 18, 21, 37, 17, 10, 16, 22, 
> 12, 14, 12, 
> 18)
> age <-c('elderly', 'elderly', 'elderly', 'elderly', 
> 'elderly', 'elderly', 
> 'elderly', 'elderly', 'elderly', 'young', 'young', 'young', 'young', 
> 'young', 'young', 'young', 'young') 
> 
> dataset <- data.frame(no,fb,ss,age)
> dataset <- subset(dataset,select=c(fb:age))
> nrow(dataset) 
> data_eld <- subset(dataset,age=='elderly',select=fb)
> data_young <- subset(dataset,age=='young',select=fb)
> 
> ### point biserial correlation
> 
>  fb <- subset(dataset,select=fb)
>  fb0 <- subset(dataset,age!='elderly',select=fb)
>  fb1 <- subset(dataset,age=='elderly',select=fb)
>  meanfb0 <- mean(fb0,na.rm=T)
>  meanfb1 <- mean(fb1,na.rm=T)
>  sdfb<- sd(dataset$fb,na.rm=T)
> 
>  ss <- subset(dataset, select=ss)
>  ss0 <- subset(dataset,age!='elderly',select=ss)
>  ss1 <- subset(dataset,age=='elderly',select=ss)
>  meanss0 <- mean(ss0,na.rm=T)
>  meanss1 <- mean(ss1,na.rm=T)
>  sdss<- sd(dataset$ss,na.rm=T) 
>  age <- subset(dataset,select=age) 
>   n <- nrow(dataset)
>  
>  R_bis <- function(x,x1,x0,n){  p <- (nrow(x1)/n)
>        (((mean(x1,na.rm=T)-mean(x0,na.rm=T))/sd(x,na.rm=T)) 
> *sqrt(p*(1-p)))  }
> R_bis2 <- function(x,x1,x0,n){ 
>        ((mean(x1,na.rm=T)-mean(x0,na.rm=T))/sd(x,na.rm=T))  * (sqrt( 
> (nrow(x1)*nrow(x0))/(n*(n-1))))}
>  
>  R_bis(fb,fb1,fb0,n)
>   R_bis2(fb,fb1,fb0,n)
> 
>  
> hetcor(dataset$fb,dataset$ss,dataset$age ,ML=TRUE)
> hetcor(dataset,ML=TRUE)
> 
> polyserial(dataset$fb,dataset$age)
> polyserial(dataset$fb,dataset$age, ML=TRUE, std.err=TRUE) 
> 
> ### for side to side (ss)   incase no 9 is an outlier in fb, 
> this will not 
> be the case in ss 
> 
> R_bis(ss,ss1,ss0,n)
> R_bis2(ss,ss1,ss0,n)
> 
> polyserial(dataset$ss,dataset$age) 
> polyserial(dataset$ss,dataset$age, ML=TRUE, std.err=TRUE)
> 
> ## END 
> ##############################################################
> ############
> 
> Willems Tom
> 
> E-mail: towil at var.fgov.be
> 
> 
> 
> 
> Disclaimer: click here
> 	[[alternative HTML version deleted]]
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



More information about the R-help mailing list