[R] Problem with Poisson - Chi Square Goodness of Fit Test - New Mail

Madhavi Bhave madhavi_bhave at yahoo.com
Fri Aug 29 12:02:42 CEST 2008





Dear R-help,

 

 

Chi Square Test for Goodness of Fit

 

I have discrete data, given below as an R vector:

 

No_of_Frauds<-c(1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,2,1,2,2,2,1,1,2,1,1,1,1,4,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,5,1,2,1,1,1,1,1,1,1,3,2,1,1,1,2,1,1,2,1,1,1,1,1,2,1,3,1,2,1,2,14,2,1,1,38,3,3,2,44,1,4,1,4,1,2,2,1,3)

 

I am trying to fit a Poisson distribution to this data using R.

My R script is as follows:

 

________________________________________________________

 

# R script for fitting a Poisson distribution

 

No_of_Frauds<-c(1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,2,1,2,2,2,1,1,2,1,1,1,1,4,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,5,1,2,1,1,1,1,1,1,1,3,2,1,1,1,2,1,1,2,1,1,1,1,1,2,1,3,1,2,1,2,14,2,1,1,38,3,3,2,44,1,4,1,4,1,2,2,1,3)

 

N       <- length(No_of_Frauds)
Average <- mean(No_of_Frauds)
Lambda  <- Average
i       <- 0:(N - 1)
pmf     <- dpois(i, Lambda, log = FALSE)

 

# ----------------------------------------------------------------------------

# Ho: The data follow a Poisson distribution vs. H1: not Ho

 

# observed frequencies (Oi)

 

variable.cnts     <- table(No_of_Frauds)

# Poisson probability of each distinct observed value
variable.cnts.prs <- dpois(as.numeric(names(variable.cnts)), Lambda)

# Extra class with zero count covering all values that were not observed
variable.cnts     <- c(variable.cnts, 0)
variable.cnts.prs <- c(variable.cnts.prs, 1 - sum(variable.cnts.prs))

tst <- chisq.test(variable.cnts, p = variable.cnts.prs)

 

chi_squared <- as.numeric(unclass(tst)$statistic)
p_value     <- as.numeric(unclass(tst)$p.value)
df          <- tst[2]$parameter

# Critical values at the 1%, 5% and 10% levels of significance
cv1 <- qchisq(p = .01, df = tst[2]$parameter, lower.tail = FALSE, log.p = FALSE)
cv2 <- qchisq(p = .05, df = tst[2]$parameter, lower.tail = FALSE, log.p = FALSE)
cv3 <- qchisq(p = .1, df = tst[2]$parameter, lower.tail = FALSE, log.p = FALSE)

 

#-----------------------------------------------------------------------------

 

# Expected value

 

# variable.cnts.prs * sum(variable.cnts)

# If chi_squared > cv, reject Ho at the alpha level of significance

 

#-----------------------------------------------------------------------------

 

if (chi_squared > cv1) {
  Conclusion1 <- 'Sample does not come from the postulated probability distribution at 1% los'
} else {
  Conclusion1 <- 'Sample comes from postulated prob. distribution at 1% los'
}

if (chi_squared > cv2) {
  Conclusion2 <- 'Sample does not come from the postulated probability distribution at 5% los'
} else {
  Conclusion2 <- 'Sample comes from postulated prob. distribution at 5% los'
}

if (chi_squared > cv3) {
  Conclusion3 <- 'Sample does not come from the postulated probability distribution at 10% los'
} else {
  Conclusion3 <- 'Sample comes from postulated prob. distribution at 10% los'
}

 

#-----------------------------------------------------------------------------

 

# Printing RESULTS 

 

print(chi_squared)
print(p_value)
print(df)
print(cv1)
print(cv2)
print(cv3)
print(Conclusion1)
print(Conclusion2)
print(Conclusion3)

 

 

##### End of R Script ########

 

________________________________________________________

 

Problem faced:

When I run this script in the R console, I get a chi-square statistic as high as 6.95753e+37.

When I did the same calculations in Excel, I got a chi-square statistic of 138.34.


 

Although it is clear that the sample data do not follow a Poisson distribution, and I will have to look for another discrete distribution, my problem is the very high value of the chi-square test statistic. When I analyzed further, I understood the cause.
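As an illustration of how a statistic of this size can arise, the sketch below computes the contribution (Oi - Ei)^2 / Ei of every distinct observed value when no classes are pooled. It is a simplified check rather than a reproduction of the script above: the vector x is rebuilt from the value counts of No_of_Frauds (same values, different order), the names x, lambda, values, expected and contribution are illustrative, and the zero-count remainder class is left out, so the total will not match the reported figure exactly.

```r
# Same values as No_of_Frauds, rebuilt from its value counts (order is irrelevant here)
x <- c(rep(1, 72), rep(2, 17), rep(3, 5), rep(4, 3), 5, 14, 38, 44)

n      <- length(x)   # number of observations
lambda <- mean(x)     # Poisson rate estimated by the sample mean

# One class per distinct observed value, with no pooling
observed <- table(x)
values   <- as.numeric(names(observed))
expected <- n * dpois(values, lambda)

# Contribution of each class to the chi-square statistic
contribution <- (as.vector(observed) - expected)^2 / expected

print(data.frame(value        = values,
                 observed     = as.vector(observed),
                 expected     = signif(expected, 3),
                 contribution = signif(contribution, 3)))
```

The classes for the isolated large values (14, 38 and 44) have expected frequencies that are practically zero, so dividing by them dominates the whole sum.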

 

(A) By convention, if the expected frequency of a class is less than 5, we merge such classes into a new class whose expected frequency is at least 5, and adjust the observed frequencies accordingly.

 





  
  X       Oi     Ei    ((Oi - Ei)^2)/Ei
  0        0     10        9.96
  1       72     23      103.79
  2       17     27        3.54
  3        5     21       11.85
  4        3     12        6.71
  5        4      9        2.51
  Total  101    101      138.34
  





 

 

When I apply this logic in Excel, I get a reasonable result (138.34). However, if I do not apply it, Excel too gives a chi-square test statistic as high as 4.70043E+37.
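The pooled calculation described in (A) can be sketched in R roughly as follows. This is a minimal sketch under stated assumptions: the cut-off k = 5 is fixed by hand to match the table, the data vector x is rebuilt from the class counts rather than copied from No_of_Frauds, and the names k, observed, probs and expected are illustrative.

```r
# Same values as No_of_Frauds, rebuilt from the class counts (order is irrelevant here)
x <- c(rep(1, 72), rep(2, 17), rep(3, 5), rep(4, 3), 5, 14, 38, 44)

n      <- length(x)
lambda <- mean(x)   # Poisson rate estimated by the sample mean
k      <- 5         # assumption: values of 5 or more form one pooled class

# Observed counts for the classes 0, 1, ..., 4 and "5 or more";
# pmin() caps every value at k, the factor levels keep empty classes as 0
observed <- as.vector(table(factor(pmin(x, k), levels = 0:k)))

# Matching Poisson probabilities; the last one is the upper tail P(X >= 5)
probs    <- c(dpois(0:(k - 1), lambda),
              ppois(k - 1, lambda, lower.tail = FALSE))
expected <- n * probs

chi_squared <- sum((observed - expected)^2 / expected)

print(round(cbind(observed, expected), 2))
print(chi_squared)

# Cross-check: chisq.test() on the pooled classes gives the same statistic
tst <- chisq.test(observed, p = probs)
print(tst$statistic)
```

With the classes pooled this way the statistic should come out close to the 138.34 obtained in Excel, since every expected frequency is now above 5.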

 

My question is: how do I modify my R script so that the logic described in (A) is applied, i.e. classes are merged until the expected frequency of each class is at least 5, with the observed frequencies adjusted accordingly, so that the chi-square test statistic takes a reasonable value?
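One possible way to automate the choice of the pooled class is sketched below. The helper pool_poisson_tail() is hypothetical (it is not part of R or of any package), and it rests on a simplifying assumption: only the upper tail is pooled, so a low class such as 0 could still have an expected frequency below 5 for other data sets. The data vector x is again rebuilt from the class counts.

```r
# Hypothetical helper: pool the upper tail of a fitted Poisson distribution so
# that the last class "k or more" has an expected frequency >= min_expected.
pool_poisson_tail <- function(x, min_expected = 5) {
  n      <- length(x)
  lambda <- mean(x)

  # ppois(k, lower.tail = FALSE) is P(X >= k + 1): raise the cut-off k
  # while the class "k + 1 or more" would still be large enough
  k <- 0
  while (n * ppois(k, lambda, lower.tail = FALSE) >= min_expected) {
    k <- k + 1
  }

  # Classes 0, 1, ..., k - 1 and "k or more"
  observed <- as.vector(table(factor(pmin(x, k), levels = 0:k)))
  probs    <- c(dpois(seq_len(k) - 1, lambda),
                ppois(k - 1, lambda, lower.tail = FALSE))

  list(classes  = c(as.character(seq_len(k) - 1), paste0(k, "+")),
       observed = observed,
       probs    = probs,
       expected = n * probs)
}

# Usage with the same values as No_of_Frauds
x      <- c(rep(1, 72), rep(2, 17), rep(3, 5), rep(4, 3), 5, 14, 38, 44)
pooled <- pool_poisson_tail(x)
tst    <- chisq.test(pooled$observed, p = pooled$probs)

print(data.frame(class    = pooled$classes,
                 observed = pooled$observed,
                 expected = round(pooled$expected, 2)))
print(tst$statistic)
```

For this data the loop stops at the class "5 or more", which is the same grouping as in the table.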

 

I am also attaching the xls file for ready
reference.

 

I sincerely apologize for taking the liberty of writing such a long mail. I am very new to the R language; can someone help me out?

 

Thanking you in advance for your kind co-operation.

 

Ashok (Mumbai, India)

 

 

 

 

 



