[R] posterior probabilities from lda.predict

Sun Aug 31 01:20:49 CEST 2014

The predict.lda() function is answering a different question from the one you are posing. It answers the question: given the values observed on this object, what is the probability of membership in each of the groups used to construct the discriminant functions in the first place? Those probabilities sum to 1 and are generally called posterior probabilities. Your question is somewhat different: if this object were a member of group x, what is the probability that it would have values like these? These are typicality probabilities (how typical is this observation of this group).
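To make the distinction concrete, here is a minimal sketch. It assumes the built-in iris data as a stand-in training set (three groups rather than six) and a made-up observation, far_away, that is nothing like any of the groups. Neither comes from the original question.

```r
library(MASS)  # provides lda() and its predict() method

# Fit discriminant functions on iris, standing in for a real training set.
fit <- lda(Species ~ ., data = iris)

# A made-up observation lying far from every group centroid.
far_away <- data.frame(Sepal.Length = 12, Sepal.Width = 8,
                       Petal.Length = 0.5, Petal.Width = 5)

# Posterior probabilities are normalized across the groups in the model.
post <- predict(fit, newdata = far_away)$posterior
print(round(post, 4))
print(rowSums(post))  # sums to 1 however atypical the observation is
```

The posterior only says which of the known groups is the least bad fit. It carries no information about whether the observation is plausible for any of them.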

There are two ways to compute typicality probabilities. One is to use the reduced space defined by the discriminant functions and measure the distance of a new observation to the centroid of the group. This is the approach taken by SPSS, which provides the typicality for the group that has the highest posterior probability. Huberty and Olejnik recommend this procedure on the grounds that the probability distribution is known. The alternative approach, which is commonly used in compositional analysis, is to use the Mahalanobis distance in the original variable space, with the probability assumed to follow a chi-square distribution. I am not aware of a package that has a function to produce either of these.
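Both approaches can be sketched in a few lines of base R plus MASS. This is only an illustration under stated assumptions, not a tested implementation: iris again stands in for the training data, the function names typicality_full() and typicality_reduced() are made up here, the groups are assumed to share a pooled covariance matrix (as LDA itself assumes), and the squared distances are referred to a chi-square distribution with degrees of freedom equal to the number of variables (full space) or the number of discriminant functions (reduced space). The latter is my assumption about the reference distribution, not something checked against SPSS output.

```r
library(MASS)  # lda() and its predict() method

train  <- as.matrix(iris[, 1:4])   # stand-in training data
groups <- iris$Species
fit    <- lda(train, grouping = groups)

# Pooled within-group covariance matrix (divisor n - g).
centered <- train - fit$means[as.character(groups), ]
pooled   <- crossprod(centered) / (nrow(train) - nlevels(groups))

# Mahalanobis distance to each group centroid in the original variable
# space, converted to an upper-tail chi-square probability.
typicality_full <- function(newdata, means, cov) {
  newdata <- as.matrix(newdata)
  out <- sapply(seq_len(nrow(means)), function(g) {
    d2 <- mahalanobis(newdata, center = means[g, ], cov = cov)
    pchisq(d2, df = ncol(newdata), lower.tail = FALSE)
  })
  matrix(out, nrow = nrow(newdata), dimnames = list(NULL, rownames(means)))
}

# Squared Euclidean distance to each group centroid in discriminant space.
# lda() scales the discriminant functions so that the within-group
# covariance is spherical, so no further standardization is applied.
typicality_reduced <- function(fit, newdata) {
  scores    <- predict(fit, as.matrix(newdata))$x
  centroids <- predict(fit, fit$means)$x
  out <- sapply(seq_len(nrow(centroids)), function(g) {
    d2 <- rowSums(sweep(scores, 2, centroids[g, ])^2)
    pchisq(d2, df = ncol(scores), lower.tail = FALSE)
  })
  matrix(out, nrow = nrow(scores), dimnames = list(NULL, rownames(fit$means)))
}

# One genuine training observation and one made-up, very atypical one.
far_away <- data.frame(Sepal.Length = 12, Sepal.Width = 8,
                       Petal.Length = 0.5, Petal.Width = 5)
unknown  <- rbind(iris[1, 1:4], far_away)

tf <- typicality_full(unknown, fit$means, pooled)
tr <- typicality_reduced(fit, unknown)
print(signif(tf, 3))
print(signif(tr, 3))
```

Unlike posterior probabilities, the rows of these matrices are not forced to sum to 1, so an observation that belongs to none of the groups can have a value near zero in every column. Where to draw the cutoff for "does not belong" is a judgment call that this sketch does not settle.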

Huberty, Carl J. and Stephen Olejnik. 2006. Applied MANOVA and Discriminant Analysis. Second Edition. Wiley-Interscience.

David L. Carlson
Department of Anthropology
Texas A&M University

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Fraser D. Neiman
Sent: Friday, August 29, 2014 4:14 PM
To: r-help at r-project.org
Subject: [R] posterior probabilities from lda.predict

Dear All,

I have used the lda() function in the MASS package to estimate a set of discriminant functions to assign samples from a training set to one of six groups.  The cross-validation generates nearly perfect predictions for samples in the training set.  Hooray!

Now I want to use lda.predict() to estimate both discriminant function scores and probabilities of group membership for a second set of samples whose group membership is unknown.  For each unknown sample, lda.predict() produces six probabilities. These probabilities sum to one. So lda.predict() seems to assume that the unknown samples do, in fact, belong to one of the six groups.  

The problem is that it is nearly certain that some of the unknown samples in the second set do not belong to any of the six groups. For those samples, probabilities of group membership should be close to zero for all six groups.  In fact, identifying which samples are unlikely to belong to any of the six groups is a major goal of the analysis. 

So the question is, what is lda.predict() doing behind the scenes to force the group membership probabilities to sum to one? How do I get it to not do this and produce probabilities that accurately reflect the large Mahalanobis distances of some of the unknown samples from any group centroid?

I have searched the R-list archive on this and have found several folks asking similar questions, but no helpful answers.

Thanks very much!

Fraser