[R] posterior probabilities from lda.predict

Fri Aug 29 23:13:53 CEST 2014

Dear All,

I have used the lda() function in the MASS package to estimate a set of discriminant functions that assign samples from a training set to one of six groups.  Cross-validation gives nearly perfect predictions for the samples in the training set.  Hooray!

Now I want to use predict() on the fitted lda object (that is, predict.lda()) to estimate both discriminant function scores and probabilities of group membership for a second set of samples whose group membership is unknown.  For each unknown sample, predict.lda() produces six probabilities, one per group. These probabilities sum to one. So predict.lda() seems to assume that the unknown samples do, in fact, belong to one of the six groups.
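
For concreteness, here is a minimal sketch of the behavior described above. It uses the built-in iris data (three groups rather than six) as a stand-in, and the two "unknown" samples are made-up values chosen for illustration: one close to a group, one far from all of them.

```r
library(MASS)  # provides lda() and the predict() method for lda objects

# Stand-in training set: iris has three groups, not the six in my data
fit <- lda(Species ~ ., data = iris)

# Hypothetical unknown samples: row 1 is typical of versicolor,
# row 2 lies far from every group centroid
unknown <- data.frame(
  Sepal.Length = c(5.9, 20),
  Sepal.Width  = c(2.8, 20),
  Petal.Length = c(4.3, 20),
  Petal.Width  = c(1.3, 20)
)

pred <- predict(fit, newdata = unknown)

pred$x                   # discriminant function scores
pred$posterior           # one posterior probability per group
rowSums(pred$posterior)  # each row sums to one, even for the far-away sample
```

The posterior for the second row still sums to one, which is the behavior I am asking about.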

The problem is that it is nearly certain that some of the unknown samples in the second set do not belong to any of the six groups. For those samples, probabilities of group membership should be close to zero for all six groups.  In fact, identifying which samples are unlikely to belong to any of the six groups is a major goal of the analysis. 

So the question is, what is predict.lda() doing behind the scenes to force the group membership probabilities to sum to one? How do I get it to not do this, and instead produce probabilities that accurately reflect the large Mahalanobis distances of some of the unknown samples from every group centroid?
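
To make "large Mahalanobis distance from any group centroid" concrete, here is a rough sketch of the kind of quantity I have in mind, again on iris as a stand-in. This is not something predict.lda() returns. The pooled within-group covariance is computed by hand, the unknown samples are made up, and the chi-square cutoff is only an assumption that would hold approximately if the groups were multivariate normal.

```r
library(MASS)

fit <- lda(Species ~ ., data = iris)
X   <- as.matrix(iris[, 1:4])
g   <- iris$Species

# Pooled within-group covariance: center each sample on its own group mean
centered <- X - fit$means[as.character(g), ]
S <- crossprod(centered) / (nrow(X) - nlevels(g))

# Hypothetical unknown samples: a group centroid and a point far from all groups
unknown <- rbind(typical = fit$means["versicolor", ],
                 far     = rep(20, 4))

# Squared Mahalanobis distance from each unknown sample to each centroid
d2 <- sapply(rownames(fit$means), function(k) {
  mahalanobis(unknown, center = fit$means[k, ], cov = S)
})

# Distance to the nearest centroid, compared against an assumed cutoff
min_d2 <- apply(d2, 1, min)
cutoff <- qchisq(0.999, df = ncol(X))
flag   <- min_d2 > cutoff  # TRUE marks samples unlikely to belong to any group
```

Something like flag is what I would like to obtain for my six groups, ideally expressed as probabilities rather than a hard cutoff.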

I have searched the R-help archive on this and have found several people asking similar questions, but no helpful answers.

Thanks very much!

Fraser