[R] highly biased PCA data?

Liaw, Andy andy_liaw at merck.com
Fri Nov 5 01:53:59 CET 2004


I am no expert on this sort of matter, but that has never stopped me from
tossing in my $0.02...

As Gabor and Bert hinted, this is what I would try:

Run randomForest on the data, using sampsize=c(10, 10, 10) and
importance=TRUE, for example.  Then take the few most important variables
with respect to each class and maybe do PCA on those to see if you can see
separation.
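
A minimal sketch of that idea, assuming the randomForest package is
installed. The data are simulated purely for illustration: the matrix `x`,
the variable names x1..x8, the class shifts, and the choice of the top 2
variables per class are all made up; only the class counts (1000/100/10)
are borrowed from Dan's question.

```r
library(randomForest)  # assumption: the randomForest package is installed

set.seed(1)
# Hypothetical, simulated data: 1000 dog, 100 cat and 10 fish owners,
# with 8 numeric variables of which only x1 and x2 carry class signal.
n   <- c(dog = 1000, cat = 100, fish = 10)
cls <- factor(rep(names(n), times = n), levels = names(n))
x   <- matrix(rnorm(sum(n) * 8), ncol = 8,
              dimnames = list(NULL, paste0("x", 1:8)))
x[cls == "cat",  "x1"] <- x[cls == "cat",  "x1"] + 3
x[cls == "fish", "x2"] <- x[cls == "fish", "x2"] + 3

# Balanced stratified sampling: each tree draws 10 cases from each class.
rf <- randomForest(x, cls, strata = cls, sampsize = c(10, 10, 10),
                   importance = TRUE)

# Class-specific importance measures: one column per class level.
imp <- importance(rf)[, levels(cls)]
# Keep the top 2 variables for each class (2 is an arbitrary choice).
top <- unique(as.vector(apply(imp, 2, function(v)
  names(sort(v, decreasing = TRUE))[1:2])))

# PCA on the selected variables, plotted with one colour per class.
pc <- prcomp(x[, top], scale. = TRUE)
plot(pc$x[, 1:2], col = as.integer(cls), pch = 20)
legend("topright", legend = levels(cls),
       col = seq_along(levels(cls)), pch = 20)
```

Whether the classes actually separate in the first two components depends
on the data; the plot is only a quick visual check, not a test.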

HTH,
Andy

> From: Dan Bolser
> 
> On Thu, 4 Nov 2004, Berton Gunter wrote:
> 
> >
> >Dan:
> >
> >
> >1) There is no guarantee that PCA will show separate groups, of course,
> >as that is not its purpose, although it is frequently a side effect.
> >
> >2) If you were to use a classification method of some sort (discriminant
> >analysis, neural nets, SVMs, model-based classification, ...), my
> >understanding is that yes, severely unbalanced group membership would
> >indeed affect results. A guess is that Bayesian or other methods that
> >could explicitly model the prior membership probabilities would do
> >better. To make it clear why, suppose that there was a 99.9% preference
> >for "dog" and .05% each for the others. Then your datasets would have
> >almost no information on how covariates could distinguish the classes,
> >and the best classifier would be to call everything a "dog" no matter
> >what values the covariates had.
> >
> >I presume experts will have more and better to say about this.
> 
> Sounds interesting. Thanks very much for the input. Just out of
> curiosity, given that I can make my data more uniform (less biased), how
> could I best generate a 2d plot to encapsulate the clusters (and
> inter-cluster relationships)?
> 
> Actually I am thinking of a 2d density.
> 
> 
> >
> >-- Bert Gunter
> >Genentech Non-Clinical Statistics
> >South San Francisco, CA
> > 
> >"The business of the statistician is to catalyze the scientific
> >learning process."  - George E. P. Box
> > 
> > 
> >
> >> -----Original Message-----
> >> From: r-help-bounces at stat.math.ethz.ch 
> >> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Dan Bolser
> >> Sent: Thursday, November 04, 2004 9:41 AM
> >> To: R mailing list
> >> Subject: [R] highly biased PCA data?
> >> 
> >> 
> >> Hello, supposing that I have two or three clear categories for my
> >> data, let's say pet preference across fish, cat, dog. Let's say most
> >> people rate their preference as being mostly one of the categories.
> >> 
> >> I want to do PCA on the data to see three 'groups' of people, one
> >> group for fish, one for cat and one for dog. I would like to see the
> >> odd person who likes both or all three in the (appropriate) middle of
> >> the other main groups.
> >> 
> >> Will my data be affected by the fact that I have interviewed 1000 dog
> >> owners, 100 cat owners and 10 fish owners? (assuming that each scale
> >> of preference has an equal range). 
> >> 
> >> Cheers,
> >> dan.
> >> 
> >> ______________________________________________
> >> R-help at stat.math.ethz.ch mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide! 
> >> http://www.R-project.org/posting-guide.html
> >> 
> >
> 
>



