[R] LDA with previous PCA for dimensionality reduction

Torsten Hothorn Torsten.Hothorn at rzmail.uni-erlangen.de
Wed Nov 24 15:43:13 CET 2004


On Wed, 24 Nov 2004, Ramon Diaz-Uriarte wrote:

> Dear Christoph,
>
> I guess you want to assess the error rate of an LDA that has been fitted to a
> set of currently existing training data, and that in the future you will get
> some new observation(s) for which you want to make a prediction.
> Then, I'd say that you want to use the second approach. You might find that
> the first step turns out to be crucial and, after all, your whole subsequent
> LDA is contingent on the PC scores you obtain in the previous step.

Ramon,

as long as one does not use the information in the response (the class
variable, in this case), I don't think that one ends up with an
optimistically biased estimate of the error (although leave-one-out is
a suboptimal choice). Of course, once one starts to "tune" the method
used for dimension reduction, selecting the procedure with minimal
error will introduce a bias. Or am I missing something important?
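
To make the two procedures under discussion concrete, here is a minimal
sketch (an illustration only, not code posted in this thread). It assumes
`MASS::lda` and `stats::prcomp` (used here in place of `princomp`), takes
`iris` as toy data, standardises the variables and keeps k = 2 components;
the helper name `loocv_pca_lda` and its argument `refit_pca` are made up
for this example.

```r
library(MASS)  # lda()

# LOOCV error of "PCA, then LDA on the first k PC scores".
# refit_pca = FALSE: PCA fitted once on all rows (procedure 1).
# refit_pca = TRUE:  PCA refitted on the n-1 training rows of each fold
#                    (procedure 2).
# Hypothetical helper, written for illustration only.
loocv_pca_lda <- function(X, y, k = 2, refit_pca = TRUE) {
  X <- as.matrix(X)
  n <- nrow(X)
  pca_all <- prcomp(X, center = TRUE, scale. = TRUE)
  pred <- character(n)
  for (i in seq_len(n)) {
    pca <- if (refit_pca) {
      prcomp(X[-i, , drop = FALSE], center = TRUE, scale. = TRUE)
    } else {
      pca_all
    }
    # predict() on a prcomp object applies the stored centring, scaling
    # and rotation to new rows.
    train <- predict(pca, X[-i, , drop = FALSE])[, seq_len(k), drop = FALSE]
    test  <- predict(pca, X[i, , drop = FALSE])[, seq_len(k), drop = FALSE]
    fit <- lda(train, grouping = y[-i])
    pred[i] <- as.character(predict(fit, test)$class)
  }
  mean(pred != as.character(y))  # misclassification rate
}

X <- iris[, 1:4]
y <- iris$Species
err_once  <- loocv_pca_lda(X, y, k = 2, refit_pca = FALSE)
err_refit <- loocv_pca_lda(X, y, k = 2, refit_pca = TRUE)
print(c(pca_once = err_once, pca_per_fold = err_refit))
```

Since the PCA step never sees the class labels, the two estimates would be
expected to be close on data like these. Choosing k by picking the value
with the smallest cross-validated error is a different matter: that
selection would itself have to happen inside each fold.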

Btw, `ipred::slda' implements something not completely unlike the
procedure Christoph is interested in.
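
A rough sketch of how this might look, assuming the `ipred` package is
installed (the code is skipped otherwise). The two-class subset of `iris`
is toy data only, and the wrapper name `predict_class` is made up for
this example.

```r
# Assumes the 'ipred' package is available; nothing is run otherwise.
if (requireNamespace("ipred", quietly = TRUE)) {
  # Two-class subset of iris, used purely as toy data.
  d <- droplevels(iris[iris$Species != "setosa", ])

  # errorest() needs a predict function that returns the predicted
  # classes as a factor (hypothetical wrapper name).
  predict_class <- function(object, newdata)
    predict(object, newdata = newdata)$class

  # k = nrow(d) turns the k-fold cross-validation into leave-one-out;
  # slda() is refitted on the training rows of every fold.
  cv <- ipred::errorest(Species ~ ., data = d, model = ipred::slda,
                        predict = predict_class, estimator = "cv",
                        est.para = ipred::control.errorest(k = nrow(d)))
  print(cv$error)
}
```

Because the whole model is refitted in every fold, the dimension
reduction inside `slda` is repeated on each training set, which
corresponds to the second of the two procedures.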

Best,

Torsten

> Somewhat
> similar issues have been discussed in the microarray literature. Two
> references are:
>
>
> @ARTICLE{ambroise-02,
>   author = {Ambroise, C. and McLachlan, G. J.},
>   title = {Selection bias in gene extraction on the basis of microarray
> gene-expression data},
>   journal = {Proc Natl Acad Sci USA},
>   year = {2002},
>   volume = {99},
>   pages = {6562--6566},
>   number = {10},
> }
>
>
> @ARTICLE{simon-03,
>   author = {Simon, R. and Radmacher, M. D. and Dobbin, K. and McShane, L. M.},
>   title = {Pitfalls in the use of DNA microarray data for diagnostic and
> prognostic classification},
>   journal = {Journal of the National Cancer Institute},
>   year = {2003},
>   volume = {95},
>   pages = {14--18},
>   number = {1},
> }
>
>
> I am not sure, though, why you use PCA followed by LDA. But that's another
> story.
>
> Best,
>
>
> R.
>
> On Wednesday 24 November 2004 11:16, Christoph Lehmann wrote:
> > Dear all, not really an R question but:
> >
> > If I want to check the classification accuracy of an LDA with
> > previous PCA for dimensionality reduction by means of the LOOCV method:
> >
> > Is it ok to do the PCA on the WHOLE dataset ONCE and then run the LDA
> > with the CV option set to TRUE (runs LOOCV)?
> >
> > -- OR --
> >
> > do I need
> > - to compute a PCA for each training set (the n-1 observations)
> >   (-> my.princomp.1),
> > - then run the LDA on the training-set scores (-> my.lda.1),
> > - then compute the scores of the left-out observation using
> >   my.princomp.1 (-> my.scores.2),
> > - and only then use predict.lda(my.lda.1, my.scores.2) on the scores of
> >   the left-out observation
> >
> > ?
> > I have read some articles in which procedure 1 was chosen, but I am not
> > sure whether this is really correct.
> >
> > many thanks for a hint
> >
> > Christoph
> >
> > ______________________________________________
> > R-help at stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide!
> > http://www.R-project.org/posting-guide.html
>
> --
> Ramón Díaz-Uriarte
> Bioinformatics Unit
> Centro Nacional de Investigaciones Oncológicas (CNIO)
> (Spanish National Cancer Center)
> Melchor Fernández Almagro, 3
> 28029 Madrid (Spain)
> Fax: +-34-91-224-6972
> Phone: +-34-91-224-6900
>
> http://ligarto.org/rdiaz
> PGP KeyID: 0xE89B3462
> (http://ligarto.org/rdiaz/0xE89B3462.asc)