[R] Help in using PCR

Tue Jul 1 09:03:31 CEST 2008

On Tue, 2008-07-01 at 10:54 +1000, Jason Lee wrote:
> Hi,
> 
> Currently I have a dataset of 2400*408. And I would like to apply PCR method
> to study the any correlation between the tests.
> My current data is in data.frame and I have formed horizontal(1-407) to be
> the exact data, and (408) to be my results data(Yes and No)
> I have also binarized these Yes and No to 1 and -1s.
> 
> However, when I refer to PCR manual on R, the example of  yarn.pcr <-
> pcr(density ~ NIR, 6, data = yarn, validation = "CV"), I
> am not sure how can I adapt the command based line to my sample dataset.
> 

In the yarn data set, NIR is a matrix with columns representing near
infra-red spectra at 268 wavelengths (i.e. variables) on 28 yarns (the
samples, 7 of which are a test set). Take a look at:

library(pls)   # provides pcr() and the yarn data set
data(yarn)
str(yarn)

class(yarn$NIR)

A matrix is allowed on the rhs of a model formula, which is why this
works.
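
As a small illustration of a matrix sitting in a single data frame
column, here is a sketch on made-up data (the names toy, X and y are
invented for this example and have nothing to do with your data). It
only uses base R, so it shows how the formula machinery treats the
matrix rather than anything specific to pcr():

```r
set.seed(1)

# Hypothetical toy data: 20 samples, 5 predictors held in one matrix
X <- matrix(rnorm(20 * 5), nrow = 20,
            dimnames = list(NULL, paste0("w", 1:5)))
y <- rnorm(20)

# Store the whole matrix as ONE column of the data frame,
# in the same way that yarn stores NIR
toy <- data.frame(y = y)
toy$X <- X

is.matrix(toy$X)   # the column is still a matrix
dim(toy)           # 20 rows, 2 columns: y and X

# On the rhs of a formula the matrix expands to one
# model-matrix column per predictor
mm <- model.matrix(y ~ X, data = toy)
dim(mm)            # 20 rows, 6 columns: intercept plus 5 predictors
```

With your data the same idea would put columns 1 to 407 into one matrix
column and column 408 into the response.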

This is a reasonably standard model formula in R, something that you'll
come across more and more if you use R for any length of time. These
formulae are a symbolic way of describing the model in the form:

response ~ rhs

where response is (are) the response variable(s) or thing you are trying
to predict, ~ means "is modelled by", and rhs contains the definition of
the model matrix (i.e. the set of predictor or explanatory variables),
such as

density ~ var1 + var2 + var3*var4

(which includes main and interaction terms for var3 and var4 via the use
of the '*'). This says that density is modelled as a function of var1,
var2, var3 and var4, plus an interaction term between var3 and var4.
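
If you want to check how R expands such a formula, one way (a sketch
only; var1 to var4 are just the placeholder names from the formula
above, no data are needed) is to ask terms() for the term labels:

```r
# Expand the formula symbolically; no data set is required for this
f <- density ~ var1 + var2 + var3 * var4
labs <- attr(terms(f), "term.labels")
print(labs)
# Main effects come first, then the var3:var4 interaction
```

The ':' in the last label is how R writes the interaction term itself,
whereas '*' is shorthand for both main effects plus that interaction.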

In the main, you will see that the rhs normally refers directly to named
variables as in my last example. This would be tedious with 268
variables, so in the yarn example a matrix containing these 268
predictor variables is stated, rather than having to name all 268
wavelengths.

You can do this another way though, one that I feel is more natural.
Let's assume that your data frame contains named columns, that one of
these is the response variable, and that the remaining columns are the
predictors. Further assume that this response is called 'myresp'. Then
you can proceed as follows:

cancerv1.pcr <- pcr(myresp ~ . , ncomp = 6, data = cancerv1,
                    validation = "CV")

What this means is myresp is modelled by '.' and '.' is shorthand for
all variables in 'data' not currently in the model (i.e. myresp is not
included on the rhs). So as long as your data frame contains both the
response and the explanatory variables this will work.
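
To make that concrete, here is a sketch on a small made-up data frame
standing in for cancerv1 (the object d and the column names t1, t2, t3
and result are invented for the example; 3 predictors and 2 components
replace your 407 and 6). It names the response column, shows what '.'
expands to, and fits the model only if the pls package is installed:

```r
set.seed(42)

# Hypothetical stand-in for cancerv1: 3 tests plus a result column
d <- data.frame(matrix(rnorm(30 * 3), nrow = 30))
names(d) <- c("t1", "t2", "t3")
d$result <- sample(c(-1, 1), 30, replace = TRUE)   # Yes/No coded as 1/-1

# Give the last column the name used in the formula
names(d)[ncol(d)] <- "myresp"

# '.' expands to every column of 'data' except the response
rhs <- attr(terms(myresp ~ ., data = d), "term.labels")
print(rhs)

# Fit the PCR model only if pls is available on this machine
if (requireNamespace("pls", quietly = TRUE)) {
  library(pls)
  fit <- pcr(myresp ~ ., ncomp = 2, data = d, validation = "CV")
  summary(fit)
}
```

As an aside on indexing: cancerv1[, 1:407] selects columns 1 to 407,
whereas 1-407 evaluates to -406, so cancerv1[, 1-407] drops column 406
only.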

This is a fundamental feature of using R's modelling functions. As such
you need to become familiar with model formulae, so take a look
at ?formula and also at the relevant section in An Introduction to R:

http://cran.r-project.org/doc/manuals/R-intro.html

Or some of the introductory materials in the contributed documentation
section of the R website:

http://cran.r-project.org/other-docs.html

HTH

G

> It seems that they label each horizontal (columns) as NIR and followed by
> Density (which is my results data).  My doubt is
> do I have to label these data at the first place? If not, what
> variables/command that I should put in place of density?
> 
>  cancerv1.pcr<-pcr(cancerv1[,1-407],6,data=cancerv1,validation="CV")?
> 
> Please advise. Thanks.