[R] Factor tutorial?

Wed Oct 8 15:27:35 CEST 2008

Thank you very much. This will give me something to chew on for quite some time.

Kevin

---- Ted.Harding at manchester.ac.uk wrote: 
> On 07-Oct-08 22:23:22, Bert Gunter wrote:
> > But it **is** indexed in both of V&R's MASS and S Programming.
> > I have no idea whether the info there will be helpful to you,
> > of course. I would find (and have found) it so.
> > -- Bert Gunter
> 
> The discussion of factors in V&R is certainly quite comprehensive,
> but it is not for beginners!
> 
> A more elementary and very readable published text is Peter Dalgaard's
> "Introductory Statistics with R".
> 
> An even more introductory, but still adequate, account can be found
> in various places of Julian Faraway's "Practical Regression and Anova
> using R" which is on-line on CRAN under Documentation/Contributed.
> 
> However, you will need to piece together the bigger picture from
> passages found in various places. There is no index, but a search
> for "factor" in the PDF file throws up:
> pages 11; 69-70; Chapter 15 (160-167) -- especially section 15.2;
> Chapter 16 (168-203) -- though this deals mainly with factorial
> experimental designs.
> 
> A reference with more detail at the technical level from the R
> viewpoint (but still well spelt out) is John Maindonald's
> "Using R for Data Analysis and Graphics - Introduction, Examples
> and Commentary", especially section 2.4. This is also on-line in
> the same section of CRAN.
> 
> That being said, on the grounds that an introductory outline may
> also be useful to others, here is a summary.
> 
> Factors are variables which, essentially, introduce a "contingency
> table" structure into the data (and they can co-exist with variables
> which have quantitative interpretation).
> 
> A factor is a variable with categorical values -- an item is an "A",
> or a "B", or a "C", ... -- used in a particular way. It may or may
> not make sense to consider A, B, C, ... as ordered: A < B < C < ... say.
> For example, a variable called Sex may have values "M" (for Male)
> or "F" (for Female). Whether one can consider that M < F is something
> I will not discuss (though others may have a view).
> 
> Or Social Class may have categories A (highest) > B > C > D > E
> (lowest). Or, say, an ecological classification of terrain may use
> "Grassland", "Forest", "Swamp" with no implication of any ordering:
> they are all on the same footing.
> 
> The category labels of factors are called "Levels". As seen in the
> data, these labels may be alphabetic, numeric, or both -- e.g. M or F
> for Sex, which people also often code as 1 or 2 (but with no
> implication that 1 < 2); Terrain may be G, F or S or 1, 2, 3; Social
> Class may be subdivided into A1, A2, B1, B2, ... (with implied ordering
> A1 > A2 > B1 > B2 > ... ).
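
As a minimal sketch of how the categories above might be entered in R (the data values are made up for illustration, and the variable names sex, terrain and social are only placeholders):

```r
# Unordered factor: by default the levels are the sorted unique values
sex <- factor(c("M", "F", "F", "M"))
levels(sex)                      # "F" "M"

# Numeric codes in the data can be given readable labels;
# no ordering is implied
terrain <- factor(c(1, 3, 2, 1), levels = 1:3,
                  labels = c("Grassland", "Forest", "Swamp"))
table(terrain)

# Ordered factor: levels are listed from lowest to highest
social <- factor(c("B", "E", "A", "C"),
                 levels = c("E", "D", "C", "B", "A"), ordered = TRUE)
higher_than_C <- social > "C"    # comparisons respect the ordering
higher_than_C
```

Only the ordered factor supports comparisons such as `>`; applying them to an unordered factor gives NA with a warning.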
> 
> In regression analysis, the usefulness of factors is that they
> allow comparison between the outcomes for different levels of
> the factors. In simple cases the result may be as simple as
> the difference between the mean of cases with level A and the
> mean of cases with level B of a single factor.
> 
> This is where the plot starts to thicken. For example, if Terrain
> were coded 1, 2, 3 you would not want to treat these as quantitative
> values (even if they represented ordered levels). Instead, a factor
> with k levels is presented to the regression in terms of k "dummy
> variables". If the regression model has an intercept, then one
> level (the "base level") of the factor will be absorbed into the
> Intercept.
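
To illustrate why the coding matters, here is a small sketch with a hypothetical response y (the numbers are invented, not taken from any real data set), fitted once with Terrain as a number and once as a factor:

```r
# Hypothetical data: Terrain coded 1, 2, 3 and a made-up response y
d <- data.frame(Terrain = c(1, 1, 2, 2, 3, 3),
                y       = c(5.1, 4.9, 7.2, 6.8, 6.0, 6.2))

# Treated as a number: one slope, so level 3 is forced to differ
# from level 1 by exactly twice as much as level 2 does
cf_num <- coef(lm(y ~ Terrain, data = d))
cf_num

# Treated as a factor: intercept (base level 1) plus one dummy
# variable for each of the other levels
cf_fac <- coef(lm(y ~ factor(Terrain), data = d))
cf_fac
```

In the factor fit the intercept is the mean of y at the base level, and the other two coefficients are differences from it.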
> 
> So, for instance, data on Weight (kg) might look like
> 
>   Sex  Weight
>   M    69.5
>   F    60.2
>   F    65.7
>   M    72.5
>   ....
> 
> This would be transformed into
> 
>   Sex.M  Sex.F  Weight
>   1      0      69.5
>   0      1      60.2
>   0      1      65.7
>   1      0      72.5
> 
> where, now, the 0s and 1s will have their *quantitative* interpretation.
> So the regression model Weight ~ Sex now becomes the quantitative
> regression
> 
>   Weight = a + b.M*Sex.M + b.F*Sex.F + error
> 
> using the values 0 and 1 of Sex.M and Sex.F quantitatively.
> However, since Sex.F + Sex.M = 1 throughout, one is redundant
> in the presence of the intercept (whose "dummy" equivalent has
> value 1 throughout). Hence the results of this regression will
> usually be presented as Intercept together with the coefficient
> of (say) Sex.F. However, if you left out the Intercept, giving
> the model formula Weight ~ Sex - 1, then the above data matrix
> with both dummy variables Sex.M and Sex.F would be used in full
> in the regression, which would fit the equation
> 
>   Weight = b.M*Sex.M + b.F*Sex.F + error
> 
> without redundancy (and in this case the coefficients would be
> the mean of the weights of Males [b.M] and the mean of the
> weights of Females [b.F]).
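
The dummy variables described above can be inspected with model.matrix(). This is a sketch using the four rows shown earlier; the levels are given in the order M, F so that M is the base level as in the text (R would otherwise pick F, the first in alphabetical order, and name its columns SexM/SexF rather than Sex.M/Sex.F):

```r
d <- data.frame(Sex    = factor(c("M", "F", "F", "M"), levels = c("M", "F")),
                Weight = c(69.5, 60.2, 65.7, 72.5))

# With an intercept: base level M is absorbed, only SexF remains
X1 <- model.matrix(Weight ~ Sex, data = d)
X1

# Without an intercept: both dummy variables are used in full
X0 <- model.matrix(Weight ~ Sex - 1, data = d)
X0

# The no-intercept coefficients are the two group means
b <- coef(lm(Weight ~ Sex - 1, data = d))
b
tapply(d$Weight, d$Sex, mean)
```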
> 
> If there are two factors in the regression, say Sex (M/F) and
> Diet (M = meat-eater, V = vegetarian), then the possibilities
> are richer. One might then have, for the regression model
> 
>   Weight ~ Sex + Diet
> 
>   Sex.M  Sex.F  Diet.M  Diet.V  Weight
>   1      0      0       1       69.5
>   0      1      0       1       60.2
>   0      1      0       1       65.7
>   1      0      0       1       72.5
>   1      0      1       0       74.5
>   0      1      1       0       65.2
>   0      1      1       0       70.7
>   1      0      1       0       77.5
> 
> which would fit the equation
> 
>   Weight = a + b.S.F*Sex.F + b.D.V*Diet.V + error
> 
> with the same absorption of a base-level of each factor into the
> Intercept (since now we have 2 redundancies: for each factor,
> the two dummy variables add up to 1). The coefficient of Sex.F
> will represent a difference between Males and Females, the
> coefficient of Diet.V will represent a difference between
> meat-eaters and vegetarians. Because of the redundancies, an
> equivalent representation of the data used in the calculations is
> 
>   Sex.F  Diet.V  Weight
>   0      1       69.5
>   1      1       60.2
>   1      1       65.7
>   0      1       72.5
>   0      0       74.5
>   1      0       65.2
>   1      0       70.7
>   0      0       77.5
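
The reduced table above is what R builds for the additive model. A sketch with the same eight rows follows; the level orders are set by hand so that M (male) and M (meat-eater) are the base levels, which is an assumption made to match the text rather than R's default alphabetical choice:

```r
d <- data.frame(
  Sex    = factor(rep(c("M", "F", "F", "M"), 2), levels = c("M", "F")),
  Diet   = factor(rep(c("V", "M"), each = 4),    levels = c("M", "V")),
  Weight = c(69.5, 60.2, 65.7, 72.5, 74.5, 65.2, 70.7, 77.5))

# Columns: (Intercept), SexF, DietV
X <- model.matrix(Weight ~ Sex + Diet, data = d)
X

# Intercept = fitted weight for a male meat-eater;
# SexF and DietV are the differences from that base
b <- coef(lm(Weight ~ Sex + Diet, data = d))
b
```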
> 
> 
> But now we have the opportunity to ask: Is the difference
> between meat-eater and vegetarian Males the same as the
> difference between meat-eater and vegetarian Females? Now we
> need the Interaction -- the difference, between Males and
> Females, of the two differences between the two diets: one
> difference evaluated for Males, the other for Females. This
> leads to the regression model
> 
>   Weight ~ Sex * Diet, equivalent to Weight ~ Sex + Diet + Sex:Diet
> 
> and we now need a further dummy variable for the different
> combinations of levels of the two factors:
> 
>   Sex.F  Diet.V  Sex.F:Diet.V  Weight
>   0      1       0             69.5
>   1      1       1             60.2
>   1      1       1             65.7
>   0      1       0             72.5
>   0      0       0             74.5
>   1      0       0             65.2
>   1      0       0             70.7
>   0      0       0             77.5
> 
> where the variable Sex.F:Diet.V has the value 1 when Sex.F=1
> and Diet.V=1, and the value 0 otherwise.
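
The interaction column can be checked the same way. In this sketch (same eight rows, base levels again set by hand to M and meat-eater) the extra column is simply the product of the two dummy variables:

```r
d <- data.frame(
  Sex    = factor(rep(c("M", "F", "F", "M"), 2), levels = c("M", "F")),
  Diet   = factor(rep(c("V", "M"), each = 4),    levels = c("M", "V")),
  Weight = c(69.5, 60.2, 65.7, 72.5, 74.5, 65.2, 70.7, 77.5))

# Columns: (Intercept), SexF, DietV, SexF:DietV
X <- model.matrix(Weight ~ Sex * Diet, data = d)
X

# The interaction dummy is 1 only where both SexF and DietV are 1
same <- all(X[, "SexF:DietV"] == X[, "SexF"] * X[, "DietV"])
same

b <- coef(lm(Weight ~ Sex * Diet, data = d))
b
```

In these illustrative numbers the diet difference is the same for both sexes, so the estimated interaction coefficient comes out as zero up to rounding error.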
> 
> This is all very basic and straightforward (though it can appear
> more complicated in richer problems). But the point about using
> a variable of "factor" type in R is beginning to emerge. When
> there is a factor with k levels, you need (k-1) dummy variables
> as quantitative variables for the regression. Interactions
> introduce further dummy variables. For all this to happen, a
> variable which is going to be used as a factor needs a special
> representation inside R, so that R knows how to set about
> constructing all that stuff. So, in R, a factor is not a simple
> list of levels (like c("M","F","F","M","M","F","F","M")), but 
> a more elaborate encoding, and a more complex structure.
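
A quick way to see that encoding is to take a factor apart. This is a sketch; the second half shows a common stumbling block with numeric-looking labels, which is an added illustration rather than something discussed above:

```r
sex <- factor(c("M", "F", "F", "M", "M", "F", "F", "M"))

levels(sex)             # the distinct labels: "F" "M"
codes <- as.integer(sex)
codes                   # integer codes pointing into the levels
attributes(sex)         # the levels and the class "factor"

# When the labels look like numbers, as.numeric() returns the
# codes, not the labels; convert via character to get the values
f <- factor(c("10", "20", "10"))
as.numeric(f)                         # 1 2 1
vals <- as.numeric(as.character(f))
vals                                  # 10 20 10
```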
> 
> Once past this stage, there is then the question of what
> system of *contrasts* is going to be used. For 2-level factors
> (as above) there are not many issues which arise -- the effect
> of a factor corresponds to a simple difference between the
> results corresponding to its two levels. But, say, for the
> Terrain factor (G,F,S) there are several ways in which differences
> can be formulated. For example:
>   G, F-G, S-G ("treatment contrasts")
> 
> Or, for Social Class (ordered, A>B>C>D>E)
>   D-E, C-D, B-C, A-B ("successive difference contrasts")
>   E, D-E, C-(mean of D&E), B-(mean of C&D&E), A-(mean of B&C&D&E)
>     ("Helmert contrasts")
> 
> and so on. What system of contrasts you use will depend on what
> aspects of the differences between categories you are interested in.
> 
> And then the contrast specification also has to be part of the
> specification of a factor (since it determines how to compute
> the dummy variables which will represent it in the regression).
> See John Maindonald's on-line book.
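
As a rough sketch of how contrasts attach to a factor in R, using the three Terrain levels G, F, S (this assumes R's default setting of options("contrasts"), under which unordered factors get treatment contrasts). Note that R's contr.helmert() compares each level with the mean of the preceding levels, so its layout may differ from the Helmert scheme written out above; successive differences are available as contr.sdif() in the MASS package and are not run here.

```r
terrain <- factor(c("G", "F", "S"), levels = c("G", "F", "S"))

# Default for an unordered factor: treatment contrasts, base level G
ct <- contrasts(terrain)
ct

# Helmert contrasts for a 3-level factor, as R defines them
ch <- contr.helmert(3)
ch

# Attaching a contrast matrix changes how the dummy variables
# are computed when the factor is used in a model formula
contrasts(terrain) <- contr.helmert(3)
X <- model.matrix(~ terrain)
X
```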
> 
> Hoping this helps!
> Ted.
> 
> > -----Original Message-----
> > From: r-help-bounces at r-project.org
> > [mailto:r-help-bounces at r-project.org] On
> > Behalf Of rkevinburton at charter.net
> > Sent: Tuesday, October 07, 2008 2:29 PM
> > To: r-help at r-project.org
> > Subject: [R] Factor tutorial?
> > 
> > This is probably a very basic question. I want to understand factors
> > but I am not sure where to turn. Looking up factor in the Chambers
> > book doesn't even show up in the index. Maybe I am just slow but
> > ?factor doesn't help either. Would someone please point me to a very
> > basic tutorial where I can see what the usefulness of factors is (so
> > far they have just gotten in the way).
> > 
> > Thank you.
> > 
> > Kevin
> > 
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> 
> --------------------------------------------------------------------
> E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
> Fax-to-email: +44 (0)870 094 0861
> Date: 08-Oct-08                                       Time: 01:30:31
> ------------------------------ XFMail ------------------------------