[R] Subsetting dataframes

CG Pettersson cg.pettersson at vpe.slu.se
Thu Jul 19 11:52:09 CEST 2007

Dear all!

W2k, R 2.5.1

I am working with an ongoing malting barley variety evaluation within
Sweden. The structure is 25 cultivars tested each year at four sites, in
field trials with three replicates and 'lattice' structure (the replicates
are divided into five sub blocks in a structured way). As we normally
keep around 15 varieties from each year to the next, and take in 10 new
ones for the next year, we have tested a total of 72 different varieties
during five years.

I store the data in a field trial database, and generate text tables with
the subset of data I want, and import them into R as a data frame. I load
all cultivars into R and use 'subset' to select what I want to look at. Using
lme{nlme} works with no problems to get mean results over the years, but
as I now have a number of years I want to analyse the general site x
cultivar relation. I am testing AMMI{agricolae} for this and it seems to
work except for the subsetting. This is what happens:

If I do the subsetting like this:

x62_samvar <- subset(x62_5, cn %in%
c("Astoria","Barke","Christina","Makof", "Prestige","Publican","Quench"))

A test run with AMMI seems to work in the first part:

> AMMI(site, cn, rep, yield)

Class level information

ENV:  Hag Klb Bjt Ska
GEN:  Astoria Prestige Makof Christina Publican Quench
REP:  1 2 3

Number of observations:  240

model Y: yield  ~ ENV + REP%in%ENV + GEN + ENV:GEN

Analysis of Variance Table

Response: Y
           Df    Sum Sq   Mean Sq F value    Pr(>F)
ENV         3 120092418  40030806 90.0424 1.665e-06 ***
REP(ENV)    8   3556620    444578  0.5674  0.803923
GEN         5  21376142   4275228  5.4564 9.680e-05 ***
ENV:GEN    15  28799807   1919987  2.4504  0.002555 **
Residuals 208 162973213    783525
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Coeff var       Mean yield
13.08629         6764.098

After this something goes wrong, as AMMI finds a cultivar name not
selected in the subsetting. (The plotting might go wrong anyhow, but I
haven't got that far yet):

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev =
object$xlevels) :
        factor 'y' has new level(s) Arkadia

Looking at the dataframe using

> edit(x62_samvar)

only shows the selected lines, but using levels() gives another answer as

> levels(x62_samvar$cn)

gives back all 72 cultivar names used during the five years (starting with

Where do I go wrong and how do I use subset in a proper way?
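
For reference, the behaviour described above (rows filtered, but levels()
still listing every name) can be reproduced on a small scale. The sketch
below uses a made-up data frame `d` with invented yields; only the column
names `cn` and `yield` are taken from the post. It shows that subset()
keeps unused factor levels, and that rebuilding the factor with factor()
is one common way to drop them. Whether this alone removes the AMMI error
is an assumption, not something confirmed here.

```r
# Hypothetical toy data: three cultivars, two observations each
d <- data.frame(
  cn    = factor(c("Arkadia", "Arkadia", "Astoria", "Astoria", "Barke", "Barke")),
  yield = c(6500, 6700, 6900, 7100, 6400, 6600)
)

# Keep only two of the three cultivars
s <- subset(d, cn %in% c("Astoria", "Barke"))

# The rows are filtered, but the factor still carries the unused level
levels_before <- levels(s$cn)

# Rebuilding the factor drops levels that no longer occur in the data
s$cn <- factor(s$cn)
levels_after <- levels(s$cn)

print(levels_before)
print(levels_after)
```

Applied to the real data, the same idea would read
x62_samvar$cn <- factor(x62_samvar$cn) after the subset() call.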


CG Pettersson, PhD
Swedish University of Agricultural Sciences (SLU)
Dept. of Crop Production Ecology. Box 7043.
SE-750 07 Uppsala, Sweden
cg.pettersson at vpe.slu.se
