[R] caret() train based on cross validation - split dataset to keep sites together?

Tyrell Deweber jtdeweber at gmail.com
Wed May 30 13:55:32 CEST 2012


Hello all, 

I have searched but have not yet found a solution, so I am posting here. In
short, I need to split my data into training, validation, and testing subsets
that keep all observations from the same site together, preferably as part of
a cross-validation procedure. Now for the longer version. I must confess that
although my R skills are improving, they are not highly developed.

I am using 10-fold cross validation with 3 repeats in the train() function of
the caret package to identify an optimal nnet (neural network) model for
predicting daily river water temperature at unsampled sites. I am also
withholding the data from 10% of sites to get a better estimate of
generalization error. However, as far as I can tell, this focus on predictions
at new sites is not easily accommodated. My data structure (example at the
bottom of this email) consists of columns identifying the site, the date, the
water temperature at that site on that day (the response variable), and many
predictors. There are over 220,000 individual observations at ~1,000 sites,
and each site has a minimum of 30 observations. It is important to keep sites
separate because selecting a model based on predictions at an already-sampled
site is likely to be overly optimistic.

Is there a way to split the data for (or preferably during) the
cross-validation procedure that:

1.) Selects a separate validation dataset from 10% of sites;
2.) Splits the remaining training data into cross-validation subsets while,
most importantly, keeping all observations from a site together (a rough
sketch follows this list); and
3.) Secondarily, constrains the partitions to be similar, ideally based on the
distributions of all variables?
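Here is a rough, untested sketch of what I am imagining for point 2, using the
temp.train data frame and Comid site column from the script at the bottom.
Folds are assigned at the site level and passed to trainControl() through its
index argument (a list of training-row index vectors, one per resample), so
every resample keeps all rows from a site on the same side of the split:

library(caret)

set.seed(42)
sites   <- unique(temp.train$Comid)
repeats <- 3
k       <- 10

cvIndex <- list()
for (r in seq_len(repeats)) {
  # assign each site (not each row) to one of k folds at random
  foldOfSite <- sample(rep(seq_len(k), length.out = length(sites)))
  for (f in seq_len(k)) {
    heldOutSites <- sites[foldOfSite == f]
    # training rows for this resample: all rows from sites not held out
    cvIndex[[sprintf("Fold%02d.Rep%d", f, r)]] <-
      which(!(temp.train$Comid %in% heldOutSites))
  }
}

# rows absent from an element of `index` form that resample's hold-out set
fitControl <- trainControl(method = "cv", index = cvIndex)

This does nothing for point 3, though; balancing the folds on the
distributions of all predictors would need an extra step.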

It seems that some combination of the sample.split() function from the caTools
package and the createDataPartition() function from caret might do this, but I
am at a loss as to how to code it.
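For the first step, I gather that createDataPartition() can be applied to a
vector of site-level values rather than to the individual rows. A rough,
untested sketch, with tempdata standing in for my full data frame and the
per-site mean water temperature as the stratification variable (any site-level
summary could be used):

library(caret)

# one stratification value per site: the site's mean water temperature
siteMeans <- tapply(tempdata$watmntemp, tempdata$Comid, mean)
sites     <- names(siteMeans)

set.seed(1)
# keep 90% of the sites; createDataPartition() stratifies on siteMeans
inTrainSites <- createDataPartition(siteMeans, p = 0.9, list = FALSE)

isTrainRow    <- as.character(tempdata$Comid) %in% sites[inTrainSites]
temp.validate <- tempdata[!isTrainRow, ]  # 10% of sites, all rows together
temp.train    <- tempdata[isTrainRow, ]   # remaining 90% of sites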

If this is not possible, I would be content to skip the cross-validation
procedure and instead create three similar splits of my data that keep all
observations from a site together: one for training, one for testing, and one
for validation. The alternative goal would be to split the data so that 80% of
sites are used for training, 10% of sites for testing (model selection), and
10% of sites for validation.
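A rough sketch of that fallback (again untested, with tempdata standing in for
the full data frame): shuffle the site IDs once and cut them 80/10/10, so
every row of a given site lands in exactly one subset:

set.seed(7)
sites <- sample(unique(tempdata$Comid))  # shuffled site IDs
n     <- length(sites)
grp   <- cut(seq_len(n), breaks = c(0, 0.8 * n, 0.9 * n, n),
             labels = c("train", "test", "validate"))

temp.train    <- tempdata[tempdata$Comid %in% sites[grp == "train"], ]
temp.test     <- tempdata[tempdata$Comid %in% sites[grp == "test"], ]
temp.validate <- tempdata[tempdata$Comid %in% sites[grp == "validate"], ]

This makes no attempt to keep the three subsets distributionally similar,
which is the part I am least sure how to do.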

Thank you, and please let me know if you have any questions. This is also my
first post, so if I left anything out, that would be good to know as well.

Tyrell Deweber



R version 2.13.1 (2011-07-08)
Copyright (C) 2011 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-redhat-linux-gnu (64-bit)

EXAMPLE DATA STRUCTURE

Comid   tempymd       watmntemp   airtemp   predictorb
15433   1980-05-01    11.4        22.1
15433   1980-05-02    11.6        23.6
15433   1980-05-03    11.2        28.5
15687   1980-06-01    13.5        26.5
15687   1980-06-02    14.2        26.9
15687   1980-06-03    13.8        28.9
18994   1980-04-05     8.4        16.4
18994   1980-04-06     8.3        12.6
90342   1980-07-13    18.9        22.3
90342   1980-07-14    19.3        28.4


EXAMPLE SCRIPT FOR MODEL FITTING


library(caret)   # train(), trainControl()
library(doMC)    # registerDoMC() for parallel resampling

fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

# grid of nnet tuning parameters (size, decay)
tuning <- read.table("temptunegrid.txt", header = TRUE, sep = ",")
tuning

# Model with 100 iterations
registerDoMC(4)
tempmod100its <- train(watmntemp ~ tempa + tempb + tempc + tempd + tempe +
                         netarea + netbuffor + strmslope + netsoilprm +
                         netslope + gwndx + mnaspect + urb + ag + forest +
                         buffor + tempa7day + tempb7day + tempc7day +
                         tempd7day + tempe7day + tempa30day + tempb30day +
                         tempc30day + tempd30day + tempe30day,
                       data = temp.train, method = "nnet", linout = TRUE,
                       maxit = 100, MaxNWts = 100000, metric = "RMSE",
                       trControl = fitControl, tuneGrid = tuning,
                       trace = TRUE)
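If the site-grouped folds from the earlier sketch (cvIndex) are the right
approach, I believe the only line that would need to change is the
trainControl() call, with everything else left as-is:

fitControl <- trainControl(method = "cv", index = cvIndex)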


