[R] p-values and temporal uncertainty in variables

Yahoo enryu_crema at yahoo.it
Mon May 18 19:00:14 CEST 2015


Dear List,

This is one of those questions that is easier to explain with an example. I also apologise, as strictly speaking this is not an R question but a more general methodological question that I would like to solve in R.

Suppose we have the following data

#Generate artificial data#
# nine possible "true" dates, each with its own dating error
DateMeans <- c(900, 1000, 1250, 1300, 1380, 1400, 1400, 1500, 1550)
DateSDs <- c(300, 400, 100, 100, 60, 60, 120, 100, 50)
index <- sample(1:9, size = 400,
                prob = c(0.238, 0.119, 0.238, 0.048, 0.095, 0.048, 0.048, 0.024, 0.143),
                replace = TRUE)
A1 <- rnorm(100, mean = 5, sd = 1)
A2 <- rnorm(100, mean = 5, sd = 1)
B1 <- rnorm(100, mean = 5, sd = 2)
B2 <- rnorm(100, mean = 5, sd = 1.5)

data <- data.frame(type = c(rep("a", 200), rep("b", 200)),
                   size = c(A1, A2, B1, B2),
                   dateAvg = DateMeans[index] + round(runif(400, min = 0, max = 100)),
                   dateSd = DateSDs[index] + round(runif(400, min = 0, max = 20)))
head(data)

Let’s say we are measuring the size of two objects labelled “a” and “b” (the column “type”), and we are interested in whether one is bigger than the other. The sizes are normally distributed, so a standard t-test would suffice. However, suppose also that these objects have a time-stamp (column dateAvg), each with some degree of uncertainty (column dateSd) described as a Gaussian error. If I want to know the average size of type “a” for a given interval (let’s say between 1300 and 1400), I could simply resample as follows:

nsim <- 1000                    #number of simulations
interestBlock <- c(1300, 1400)  #time interval of interest
MeanOfA <- numeric(length = nsim)

for (s in 1:nsim)
{
  # draw one plausible date per object from its Gaussian error
  simdates <- rnorm(nrow(data), mean = data$dateAvg, sd = data$dateSd)
  # keep only the objects whose simulated date falls within the interval
  i <- which(simdates >= interestBlock[1] & simdates <= interestBlock[2])
  MeanOfA[s] <- mean(subset(data[i, ], type == "a")$size)
}

This will give me a distribution of means that incorporates the dating uncertainty into the summary statistic. I think there is nothing wrong with this.
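For example, the spread of the simulated means can then be summarised directly (a quick sketch using the MeanOfA vector from the loop above):

#Summarise the Monte Carlo distribution of means#
mean(MeanOfA)                        # point estimate
quantile(MeanOfA, c(0.025, 0.975))   # 95% interval reflecting the dating uncertainty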

But suppose I want to test whether objects of type “a” differ in size from objects of type “b”. I don’t think comparing the two sets of simulated means directly is fair, as the significance would change as a function of the number of simulations. All I can think of is to compute a t-test at each iteration, something like the following:


nsim <- 1000
interestBlock <- c(1400, 1500)
statistic <- numeric(length = nsim)
pvals <- numeric(length = nsim)
diffs <- numeric(length = nsim)

for (s in 1:nsim)
{
  # draw one plausible date per object, then subset to the interval
  simdates <- rnorm(nrow(data), mean = data$dateAvg, sd = data$dateSd)
  i <- which(simdates >= interestBlock[1] & simdates <= interestBlock[2])
  tmpdata <- data[i, ]
  # difference in means and Welch t-test for this iteration
  diffs[s] <- mean(subset(tmpdata, type == "a")$size) - mean(subset(tmpdata, type == "b")$size)
  res <- t.test(x = subset(tmpdata, type == "a")$size, y = subset(tmpdata, type == "b")$size)
  statistic[s] <- res$statistic
  pvals[s] <- res$p.value
}

This then gives me a distribution of p-values (which I frankly find useless). Is there a way to get a single p-value? The only alternative I can think of is to compute a 95% interval for the difference in means and check whether or not it includes zero…
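To make that alternative concrete, something along these lines (a rough sketch using the diffs vector from the loop above):

#Monte Carlo interval for the difference in means#
mean(diffs)                        # average simulated difference a minus b
quantile(diffs, c(0.025, 0.975))   # does this 95% interval include zero?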

Many thanks in advance,
Enrico


