[R] estimation problem

Fri May 4 19:43:32 CEST 2012

Dear Petr,

thank you for your input.
I tried to experiment with (probably somewhat biased) truncated means 
like in the following code.
How I got the 225 as a truncation limit is a good question. :)

REPS1 <- REPS2 <- 1000
N1 <- 100000
N2 <- 30000
N <- N1+N2
x1 <- rep(0,N1)
x2 <- rnorm(N2,300,100)
x <- c(x1,x2)

n <- 1000

for (i in 1:REPS1){
   x_sample <- sort(sample(x,n,replace=FALSE),TRUE)
   x_trunc <- x_sample[1:225]
   REPS1[i] <- mean(x_sample)*N
   REPS2[i] <- sum(x_trunc)/n*N
   }

sum(x2)
mean(REPS1)
mean(REPS2)
sd(REPS1)
sd(REPS2)
sd(REPS2)/sd(REPS1)

Best,
daniel

2012.05.03. 17:45 keltezéssel, Petr Savicky írta:
> On Thu, May 03, 2012 at 03:08:00PM +0200, Kehl Dániel wrote:
>> Dear List-members,
>>
>> I have a problem where I have to estimate a mean, or a sum of a
>> population but for some reason it contains a huge amount of zeros.
>> I cannot give real data but I constructed a toy example as follows
>>
>> N1<- 100000
>> N2<- 3000
>> x1<- rep(0,N1)
>> x2<- rnorm(N2,300,100)
>> x<- c(x1,x2)
>>
>> n<- 1000
>>
>> x_sample<- sample(x,n,replace=FALSE)
>>
>> I want to estimate the sum of x based on x_sample (not knowing N1 and N2
>> but their sum (N) only).
>> The sample mean has a huge standard deviation I am looking for a better
>> estimator.
> Hi.
>
> I do not know the exact answer, but let me formulate the following observation.
> If the question is redefined to estimate the mean of nonzero numbers, then
> an estimate is mean(x_sample[x_sample != 0]). Its standard deviation in your
> situation may be estimated as
>
>    res<- rep(NA, times=1000)
>    for (i in seq.int(along=res)) {
>        x_sample<- sample(x,n,replace=FALSE)
>        res[i]<- mean(x_sample[x_sample != 0])
>    }
>    sd(res)
>
>    [1] 18.72677 # this varies with the seed a bit
>
> The observation is that this cannot be improved much, since the estimate
> is based on a very small sample. The average size of the sample of nonzero
> values is N2/(N1+N2)*n = 29.1. So, the standard deviation should be
> something close to 100/sqrt(29.1) = 18.5376.
>
> Petr Savicky.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.