[R] quantreg speed

Sun Nov 16 02:19:57 CET 2014

You can time it yourself on increasingly large subsets of your data.  E.g.,

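> library(quantreg)  # provides rq()
> dat <- data.frame(x1=rnorm(1e6), x2=rnorm(1e6),
+                   x3=sample(c("A","B","C"), size=1e6, replace=TRUE))
> dat$y <- with(dat, x1 + 2*(x3=="B")*x2 + rnorm(1e6))
> # Time rq() on the first n rows for n = 4^3, ..., 4^10; system.time() returns 5 numbers.
> # Note: 4^10 = 1048576 exceeds the 1e6 rows in dat, so the last subset has NA-padded rows.
> t <- vapply(n <- 4^(3:10), FUN=function(n) {
+     d <- dat[seq_len(n), ]
+     print(system.time(rq(data=d, y ~ x1 + x2*x3, tau=0.9)))
+   }, FUN.VALUE=numeric(5))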
   user  system elapsed
      0       0       0
   user  system elapsed
      0       0       0
   user  system elapsed
   0.02    0.00    0.01
   user  system elapsed
   0.01    0.00    0.02
   user  system elapsed
   0.10    0.00    0.11
   user  system elapsed
   1.09    0.00    1.10
   user  system elapsed
  13.05    0.02   13.07
   user  system elapsed
 273.30    0.11  273.74
> t
           [,1] [,2] [,3] [,4] [,5] [,6]  [,7]   [,8]
user.self     0    0 0.02 0.01 0.10 1.09 13.05 273.30
sys.self      0    0 0.00 0.00 0.00 0.00  0.02   0.11
elapsed       0    0 0.01 0.02 0.11 1.10 13.07 273.74
user.child   NA   NA   NA   NA   NA   NA    NA     NA
sys.child    NA   NA   NA   NA   NA   NA    NA     NA

Do some regressions on t["elapsed",] as a function of n and predict up to
n=10^7.  E.g.,
> summary(lm(t["elapsed",] ~ poly(n,4)))

Call:
lm(formula = t["elapsed", ] ~ poly(n, 4))

Residuals:
         1          2          3          4          5          6          7          8
-2.375e-03 -2.970e-03  4.484e-03  1.674e-03 -8.723e-04  6.096e-05 -9.199e-07  2.715e-09

Coefficients:
             Estimate Std. Error  t value Pr(>|t|)
(Intercept) 3.601e+01  1.261e-03 28564.33 9.46e-14 ***
poly(n, 4)1 2.493e+02  3.565e-03 69917.04 6.45e-15 ***
poly(n, 4)2 5.093e+01  3.565e-03 14284.61 7.57e-13 ***
poly(n, 4)3 1.158e+00  3.565e-03   324.83 6.43e-08 ***
poly(n, 4)4 4.392e-02  3.565e-03    12.32  0.00115 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.003565 on 3 degrees of freedom
Multiple R-squared:      1,     Adjusted R-squared:      1
F-statistic: 1.273e+09 on 4 and 3 DF,  p-value: 3.575e-14
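
A minimal sketch of the prediction step, with the elapsed times copied from the output above into a data frame so that predict() can be given a new n. The names timings, fit and pred are mine, not part of the session above. Keep in mind that a degree-4 polynomial fitted to 8 points and extrapolated ten times beyond the largest n is a crude guide at best.

```r
# Elapsed times (seconds) copied from the timing output, for n = 4^3, ..., 4^10.
timings <- data.frame(
  n       = 4^(3:10),
  elapsed = c(0, 0, 0.01, 0.02, 0.11, 1.10, 13.07, 273.74)
)

# Same model as in the summary, but fitted from a data frame
# so that predict() can find n in newdata.
fit <- lm(elapsed ~ poly(n, 4), data = timings)

# Extrapolate to n = 10^7 (far outside the fitted range).
pred <- predict(fit, newdata = data.frame(n = 1e7))
cat("predicted elapsed seconds at n = 1e7:", format(unname(pred)), "\n")
```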

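It does not look good for n=10^7: the last fourfold increase in n (4^9 to
4^10) raised the elapsed time about 21-fold, from 13.07 to 273.74 seconds.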
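
As a cross-check that does not rely on a high-degree polynomial, one could fit a power law, elapsed ~ c * n^k, by regressing log(elapsed) on log(n). This is only a sketch and not what was run above: the 0.1-second cutoff and the names big, slope and pred_sec are my own choices, and the local growth rate appears to increase with n, so a single exponent may understate the time at n = 10^7.

```r
# Elapsed times (seconds) copied from the timing output, for n = 4^3, ..., 4^10.
timings <- data.frame(
  n       = 4^(3:10),
  elapsed = c(0, 0, 0.01, 0.02, 0.11, 1.10, 13.07, 273.74)
)

# Drop runs too short for the clock to resolve (zeros cannot be logged).
# The 0.1 s cutoff is an arbitrary assumption.
big <- subset(timings, elapsed >= 0.1)

# Power-law fit: log(elapsed) = log(c) + k * log(n).
fit <- lm(log(elapsed) ~ log(n), data = big)
slope <- unname(coef(fit)[2])

# Extrapolate to n = 10^7 and transform back to seconds.
pred_sec <- unname(exp(predict(fit, newdata = data.frame(n = 1e7))))

cat("estimated exponent k:", round(slope, 2), "\n")
cat("extrapolated elapsed seconds at n = 1e7:", round(pred_sec), "\n")
```

An exponent well above 1 means the run time grows faster than linearly in n, which is the point of the remark above.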

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Sat, Nov 15, 2014 at 12:12 PM, Yunqi Zhang <yqzhang at ucsd.edu> wrote:

> Hi all,
>
> I'm using quantreg rq() to perform quantile regression on a large data set.
> Each record has 4 fields and there are about 18 million records in total. I
> wonder if anyone has tried rq() on a large dataset and how long I should
> expect it to take. Or is it simply too large, so that I should subsample the
> data? I would like to have an idea before I start the run and wait forever.
>
> In addition, I would appreciate it if anyone could give me a rough idea of
> how long rq() takes to run for a given dataset size.
>
> Yunqi
