[R] quantreg speed

Yunqi Zhang yqzhang at eng.ucsd.edu
Sun Nov 16 19:49:16 CET 2014


Hi Roger,

Thank you for your reply. As I understand it, changing the regression method only speeds up the computation; it does not necessarily fix the problem at the 99th percentile, where the p-values for all the factors are 1.0. How should I interpret the 99th-percentile result, given that the results for the other percentiles seem fine?

Correct me if I’m wrong.
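(A note for the archive: the all-1.0 p-values at tau = 0.99 come from the default sparsity-based standard errors breaking down in the extreme tail, which is what the "non-positive fis" warning further down signals. One possible workaround is bootstrap standard errors via summary(..., se = "boot"), which resample rows instead of estimating a density. A sketch with a small simulated data frame standing in for the real data_stats; the variable names are placeholders:)

```r
library(quantreg)

## Small simulated stand-in for the real data_stats (placeholder names).
set.seed(1)
n <- 5000
d <- data.frame(f1 = rnorm(n), f2 = rnorm(n), f3 = rnorm(n))
d$output <- 100 - 20 * d$f2 + rnorm(n)

fit <- rq(output ~ f1 * f2 * f3, tau = 0.99, data = d)

## se = "boot" resamples rows rather than estimating the sparsity
## (density) function, so it avoids the density estimate that fails
## ("non-positive fis") in the extreme tail.
summary(fit, se = "boot", R = 100)
```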

Thank you!

Yunqi
On Nov 16, 2014, at 8:42 AM, Roger <rkoenker at illinois.edu> wrote:

> You could try method = "pfn".
> 
> Sent from my iPhone
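(For later readers of the archive: quantreg's fast interior-point options for large n are method = "fn" and method = "pfn", the latter adding a preprocessing step aimed at very large samples. A minimal sketch on simulated data, with the sample size scaled well down from the 18-million-row problem:)

```r
library(quantreg)

set.seed(1)
n <- 2e5                     # scaled-down stand-in for the full dataset
d <- data.frame(f1 = rnorm(n), f2 = rnorm(n), f3 = rnorm(n))
d$output <- 100 - 20 * d$f2 + rnorm(n)

## "pfn": Frisch-Newton interior-point algorithm with preprocessing,
## intended for problems with very many observations.
fit <- rq(output ~ f1 * f2 * f3, tau = 0.5, data = d, method = "pfn")
coef(fit)
```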
> 
>> On Nov 16, 2014, at 1:40 AM, Yunqi Zhang <yqzhang at ucsd.edu> wrote:
>> 
>> Hi William,
>> 
>> Thank you very much for your reply.
>> 
>> I subsampled the data down to about 1.8 million records. It seems to
>> work fine except at the 99th percentile, where the p-values for all the
>> features are 1.0. Does this mean I'm subsampling too much? How should I
>> interpret this result?
>> 
>> tau: [1] 0.25
>> 
>> Coefficients:
>>              Value      Std. Error  t value     Pr(>|t|)
>> (Intercept)   72.15700    0.03651  1976.10513   0.00000
>> f1            -0.51000    0.04906   -10.39508   0.00000
>> f2           -20.44200    0.03933  -519.78766   0.00000
>> f3            -2.37000    0.04871   -48.65117   0.00000
>> f1:f2         -2.52500    0.05315   -47.50361   0.00000
>> f1:f3          1.03600    0.06573    15.76193   0.00000
>> f2:f3          3.41300    0.05247    65.05075   0.00000
>> f1:f2:f3      -0.83800    0.07120   -11.77002   0.00000
>> 
>> Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 * f3 +
>>     f2 * f3 + f1 * f2 * f3, tau = c(0.25, 0.5, 0.75, 0.9, 0.95,
>>     0.99), data = data_stats)
>> 
>> 
>> 
>> tau: [1] 0.5
>> 
>> Coefficients:
>>              Value      Std. Error  t value     Pr(>|t|)
>> (Intercept)   83.80900    0.05626  1489.61222   0.00000
>> f1            -0.92200    0.07528   -12.24692   0.00000
>> f2           -27.90700    0.05937  -470.07189   0.00000
>> f3            -6.45000    0.07204   -89.53909   0.00000
>> f1:f2         -2.66500    0.07933   -33.59275   0.00000
>> f1:f3          1.99000    0.09869    20.16440   0.00000
>> f2:f3          7.09600    0.07611    93.23813   0.00000
>> f1:f2:f3      -1.71200    0.10390   -16.47660   0.00000
>> 
>> Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 * f3 +
>>     f2 * f3 + f1 * f2 * f3, tau = c(0.25, 0.5, 0.75, 0.9, 0.95,
>>     0.99), data = data_stats)
>> 
>> 
>> 
>> tau: [1] 0.75
>> 
>> Coefficients:
>>              Value      Std. Error  t value     Pr(>|t|)
>> (Intercept)  102.71700    0.10175  1009.45946   0.00000
>> f1            -1.59300    0.13241   -12.03125   0.00000
>> f2           -40.64200    0.10623  -382.58456   0.00000
>> f3           -14.40900    0.12096  -119.11988   0.00000
>> f1:f2         -2.97600    0.13867   -21.46071   0.00000
>> f1:f3          3.74600    0.16335    22.93165   0.00000
>> f2:f3         14.14800    0.12692   111.47217   0.00000
>> f1:f2:f3      -3.16400    0.17159   -18.43899   0.00000
>> 
>> Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 * f3 +
>>     f2 * f3 + f1 * f2 * f3, tau = c(0.25, 0.5, 0.75, 0.9, 0.95,
>>     0.99), data = data_stats)
>> 
>> 
>> 
>> tau: [1] 0.9
>> 
>> Coefficients:
>>              Value      Std. Error  t value     Pr(>|t|)
>> (Intercept)  130.89400    0.20609   635.12464   0.00000
>> f1            -2.55500    0.28139    -9.07995   0.00000
>> f2           -60.90500    0.21322  -285.64558   0.00000
>> f3           -29.42300    0.23409  -125.69092   0.00000
>> f1:f2         -2.77700    0.29052    -9.55870   0.00000
>> f1:f3          7.89700    0.33308    23.70870   0.00000
>> f2:f3         27.78100    0.24338   114.14722   0.00000
>> f1:f2:f3      -6.95800    0.34491   -20.17327   0.00000
>> 
>> Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 * f3 +
>>     f2 * f3 + f1 * f2 * f3, tau = c(0.25, 0.5, 0.75, 0.9, 0.95,
>>     0.99), data = data_stats)
>> 
>> 
>> 
>> tau: [1] 0.95
>> 
>> Coefficients:
>>              Value      Std. Error  t value     Pr(>|t|)
>> (Intercept)  157.45900    0.42733   368.47413   0.00000
>> f1            -4.10200    0.55834    -7.34678   0.00000
>> f2           -81.24000    0.44012  -184.58697   0.00000
>> f3           -46.17500    0.46235   -99.87033   0.00000
>> f1:f2         -2.01700    0.57651    -3.49866   0.00047
>> f1:f3         15.67000    0.67409    23.24600   0.00000
>> f2:f3         43.00100    0.47973    89.63500   0.00000
>> f1:f2:f3     -14.05100    0.69737   -20.14843   0.00000
>> 
>> Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 * f3 +
>>     f2 * f3 + f1 * f2 * f3, tau = c(0.25, 0.5, 0.75, 0.9, 0.95,
>>     0.99), data = data_stats)
>> 
>> 
>> 
>> tau: [1] 0.99
>> 
>> Coefficients:
>>              Value          Std. Error     t value       Pr(>|t|)
>> (Intercept)   2.544860e+02   3.878303e+07   1.000000e-05  9.999900e-01
>> f1           -1.420000e+01   5.917548e+11   0.000000e+00  1.000000e+00
>> f2           -1.582920e+02   3.450261e+07   0.000000e+00  1.000000e+00
>> f3           -1.139210e+02   4.763057e+07   0.000000e+00  1.000000e+00
>> f1:f2         5.725000e+00   1.324283e+12   0.000000e+00  1.000000e+00
>> f1:f3         6.811780e+02   1.153645e+13   0.000000e+00  1.000000e+00
>> f2:f3         1.042510e+02   2.299953e+24   0.000000e+00  1.000000e+00
>> f1:f2:f3     -6.763210e+02   2.299953e+24   0.000000e+00  1.000000e+00
>> 
>> Warning message:
>> In summary.rq(xi, ...) : 288000 non-positive fis
>> 
>>> On Sat, Nov 15, 2014 at 8:19 PM, William Dunlap <wdunlap at tibco.com> wrote:
>>> 
>>> You can time it yourself on increasingly large subsets of your data.  E.g.,
>>> 
>>>> dat <- data.frame(x1=rnorm(1e6), x2=rnorm(1e6),
>>>>                   x3=sample(c("A","B","C"), size=1e6, replace=TRUE))
>>>> dat$y <- with(dat, x1 + 2*(x3=="B")*x2 + rnorm(1e6))
>>>> n <- 4^(3:10)
>>>> t <- vapply(n, FUN=function(n) {
>>>>     d <- dat[seq_len(n), ]
>>>>     print(system.time(rq(data=d, y ~ x1 + x2*x3, tau=0.9)))
>>>> }, FUN.VALUE=numeric(5))
>>>  user  system elapsed
>>>     0       0       0
>>>  user  system elapsed
>>>     0       0       0
>>>  user  system elapsed
>>>  0.02    0.00    0.01
>>>  user  system elapsed
>>>  0.01    0.00    0.02
>>>  user  system elapsed
>>>  0.10    0.00    0.11
>>>  user  system elapsed
>>>  1.09    0.00    1.10
>>>  user  system elapsed
>>> 13.05    0.02   13.07
>>>  user  system elapsed
>>> 273.30    0.11  273.74
>>>> t
>>>          [,1] [,2] [,3] [,4] [,5] [,6]  [,7]   [,8]
>>> user.self     0    0 0.02 0.01 0.10 1.09 13.05 273.30
>>> sys.self      0    0 0.00 0.00 0.00 0.00  0.02   0.11
>>> elapsed       0    0 0.01 0.02 0.11 1.10 13.07 273.74
>>> user.child   NA   NA   NA   NA   NA   NA    NA     NA
>>> sys.child    NA   NA   NA   NA   NA   NA    NA     NA
>>> 
>>> Do some regressions on t["elapsed",] as a function of n and predict up to
>>> n=10^7.  E.g.,
>>>> summary(lm(t["elapsed",] ~ poly(n,4)))
>>> 
>>> Call:
>>> lm(formula = t["elapsed", ] ~ poly(n, 4))
>>> 
>>> Residuals:
>>>          1          2          3          4          5          6          7          8
>>> -2.375e-03 -2.970e-03  4.484e-03  1.674e-03 -8.723e-04  6.096e-05 -9.199e-07  2.715e-09
>>> 
>>> Coefficients:
>>>            Estimate Std. Error  t value Pr(>|t|)
>>> (Intercept) 3.601e+01  1.261e-03 28564.33 9.46e-14 ***
>>> poly(n, 4)1 2.493e+02  3.565e-03 69917.04 6.45e-15 ***
>>> poly(n, 4)2 5.093e+01  3.565e-03 14284.61 7.57e-13 ***
>>> poly(n, 4)3 1.158e+00  3.565e-03   324.83 6.43e-08 ***
>>> poly(n, 4)4 4.392e-02  3.565e-03    12.32  0.00115 **
>>> ---
>>> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>>> 
>>> Residual standard error: 0.003565 on 3 degrees of freedom
>>> Multiple R-squared:      1,     Adjusted R-squared:      1
>>> F-statistic: 1.273e+09 on 4 and 3 DF,  p-value: 3.575e-14
>>> 
>>> 
>>> It does not look good for n=10^7.
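(One way to put a number on "does not look good" is to fit the timings on the log-log scale: the slope estimates the growth exponent b in time ~ n^b, which can then be extrapolated. The numbers below are copied from the run above; the zero timings are dropped so log() stays finite. Since the step-to-step growth rate is still increasing at the largest n, treat the extrapolation as a lower bound.)

```r
## Sample sizes and elapsed times from the run above.
n <- 4^(3:10)
elapsed <- c(0, 0, 0.01, 0.02, 0.11, 1.10, 13.07, 273.74)
keep <- elapsed > 0

## Slope of log(time) on log(n) estimates the scaling exponent b
## in time ~ n^b.
fit <- lm(log(elapsed[keep]) ~ log(n[keep]))
b <- coef(fit)

## Extrapolate under the power-law assumption to n = 18 million.
exp(b[1] + b[2] * log(1.8e7))   # rough predicted seconds, likely optimistic
```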
>>> 
>>> 
>>> 
>>> Bill Dunlap
>>> TIBCO Software
>>> wdunlap tibco.com
>>> 
>>>> On Sat, Nov 15, 2014 at 12:12 PM, Yunqi Zhang <yqzhang at ucsd.edu> wrote:
>>>> 
>>>> Hi all,
>>>> 
>>>> I'm using quantreg's rq() to perform quantile regression on a large
>>>> dataset. Each record has 4 fields and there are about 18 million
>>>> records in total. Has anyone tried rq() on a dataset this large, and
>>>> how long should I expect it to take? Or is it simply too large, so
>>>> that I should subsample the data? I would like to have an idea before
>>>> I start running and wait forever.
>>>> 
>>>> In addition, I would appreciate it if anyone could give me a rough
>>>> idea of how long rq() takes for a given dataset size.
>>>> 
>>>> Yunqi
>>>> 
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.


