[R] Biglm source code alternatives (E.g. Call to Fortran)

Tue Jan 3 06:09:22 CET 2012

Hi everyone,

I have been looking at bigglm() (the function in the biglm package that
fits generalised linear models to big data) and have profiled it. Fitting a
GLM to a 100 MB file (a 9 million row by 5 column matrix, with most values
randomly generated as 0, 1 or 2) took about 2 minutes on a Linux machine
with 8 GB of RAM and 4 cores. Ideally I would like this to run in roughly
30 to 60 seconds. The profiling output shows the following:

> summaryRprof('method2.out')
$by.self
                        self.time self.pct total.time total.pct
"model.matrix.default"      24.84     19.4      26.40      20.6
".Call"                     21.00     16.4      21.00      16.4
"as.character"              17.92     14.0      17.92      14.0
"[.data.frame"              14.04     11.0      22.54      17.6
"*"                          6.44      5.0       6.44       5.0
"update.bigqr"               5.34      4.2      15.32      12.0
"-"                          4.52      3.5       4.52       3.5
"anyDuplicated.default"      4.12      3.2       4.12       3.2
"/"                          3.76      2.9       3.76       2.9
"attr"                       3.26      2.5       3.26       2.5
"|"                          2.96      2.3       2.96       2.3
"unclass"                    2.82      2.2       2.82       2.2
"na.omit"                    2.42      1.9      17.18      13.4
"sum"                        2.02      1.6       2.02       1.6
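
Several of the entries above (model.matrix.default, [.data.frame, na.omit)
are data-frame handling steps rather than arithmetic. As a rough
illustration only, and not something taken from biglm, the sketch below
builds the same design matrix two ways for simulated all-numeric data with
no missing values: through a formula, and directly with cbind(). The row
count, the seed and the column names x1 to x4 are made up for the example,
and any timing difference will depend on the machine.

```r
# Sketch: two ways to build the same design matrix (simulated data).
set.seed(42)
n <- 1e5  # far fewer rows than the 9 million described above

# Numeric predictors taking the values 0, 1 or 2; column names are hypothetical
X <- matrix(sample(0:2, n * 4, replace = TRUE), ncol = 4,
            dimnames = list(NULL, paste0("x", 1:4)))
df <- as.data.frame(X)

# Formula route: goes through model.frame() and model.matrix.default()
t_formula <- system.time(
  mm <- model.matrix(~ x1 + x2 + x3 + x4, data = df)
)

# Direct route: prepend an intercept column to the numeric matrix
t_direct <- system.time(
  Xd <- cbind("(Intercept)" = 1, X)
)

# The numbers are the same either way; only the attributes differ
same_design <- isTRUE(all.equal(unname(mm[, ]), unname(Xd),
                                check.attributes = FALSE))
print(same_design)

# Elapsed seconds for each route
print(c(formula = t_formula[["elapsed"]], direct = t_direct[["elapsed"]]))
```

This only applies when every predictor is already numeric; factors and
interaction terms still need the formula machinery to be expanded.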

I did some further investigation, and the .Call into the compiled Fortran
code appears to be slow: it accounts for 21 seconds of self time. The call
is made from coef.bigqr.R and singcheck.bigqr.R in the biglm package. Is
there an alternative way to implement the call to Fortran? I thought matrix
inversion and QR/Cholesky decompositions were much faster in a low-level
compiled language such as Fortran, so the 21 seconds surprised me.

Furthermore, are there any other packages or approaches I could use to
speed up the as.character or model.matrix calls?

My expertise in R is very limited, but I realise R can also do parallel
computing. Is that a possible way to run a GLM on a big dataset quickly?
Alternatively, I could add more memory and cores, but that isn't really a
long-term solution, as I know I will eventually work with bigger datasets.
In fact, GLMs are such a common tool that I think many people in the R
community would benefit if they could be fitted faster on big data using
existing packages such as ff, doMC, parallel, biglm and bigmemory.

Your help would be greatly appreciated,

hardworker
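
P.S. To make the parallel idea concrete, here is a minimal sketch of a
logistic GLM fitted by iteratively reweighted least squares, where each
chunk of rows contributes its own X'WX and X'Wz and the pieces are summed.
This is not how biglm works (it updates a QR decomposition incrementally);
the sketch swaps in the normal equations because they are easy to split
across cores, at some cost in numerical stability. The data, chunk count,
core count and the helper name chunk_stats are all assumptions made for
the example.

```r
# Sketch: chunked, parallel IRLS for a logistic GLM (simulated data).
library(parallel)

set.seed(1)
n <- 20000
X <- cbind(1, matrix(sample(0:2, n * 4, replace = TRUE), ncol = 4))
beta_true <- c(-0.5, 0.3, -0.2, 0.1, 0)   # made-up coefficients
y <- rbinom(n, 1, plogis(drop(X %*% beta_true)))

# Split the row indices into contiguous chunks
n_chunks <- 8
chunks <- split(seq_len(n), cut(seq_len(n), n_chunks, labels = FALSE))

# Hypothetical helper: one chunk's contribution to the weighted
# normal equations at the current coefficient estimate
chunk_stats <- function(idx, beta) {
  Xc  <- X[idx, , drop = FALSE]
  eta <- drop(Xc %*% beta)          # linear predictor
  mu  <- plogis(eta)                # fitted probabilities
  w   <- mu * (1 - mu)              # IRLS weights for the logit link
  z   <- eta + (y[idx] - mu) / w    # working response
  list(XtWX = crossprod(Xc, Xc * w),
       XtWz = crossprod(Xc, w * z))
}

# mclapply() relies on forking, so fall back to one core on Windows
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

beta <- rep(0, ncol(X))
for (iter in 1:25) {
  parts <- mclapply(chunks, chunk_stats, beta = beta, mc.cores = n_cores)
  XtWX  <- Reduce(`+`, lapply(parts, `[[`, "XtWX"))
  XtWz  <- Reduce(`+`, lapply(parts, `[[`, "XtWz"))
  beta_new  <- drop(solve(XtWX, XtWz))
  converged <- max(abs(beta_new - beta)) < 1e-8
  beta <- beta_new
  if (converged) break
}

# Reference fit on the full data in one piece
ref <- unname(coef(glm.fit(X, y, family = binomial())))
print(cbind(chunked = beta, glm = ref))
```

Each pass needs only the small p-by-p and p-by-1 sums from every chunk, so
the chunks could equally be read from disk one at a time instead of being
held in memory together.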

--
View this message in context: http://r.789695.n4.nabble.com/Biglm-source-code-alternatives-E-g-Call-to-Fortran-tp4255774p4255774.html
Sent from the R help mailing list archive at Nabble.com.