[R] Logistic Regression with 200K features in R?

Duncan Murdoch murdoch.duncan at gmail.com
Thu Dec 12 17:09:37 CET 2013


On 12/12/2013 7:08 AM, Eik Vettorazzi wrote:
> Thanks, Duncan, for this clarification.
> A double-precision matrix with 2e11 elements (as the OP wanted) would
> need about 1.5 TB of memory, which is more than a standard (Windows
> 64-bit) computer can handle.

According to Microsoft's "Memory Limits" web page (currently at 
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366778%28v=vs.85%29.aspx#memory_limits, 
but these things tend to move around), the limit is 8 TB for virtual 
memory. (The same page lists a variety of smaller physical memory 
limits, depending on the Windows version, but R doesn't need physical 
memory; virtual is good enough.)

R would be very slow if it were working with objects bigger than physical 
memory, but it could conceivably work.
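
As a quick check of that arithmetic (a minimal sketch; the 2e11 element
count is Eik's figure above, doubles take 8 bytes each, and the terabyte
values here are binary, i.e. 2^40 bytes):

n_elements <- 2e11                # elements in the proposed dense matrix
size_tb <- n_elements * 8 / 2^40  # 8 bytes per double-precision value
size_tb                           # ~1.46 TB, the "about 1.5 TB" above
8 - size_tb                       # headroom under the 8 TB virtual limit

On Windows, memory.limit() reports (and can raise) the allocation cap in
force for the current R session, in MB.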

Duncan Murdoch
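
On Eik's point further down the thread that you can't do a regression
with more predictors than observations: glm() will still run on such
data, but the fit is rank-deficient and the surplus coefficients come
back as NA.  A minimal sketch with simulated data (the sizes n = 10 and
p = 20 are arbitrary choices for illustration):

set.seed(1)
n <- 10                                  # observations
p <- 20                                  # predictors, deliberately > n
d <- data.frame(y = rbinom(n, 1, 0.5),
                matrix(rnorm(n * p), n, p))
fit <- glm(y ~ ., data = d, family = binomial)
sum(is.na(coef(fit)))                    # coefficients glm() could not estimate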

> Cheers.
>
> On 12.12.2013 13:00, Duncan Murdoch wrote:
> > On 13-12-12 6:51 AM, Eik Vettorazzi wrote:
> >> I thought so (with all the limitations due to collinearity and so on),
> >> but actually there is a limit for the maximum size of an array which is
> >> independent of your memory size and is due to the way arrays are
> >> indexed. You can't create an object with more than 2^31-1 = 2147483647
> >> elements.
> >>
> >> https://stat.ethz.ch/pipermail/r-help/2007-June/133238.html
> >
> > That post is from 2007.  The limits were raised considerably when R
> > 3.0.0 was released; they are now 2^48 elements for disk-based
> > operations and 2^52 for working in memory.
> >
> > Duncan Murdoch
> >
> >
> >>
> >> cheers
> >>
> >> On 12.12.2013 12:34, Romeo Kienzler wrote:
> >>> OK, so 200K predictors and 10M observations would work?
> >>>
> >>>
> >>> On 12/12/2013 12:12 PM, Eik Vettorazzi wrote:
> >>>> It is simply because you can't do a regression with more
> >>>> predictors than observations.
> >>>>
> >>>> Cheers.
> >>>>
> >>>> On 12.12.2013 09:00, Romeo Kienzler wrote:
> >>>>> Dear List,
> >>>>>
> >>>>> I'm quite new to R and want to do logistic regression with a
> >>>>> 200K-feature data set (around 150 training examples).
> >>>>>
> >>>>> I'm aware that I should use Naive Bayes, but I have a more general
> >>>>> question about R's ability to handle very high-dimensional
> >>>>> data.
> >>>>>
> >>>>> Please consider the following R code where "mygenestrain.tab" is a 150
> >>>>> by 200000 matrix:
> >>>>>
> >>>>> traindata <- read.table('mygenestrain.tab');
> >>>>> mylogit <- glm(V1 ~ ., data = traindata, family = "binomial");
> >>>>>
> >>>>> When executing this code I get the following error:
> >>>>>
> >>>>> Error in terms.formula(formula, data = data) :
> >>>>>     allocMatrix: too many elements specified
> >>>>> Calls: glm ... model.frame -> model.frame.default -> terms ->
> >>>>> terms.formula
> >>>>> Execution halted
> >>>>>
> >>>>> Is this because R can't handle 200K features or am I doing something
> >>>>> completely wrong here?
> >>>>>
> >>>>> Thanks a lot for your help!
> >>>>>
> >>>>> Best regards,
> >>>>>
> >>>>> Romeo
> >>>>>
> >>>>> ______________________________________________
> >>>>> R-help at r-project.org mailing list
> >>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>>> PLEASE do read the posting guide
> >>>>> http://www.R-project.org/posting-guide.html
> >>>>> and provide commented, minimal, self-contained, reproducible code.
> >>>
> >>
> >
>
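
To make the limits discussed in the thread concrete (a minimal sketch;
attributing the allocMatrix error to the roughly p-by-p "factors"
matrix that terms.formula() builds for a formula with p main-effect
terms is an inference from Romeo's traceback, not something stated
above):

.Machine$integer.max  # 2147483647 == 2^31 - 1, the old per-object limit
p <- 2e5              # number of predictors in Romeo's example
(p + 1) * p           # ~4e10 entries in a (p + 1) x p terms matrix
2^48                  # limit for disk-based operations since R 3.0.0
2^52                  # limit for objects in memory since R 3.0.0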


