[R] Limitations and scale of R, and performance issues if and when limit reached

Sat Oct 23 17:42:58 CEST 2010

On 21.10.2010 22:20, Stratos Laskarides wrote:
>   Hi there
>
> Thank you for everyone's help in all my previous questions.
>
> By way of intro, I am a masters student in actuarial science at the
> University of Cape Town, and I am doing a project in R on some healthcare
> cost data. Just for clarity before I embark on further research may I please
> ask the following.
>
> I want to take the direction of modelling health insurance claims data with
> Tweedie compound Poisson models for over 2 million beneficiaries. I'd also
> like to work in a double GLM framework so that the dispersion parameter
> captures as much variance as possible. In addition, I'd like these results
> to somehow feed into a stochastic model application, which will form part of
> a Dynamic Financial Analysis model of a health insurer.
>
> My question is, in light of the above broad overview, how large must data
> sets be before R faces any performance problems or issues? In other words
> what "scale" can R handle?

Depends on the available memory, the kind of data and the methods you 
are going to apply.
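
As a rough illustration of the memory side of this answer, the sketch below estimates how much RAM the raw data alone might need. The row and column counts are assumptions chosen for illustration (2 million rows, 10 numeric columns), not figures from the actual data set, and fitting a model typically needs several times more than the raw data because of the model matrix and working copies.

```r
# Assumed dimensions (hypothetical): 2 million rows, 10 numeric columns.
n_rows <- 2e6
n_cols <- 10

# A double takes 8 bytes, so the raw values need roughly this many megabytes.
est_mb <- n_rows * n_cols * 8 / 1024^2

# Cross-check: measure a small sample with object.size() and scale it up.
sample_rows <- 1e4
sample_df <- as.data.frame(matrix(rnorm(sample_rows * n_cols), ncol = n_cols))
sample_mb <- as.numeric(object.size(sample_df)) / 1024^2
scaled_mb <- sample_mb * n_rows / sample_rows

print(c(estimated = est_mb, scaled_from_sample = scaled_mb))
```

Factor columns are stored as integers (4 bytes per value), so a data set made up mostly of categorical variables would come out smaller than this estimate.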

Uwe Ligges

> Thanks ever so much once again.
>
> Kind regards
> Stratos
>
> On Tue, Oct 12, 2010 at 11:31 AM, Dennis Murphy <djmuser at gmail.com> wrote:
>
>> Hi:
>>
>> On Tue, Oct 12, 2010 at 12:51 AM, Stratos Laskarides <stratlask at gmail.com> wrote:
>>
>>>   Dear Madam/Sir
>>>
>>> This may be quite a long shot...
>>>
>>> By way of intro, I am a masters student in actuarial science at the
>>> University of Cape Town, and I am doing a project in R on some healthcare
>>> cost data. During my coding in R I encountered an error message, which I
>>> then googled, but I am still unable to resolve the issue.
>>>
>>> I would like to please ask if and how it is possible to resolve the
>>> problem raised by the error message "Error: NA/NaN/Inf in foreign
>>> function call (arg 1) In addition: Warning message: step size truncated
>>> due to divergence" in R?
>>>
>>
>> That error message can arise if division by zero occurs somewhere in the
>> computation. Try using ftable() or some related function that will print
>> out your complete table (4-way?) and check whether you have zero frequency
>> in one or more cells. If there are zero frequencies, that does not
>> necessarily explain the problem, but it's a reasonable initial hypothesis.
>> Merging some categories to get enough frequencies per cell may be useful
>> if you do have zero frequencies, and then try the fit again to see if you
>> get more sensible results.
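
The zero-frequency check suggested above can be sketched as follows. The data frame `claims` and its column names (`AgeBand`, `Gender`, `Region`) are hypothetical stand-ins simulated for illustration; the real data would be used in their place.

```r
set.seed(1)

# Hypothetical claims data: 200 rows spread over 15 x 2 x 5 categories.
claims <- data.frame(
  AgeBand = factor(sample(1:15, 200, replace = TRUE), levels = 1:15),
  Gender  = factor(sample(c("F", "M"), 200, replace = TRUE)),
  Region  = factor(sample(1:5, 200, replace = TRUE), levels = 1:5)
)

# Cross-tabulate all three factors and print the result as a flat table.
tab <- xtabs(~ AgeBand + Gender + Region, data = claims)
print(ftable(tab))

# List the factor combinations that have no observations at all.
empty <- subset(as.data.frame(tab), Freq == 0)
print(empty)
```

Any combinations listed in `empty` are candidates for merging with neighbouring categories before refitting.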
>>
>> When the error is thrown, it can be useful to do
>> traceback()
>>
>> as it recalls the sequence of function calls that led up to the error, but
>> it helps to have enough R experience to make heads or tails of the output :)
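
In an interactive session, calling traceback() right after the error is enough. For a script run non-interactively, a similar call stack can be captured by hand. The sketch below is one possible way to do that; the functions `f` and `g` are toy stand-ins invented for illustration, not part of any model-fitting code.

```r
# Toy functions (hypothetical) that fail two levels deep.
f <- function(x) g(x)
g <- function(x) stop("NA/NaN/Inf encountered")

calls <- NULL
res <- tryCatch(
  # The calling handler runs before the stack unwinds, so sys.calls()
  # still holds the chain of calls that led to the error.
  withCallingHandlers(f(1), error = function(e) calls <<- sys.calls()),
  # The exiting handler then returns the error message as a value.
  error = function(e) conditionMessage(e)
)

print(res)
print(calls)
```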
>>
>>>
>>> As for some background on my specific data and research problem at hand, I
>>> am fitting a gamma regression model to 13 000 lines of insurance claims
>>> data, which will be regressed against categorical variables such as Age
>>> Band, Gender, and Region.
>>>
>>
>> The more variables you have in the model, the greater the number of cell
>> combinations. A 15 x 2 x 5 combination of your three variables, for
>> example, would generate 150 combinations of the three variables, and it's
>> entirely possible for a few of those combinations to have small or zero
>> frequencies. In addition, adding a new variable to the model would at
>> least double the number of cells, spreading/thinning out the data even
>> more.
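
The cell-count arithmetic above can be checked in a few lines. The level counts and the 13 000 rows come from the thread; the extra two-level variable is a hypothetical addition for illustration.

```r
# Levels per factor, as described in the thread.
levels_per_factor <- c(AgeBand = 15, Gender = 2, Region = 5)

# Number of distinct factor combinations (cells): 15 * 2 * 5.
n_cells <- prod(levels_per_factor)

# Average observations per cell for 13 000 rows, if spread evenly.
n_obs <- 13000
avg_per_cell <- n_obs / n_cells

# Adding one more variable with only two levels doubles the cell count.
n_cells_extra <- n_cells * 2
avg_per_cell_extra <- n_obs / n_cells_extra

print(c(cells = n_cells, avg = avg_per_cell,
        cells_extra = n_cells_extra, avg_extra = avg_per_cell_extra))
```

Real claims data are rarely spread evenly, so some cells will hold far fewer observations than these averages suggest.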
>>
>>>
>>> Perhaps my problem arises because the data set is too large and the
>>> iteratively reweighted least squares algorithm therefore cannot converge,
>>> in which case I perhaps need another GLM type. Or maybe the categorical
>>> explanatory variables can take on too many values (e.g. there are 15 Age
>>> Bands, 5 Regions).
>>>
>>
>> If your response is continuous and positive valued with a right skewed
>> distribution, then a Gamma model would appear to be sensible.
>>
>> The data set is not too large; successful GLMs have been fit with much
>> larger data sets. Your second hypothesis sounds more plausible, though.
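
For reference, a Gamma regression of the kind discussed can be sketched on simulated data as below. The data, the column names, and the choice of a log link are all assumptions made for illustration. The thread does not say which link was used, and the log link is shown only because it is a commonly tried alternative to the default inverse link, not as a confirmed fix for the error.

```r
set.seed(42)
n <- 1000

# Simulated (hypothetical) predictors.
sim <- data.frame(
  AgeBand = factor(sample(1:5, n, replace = TRUE)),
  Gender  = factor(sample(c("F", "M"), n, replace = TRUE))
)

# Simulated right-skewed positive costs with a log-linear mean.
mu <- exp(6 + 0.1 * as.integer(sim$AgeBand) + 0.2 * (sim$Gender == "M"))
sim$Cost <- rgamma(n, shape = 2, rate = 2 / mu)

# Gamma GLM with a log link, which keeps fitted means positive.
fit <- glm(Cost ~ AgeBand + Gender, family = Gamma(link = "log"), data = sim)

print(fit$converged)
print(summary(fit)$coefficients)
```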
>>
>> HTH,
>> Dennis
>>
>>>
>>> Any insights you could provide would be much appreciated.
>>>
>>> Thank you ever so much.
>>>
>>> Kind regards
>>> Stratos Laskarides
>>> South Africa
>>>
>>>         [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>>