[R] R 2.10.0: Error in gsub/calloc

Richard R. Liu richard.liu at pueo-owl.ch
Fri Nov 6 07:43:05 CET 2009


Bert,

Thanks for the tip.  Yes, strsplit works, and works fast!  For me,
white-space tokenization means splitting at the whitespace itself, so
the "^" negation should be omitted -- i.e., split with "[[:space:]]+"
(the outer square brackets of the class must stay).
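A minimal sketch of that split (the vector d here is toy data standing in for the real sentence vector from the thread):

```r
## Whitespace tokenization via strsplit: split at runs of whitespace.
d <- c("This is a sentence.", "xx  xdfg; *&^%kk    ")
tokens <- strsplit(d, "[[:space:]]+")
## Leading whitespace would produce an empty first element, so drop
## empty strings defensively.
tokens <- lapply(tokens, function(t) t[nzchar(t)])
```

Note that strsplit already discards an empty trailing piece, so only a possible leading empty string needs filtering.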

Regards ... from Basel to South San Francisco,
Richard

On Nov 3, 2009, at 22:03 , Bert Gunter wrote:

> Try:
>
> tokens <- strsplit(d,"[^[:space:]]+")
>
> This splits each "sentence" in your vector into a vector of groups of
> whitespace characters that you can then play with as you described, I
> think.  (The result is a list of such vectors -- see strsplit().)
>
> ## example:
>
>> x <- "xx  xdfg; *&^%kk    "
>
>> strsplit(x,"[^[:blank:]]+")
> [[1]]
> [1] ""     "  "   " "    "    "
>
>
> You might have to use perl = TRUE and "\\w+" depending on your locale
> and what "[:space:]" does there.
>
> If this works, it should be way faster than strapply() and should  
> not have
> any memory allocation issues either.
>
> HTH.
>
> Bert Gunter
> Genentech Nonclinical Biostatistics
>
>
>
> -----Original Message-----
> From: r-help-bounces at r-project.org On Behalf Of Richard R. Liu
> Sent: Tuesday, November 03, 2009 11:32 AM
> To: Uwe Ligges
> Cc: r-help at r-project.org
> Subject: Re: [R] R 2.10.0: Error in gsub/calloc
>
> I apologize for not being clear.  d is a character vector of length
> 158908.  Each element in the vector has been designated by sentDetect
> (package: openNLP) as a sentence.  Some of these are really
> sentences.  Others are merely groups of meaningless characters
> separated by white space.  strapply is a function in the package
> gsubfn.  It applies to each element of the first argument the regular
> expression (second argument).  Every match is then sent to the
> designated function (third argument, in my case missing, hence the
> identity function).  Thus, with strapply I am simply performing a
> white-space tokenization of each sentence.  I am doing this in the
> hope of being able to distinguish true sentences from false ones on
> the basis of mean length of token, maximum length of token, or  
> similar.
>
> Richard R. Liu
> Dittingerstr. 33
> CH-4053 Basel
> Switzerland
>
> Tel.:  +41 61 331 10 47
> Email:  richard.liu at pueo-owl.ch
>
>
> On Nov 3, 2009, at 18:30 , Uwe Ligges wrote:
>
>>
>>
>> richard.liu at pueo-owl.ch wrote:
>>> I'm running R 2.10.0 under Mac OS X 10.5.8; however, I don't think
>>> this
>>> is a Mac-specific problem.
>>> I have a very large (158,908 possible sentences, ca. 58 MB) plain
>>> text
>>> document d which I am
>>> trying to tokenize:  t <- strapply(d, "\\w+", perl = T).  I am
>>> encountering the following error:
>>
>>
>> What is strapply() and what is d?
>>
>> Uwe Ligges
>>
>>
>>
>>
>>> Error in base::gsub(pattern, rs, x, ...) :
>>> Calloc could not allocate (-1398215180 of 1) memory
>>> This happens regardless of whether I run in 32- or 64-bit mode.  The
>>> machine has 8 GB of RAM, so
>>> I can hardly believe that RAM is a problem.
>>> Thanks,
>>> Richard
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>
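The token-length screening described upthread (distinguishing true sentences from noise by mean or maximum token length) could be sketched as follows; the sample vector is hypothetical, standing in for the output of sentDetect:

```r
## Per-"sentence" token-length statistics, as a screening heuristic:
## real sentences tend to have short, regular tokens; noise does not.
d <- c("The cat sat on the mat.", "x%&** zz@@@@@@@@@@ qqqqqqqqqqqq")
tokens <- strsplit(d, "[[:space:]]+")

mean_len <- sapply(tokens, function(t) mean(nchar(t)))  # mean token length
max_len  <- sapply(tokens, function(t) max(nchar(t)))   # longest token
```

One could then threshold on these statistics (e.g. flag elements with an unusually large max_len) to filter out the non-sentences.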


