[R] R 2.10.0: Error in gsub/calloc

Richard R. Liu richard.liu at pueo-owl.ch
Fri Nov 6 16:01:02 CET 2009


Gabor,

What about the error message that I got with strapply?  That seemed to be the
same kind of problem (i.e., integer overflow of index) as with gsub.

Regards,
Richard
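
The negative size in the Calloc message quoted further down is consistent with a 32-bit signed integer wrapping around. As a rough check, assuming the requested byte count was held in a 32-bit int (an assumption, not something confirmed in this thread), adding 2^32 recovers the size that would have been requested:

```r
# Value reported in the error: Calloc could not allocate (-1398215180 of 1) memory
reported <- -1398215180

# Assumption: the true size wrapped around a 32-bit signed integer,
# so adding 2^32 undoes the wrap (computed in double precision).
requested <- reported + 2^32

requested                         # 2896752116 bytes
requested > .Machine$integer.max  # TRUE: too large for a 32-bit signed int
requested / 2^30                  # the same size expressed in GiB
```

Under that assumption the request was for roughly 2.7 GiB, which would fit in 8 GB of RAM but not in a 32-bit signed integer.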

On Fri, 6 Nov 2009 08:00:06 -0500, Gabor Grothendieck wrote
> Note that strapply without perl = TRUE runs an order of magnitude
> faster than with perl = TRUE and accepts nearly the same set of regular
> expressions anyway, since its default is Tcl regular expressions.
> strsplit should still be fastest where it applies since splitting is
> its only purpose.
> 
> On Fri, Nov 6, 2009 at 1:43 AM, Richard R. Liu
> <richard.liu at pueo-owl.ch> wrote:
> > Bert,
> >
> > Thanks for the tip.  Yes, strsplit works, and works fast!  For me,
> > white-space tokenization means splitting at the white spaces, so the "^" and
> > the outermost square brackets should/can be omitted.
> >
> > Regards ... from Basel to South San Francisco,
> > Richard
> >
> > On Nov 3, 2009, at 22:03 , Bert Gunter wrote:
> >
> >> Try:
> >>
> >> tokens <- strsplit(d,"[^[:space:]]+")
> >>
> >> This splits each "sentence" in your vector into a vector of groups of
> >> whitespace characters that you can then play with as you described, I think.
> >> (The result is a list of such vectors -- see strsplit().)
> >>
> >> ## example:
> >>
> >> > x <- "xx  xdfg; *&^%kk    "
> >> > strsplit(x,"[^[:blank:]]+")
> >> [[1]]
> >> [1] ""     "  "   " "    "    "
> >>
> >>
> >> You might have to use perl = TRUE and "\\w+" depending on your locale and
> >> what "[:space:]" does there.
> >>
> >> If this works, it should be way faster than strapply() and should not have
> >> any memory allocation issues either.
> >>
> >> HTH.
> >>
> >> Bert Gunter
> >> Genentech Nonclinical Biostatistics
> >>
> >>
> >>
> >> -----Original Message-----
> >> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
> >> On Behalf Of Richard R. Liu
> >> Sent: Tuesday, November 03, 2009 11:32 AM
> >> To: Uwe Ligges
> >> Cc: r-help at r-project.org
> >> Subject: Re: [R] R 2.10.0: Error in gsub/calloc
> >>
> >> I apologize for not being clear.  d is a character vector of length
> >> 158908.  Each element in the vector has been designated by sentDetect
> >> (package: openNLP) as a sentence.  Some of these are really
> >> sentences.  Others are merely groups of meaningless characters
> >> separated by white space.  strapply is a function in the package
> >> gsubfn.  It applies the regular expression (second argument) to each
> >> element of the first argument.  Every match is then sent to the
> >> designated function (third argument, in my case missing, hence the
> >> identity function).  Thus, with strapply I am simply performing a
> >> white-space tokenization of each sentence.  I am doing this in the
> >> hope of being able to distinguish true sentences from false ones on
> >> the basis of mean length of token, maximum length of token, or similar.
> >>
> >> Richard R. Liu
> >> Dittingerstr. 33
> >> CH-4053 Basel
> >> Switzerland
> >>
> >> Tel.:  +41 61 331 10 47
> >> Email:  richard.liu at pueo-owl.ch
> >>
> >>
> >> On Nov 3, 2009, at 18:30 , Uwe Ligges wrote:
> >>
> >>>
> >>>
> >>> richard.liu at pueo-owl.ch wrote:
> >>>>
> >>>> I'm running R 2.10.0 under Mac OS X 10.5.8; however, I don't think
> >>>> this is a Mac-specific problem.
> >>>> I have a very large (158,908 possible sentences, ca. 58 MB) plain
> >>>> text document d which I am
> >>>> trying to tokenize:  t <- strapply(d, "\\w+", perl = T).  I am
> >>>> encountering the following error:
> >>>
> >>>
> >>> What is strapply() and what is d?
> >>>
> >>> Uwe Ligges
> >>>
> >>>
> >>>
> >>>
> >>>> Error in base::gsub(pattern, rs, x, ...) :
> >>>> Calloc could not allocate (-1398215180 of 1) memory
> >>>> This happens regardless of whether I run in 32- or 64-bit mode.  The
> >>>> machine has 8 GB of RAM, so
> >>>> I can hardly believe that RAM is a problem.
> >>>> Thanks,
> >>>> Richard
> >>>> ______________________________________________
> >>>> R-help at r-project.org mailing list
> >>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >>>> and provide commented, minimal, self-contained, reproducible code.
> >>
> >
> >

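The whitespace tokenization discussed in this thread, with the "^" and the outer brackets dropped from the pattern as noted above, can be sketched in base R as follows. This is only an illustration: the sample vector d and the helper token_stats are hypothetical stand-ins, not the poster's data or code, and the statistics are simply the mean and maximum token length mentioned in the thread.

```r
# Hypothetical sample data standing in for the character vector d from the thread.
d <- c("The quick brown fox jumps.", "x q  z%  k ;", "")

# Split at runs of whitespace; the result is a list with one
# character vector of tokens per element of d.
tokens <- strsplit(d, "[[:space:]]+")

# Drop empty strings, which appear when an element starts with whitespace.
tokens <- lapply(tokens, function(tk) tk[nzchar(tk)])

# Hypothetical helper: per-element token count, mean length and maximum length.
# Elements without any tokens get NA for the two length statistics.
token_stats <- function(tk) {
  len <- nchar(tk)
  if (length(len) == 0) return(c(n = 0, mean_len = NA, max_len = NA))
  c(n = length(len), mean_len = mean(len), max_len = max(len))
}

# One row per element of d, one column per statistic.
stats <- t(sapply(tokens, token_stats))
stats
```

Rows with a small mean or maximum token length could then be inspected as candidates for "false" sentences; any actual threshold would have to be chosen from the real data.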

--
Richard R. Liu
Dittingerstr. 33
CH-4053 Basel
Switzerland

Tel.:  +41 61 331 10 47
Email:  richard.liu at pueo-owl.ch



