[R] Partition vector of strings into lines of preferred width

Sat Oct 29 00:07:42 CEST 2022

Dear Andrew,

Thank you for the fast reply. I had forgotten about strwrap, though my 
problem is slightly different.

I do have the actual vector. Of course, I could first join the strings, 
but this then involves both a join and a split (done by strwrap). Maybe 
it's possible to avoid the join and the split. My 2nd approach may also 
be fine, but I have not tested it thoroughly (and I may be missing an 
existing solution).
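
For reference, the join-then-wrap route would look roughly like the 
sketch below. The vector "words" is made-up sample data standing in for 
the actual input, and width = 30 is just the value used in the quoted 
example.

```r
# Hypothetical sample data standing in for the actual vector of strings.
words <- c("What", "is", "the", "best", "way", "to", "split", "a",
           "vector", "of", "strings", "into", "lines", "of",
           "preferred", "width?")

# Join once, then let strwrap() do the splitting and the line filling.
joined <- paste(words, collapse = " ")
lines <- strwrap(joined, width = 30)

writeLines(lines)
```

Note that strwrap() keeps each line shorter than width, so width = 30 
gives lines of at most 29 characters.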

Sincerely,

Leonard

On 10/29/2022 12:51 AM, Andrew Simmons wrote:
> I would suggest using strwrap(), the documentation at ?strwrap has
> plenty of details and examples.
> For paragraphs, I would usually do something like:
>
> strwrap(x = , width = 80, indent = 4)
>
> On Fri, Oct 28, 2022 at 5:42 PM Leonard Mada via R-help
> <r-help using r-project.org> wrote:
>> Dear R-Users,
>>
>> text = "
>> What is the best way to split/cut a vector of strings into lines of
>> preferred width?
>> I have come up with a simple solution, albeit naive, as it involves many
>> arithmetic divisions.
>> I have an alternative idea which avoids this problem.
>> But I may miss some existing functionality!"
>>
>> # Long vector of strings:
>> str = strsplit(text, " |(?<=\n)", perl=TRUE)[[1]];
>> lenWords = nchar(str);
>>
>> # simple, but naive solution:
>> # - it involves many divisions;
>> cut.character.int = function(n, w) {
>>       ncm = cumsum(n);
>>       nwd = ncm %/% w;
>>       count = rle(nwd)$lengths;
>>       pos = cumsum(count);
>>       posS = pos[ - length(pos)] + 1;
>>       posS = c(1, posS);
>>       pos = rbind(posS, pos);
>>       return(pos);
>> }
>>
>> npos = cut.character.int(lenWords, w=30);
>> # let's print the results;
>> for(id in seq(ncol(npos))) {
>>      len = npos[2, id] - npos[1, id];
>>      # pass the separators via sep, so each line ends in a single "\n";
>>      cat(str[seq(npos[1, id], npos[2, id])], sep = c(rep(" ", len), "\n"));
>> }
>>
>>
>> The first solution performs an arithmetic division on all string
>> lengths. It is possible to find out the total length and divide only the
>> last element of the cumsum. Something like this should work (although it
>> is not properly tested).
>>
>>
>> w = 30;
>> cumlen = cumsum(lenWords);
>> max = tail(cumlen, 1) %/% w + 1;
>> # right = FALSE gives [a, b) bins, matching the %/% in the first solution;
>> pos = cut(cumlen, seq(0, max) * w, right = FALSE);
>> count = rle(as.numeric(pos))$lengths;
>> # everything else is the same;
>> pos = cumsum(count);
>> posS = pos[ - length(pos)] + 1;
>> posS = c(1, posS);
>> pos = rbind(posS, pos);
>>
>> npos = pos; # then print
>>
>>
>> The cut() call could be optimized as well, since the cumsum is sorted in
>> ascending order. I have not evaluated the efficiency of the code either.
>>
>> But am I missing some existing functionality?
>>
>>
>> Note:
>>
>> - technically, the cut() function should probably return a vector of
>> indices (something like: rep(seq_along(count), count)), but it was more
>> practical to have both the start and end positions.
>>
>>
>> Many thanks,
>>
>>
>> Leonard
>>
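
On the note in the quoted message about returning a vector of indices: 
a minimal sketch of that variant, assuming the same cumsum/%/% binning 
as in the first solution. The vector "words" is again made-up sample 
data, and "lineId" is a name introduced here only for illustration.

```r
# Hypothetical sample data; in the thread this would be the actual vector.
words <- c("What", "is", "the", "best", "way", "to", "split", "a",
           "vector", "of", "strings", "into", "lines", "of",
           "preferred", "width?")
w <- 30  # preferred width, as in the quoted example

# Same binning as in the quoted code: cumulative length divided by w.
# This counts only the characters of the words, not the separating
# spaces, so the printed lines can end up longer than w.
bin <- cumsum(nchar(words)) %/% w

# Run lengths of the bins, expanded into one line index per word.
count <- rle(bin)$lengths
lineId <- rep(seq_along(count), count)

# Group the words by line index and paste each group into one line.
lines <- vapply(split(words, lineId), paste, character(1), collapse = " ")

writeLines(lines)
```

The index vector plugs directly into split(), which avoids building the 
start/end matrix, and no join of the full text is needed.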