[R] Partition vector of strings into lines of preferred width

Leonard Mada
Fri Oct 28 23:42:13 CEST 2022


Dear R-Users,

text = "
What is the best way to split/cut a vector of strings into lines of 
preferred width?
I have come up with a simple solution, albeit naive, as it involves many 
arithmetic divisions.
I have an alternative idea which avoids this problem.
But I may miss some existing functionality!"

# Long vector of strings:
str = strsplit(text, " |(?<=\n)", perl=TRUE)[[1]];
lenWords = nchar(str);

# simple, but naive solution:
# - it involves many divisions;
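cut.character.int = function(n, w) {
     # n = lengths of the words; w = preferred line width;
     ncm = cumsum(n);
     # bucket (line number, starting at 0) of each word;
     nwd = ncm %/% w;
     # number of words in each bucket;
     count = rle(nwd)$lengths;
     # end position of each line;
     pos = cumsum(count);
     # start position of each line;
     posS = pos[ - length(pos)] + 1;
     posS = c(1, posS);
     # row 1 = start positions, row 2 = end positions;
     pos = rbind(posS, pos);
     return(pos);
}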

npos = cut.character.int(lenWords, w=30);
# let's print the results;
for(id in seq_len(ncol(npos))) {
    # words of one line, separated by spaces, then a line break;
    cat(str[seq(npos[1, id], npos[2, id])], "\n");
}
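
To make the intermediate steps easier to follow, here is a small worked example. The word lengths and the width are hypothetical values chosen only for illustration; the steps are the same as in the function above.

```r
# Hypothetical word lengths and width, for illustration only.
n = c(4, 2, 3, 5, 1);
w = 6;

ncm = cumsum(n);                      # 4 6 9 14 15
nwd = ncm %/% w;                      # 0 1 1 2 2  (bucket of each word)
count = rle(nwd)$lengths;             # 1 2 2      (words per line)
end = cumsum(count);                  # 1 3 5      (last word of each line)
start = c(1, end[-length(end)] + 1);  # 1 2 4      (first word of each line)
npos = rbind(start, end);
```

Note that the buckets are formed on the cumulative length, so they fix where each line ends rather than how long it is, and the separating spaces are not counted.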


The first solution performs an integer division on every element of the 
cumulative sum of the string lengths. It is possible instead to take the 
total length, divide only this last element of the cumsum, and assign the 
words to the resulting intervals. Something like this should work 
(although it is not properly tested).


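w = 30;
cumlen = cumsum(lenWords);
# number of buckets of width w needed to cover the total length;
nBins = tail(cumlen, 1) %/% w + 1;
# right = FALSE gives intervals [k*w, (k+1)*w), matching ncm %/% w above;
pos = cut(cumlen, seq(0, nBins) * w, right = FALSE, labels = FALSE);
count = rle(pos)$lengths;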
# everything else is the same;
pos = cumsum(count);
posS = pos[ - length(pos)] + 1;
posS = c(1, posS);
pos = rbind(posS, pos);

npos = pos; # then print


The call to cut() could probably be optimized as well, since the cumsum 
is sorted in ascending order. I have not measured the efficiency of the 
code, either.
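
One possible direction, as a sketch only: base R has findInterval(), which works on sorted break points and, as far as I know, is documented as being faster when the input is (almost) sorted. The word lengths, the width and the name nBins below are hypothetical, and I have not benchmarked this against cut().

```r
# Hypothetical word lengths and width, for illustration only.
lenWords = c(4, 2, 3, 5, 1);
w = 6;

cumlen = cumsum(lenWords);
nBins = tail(cumlen, 1) %/% w + 1;
# findInterval() needs sorted breaks; it returns i such that
# breaks[i] <= x < breaks[i + 1], i.e. left-closed intervals.
bin = findInterval(cumlen, seq(0, nBins) * w);
count = rle(bin)$lengths;

# Same buckets as the integer division, shifted by one:
stopifnot(all(bin - 1 == cumlen %/% w));
```

The remaining steps (cumsum of count, start and end positions) would stay the same.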

But am I missing some existing functionality?


Note:

- technically, the cut function (cut.character.int) should probably 
return a vector of line indices (something like: rep(seq_along(count), 
count)), but it was more practical to return both the start and the end 
position of each line.
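
For comparison, a minimal sketch of the index-vector variant mentioned in the note. The words and the per-line counts are hypothetical values; in the code above, count would come from rle().

```r
# Hypothetical words and per-line word counts, for illustration only.
words = c("What", "is", "the", "best", "way");
count = c(1, 2, 2);

# line index of each word: 1 2 2 3 3
lineId = rep(seq_along(count), count);
# group the words by line and paste each group into one string
lines = vapply(split(words, lineId), paste, character(1), collapse = " ");
```

With such an index vector the lines can be assembled with split(), without any explicit start and end positions.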


Many thanks,


Leonard


