[R] sorting variable names containing digits

Mon Dec 22 03:57:32 CET 2008

Dear Gabor,

Thanks for this -- I was unaware of mixedsort(). As you point out,
however, mixedsort() doesn't cover all of the cases in which I'm
interested and which are handled by mysort().
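
For anyone who wants to reproduce the comparison, here is a minimal sketch, assuming the gtools package is installed. The vectors s and t are the ones from my original message, quoted below. The printed results are left out because they may depend on the gtools version.

```r
# Example vectors from the original message.
s <- c("x1b", "x1a", "x02b", "x02a", "x02", "y1a1", "y10a2",
       "y10a10", "y10a1", "y2", "var10a2", "var2", "y10")
t <- c("q10.1.1", "q10.2.1", "q2.1.1", "q10.10.2")

# Run the comparison only if gtools is available
# (its presence is an assumption of this sketch).
if (requireNamespace("gtools", quietly = TRUE)) {
  print(gtools::mixedsort(s))
  print(gtools::mixedsort(t))
}
```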

Regards,
 John

On Sun, 21 Dec 2008 20:51:17 -0500
 "Gabor Grothendieck" <ggrothendieck at gmail.com> wrote:
> mixedsort in gtools will give the same result as mysort(s) but
> differs in the case of t.
> 
> On Sun, Dec 21, 2008 at 8:33 PM, John Fox <jfox at mcmaster.ca> wrote:
> > Dear r-helpers,
> >
> > I'm looking for a way of sorting variable names in a "natural" order, when
> > the names are composed of digits and other characters. I know that this is a
> > vague idea, and that sorting character strings is a complex topic, but
> > perhaps a couple of examples will clarify what I mean:
> >
> >> s <- c("x1b", "x1a", "x02b", "x02a", "x02", "y1a1", "y10a2",
> > +   "y10a10", "y10a1", "y2", "var10a2", "var2", "y10")
> >
> >> sort(s)
> >  [1] "var10a2" "var2"    "x02"     "x02a"    "x02b"    "x1a"
> >  [7] "x1b"     "y10"     "y10a1"   "y10a10"  "y10a2"   "y1a1"
> > [13] "y2"
> >
> >> mysort(s)
> >  [1] "var2"    "var10a2" "x1a"     "x1b"     "x02"     "x02a"
> >  [7] "x02b"    "y1a1"    "y2"      "y10"     "y10a1"   "y10a2"
> > [13] "y10a10"
> >
> >> t <- c("q10.1.1", "q10.2.1", "q2.1.1", "q10.10.2")
> >
> >> sort(t)
> > [1] "q10.1.1"  "q10.10.2" "q10.2.1"  "q2.1.1"
> >
> >> mysort(t)
> > [1] "q2.1.1"   "q10.1.1"  "q10.2.1"  "q10.10.2"
> >
> > Here, sort() is the standard R function and mysort() is a replacement, which
> > sorts the names into the order that seems natural to me, at least in the
> > cases that I've tried:
> >
> > mysort <- function(x){
> >  sort.helper <- function(x){
> >    prefix <- strsplit(x, "[0-9]")
> >    prefix <- sapply(prefix, "[", 1)
> >    prefix[is.na(prefix)] <- ""
> >    # "+" treats a multi-character prefix such as "var" as one separator
> >    suffix <- strsplit(x, "[^0-9]+")
> >    suffix <- as.numeric(sapply(suffix, "[", 2))
> >    suffix[is.na(suffix)] <- -Inf
> >    remainder <- sub("[^0-9]+", "", x)
> >    remainder <- sub("[0-9]+", "", remainder)
> >    if (all (remainder == "")) list(prefix, suffix)
> >    else c(list(prefix, suffix), Recall(remainder))
> >    }
> >  ord <- do.call("order", sort.helper(x))
> >  x[ord]
> >   }
> >
> > I have a couple of applications in mind, one of which is recognizing
> > repeated-measures variables in "wide" longitudinal datasets, which often are
> > named in the form x1, x2, ..., xn.
> >
> > mysort(), which works by recursively slicing off pairs of non-digit and
> > digit strings, seems more complicated than it should have to be, and I
> > wonder whether anyone has a more elegant solution. I don't think that
> > efficiency is a serious issue for the applications I'm considering, but of
> > course a more efficient solution would be of interest.
> >
> > Thanks,
> >  John
> >
> > ------------------------------
> > John Fox, Professor
> > Department of Sociology
> > McMaster University
> > Hamilton, Ontario, Canada
> > web: socserv.mcmaster.ca/jfox
> >
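
As one possible simplification of the approach quoted above, the following is only a sketch and not a tested replacement for mysort(). It splits each name into alternating runs of non-digits and digits in a single pass, then orders on those runs column by column, so no recursion is needed. The names natural_order() and natural_sort() are hypothetical. Using method = "radix" is my own assumption: it compares the text parts in the C locale, which may differ from the locale-dependent ordering that order() uses by default. NA values in the input are not handled.

```r
natural_order <- function(x) {
  x <- as.character(x)
  # Split each name into maximal runs of digits or non-digits,
  # e.g. "y10a2" -> "y", "10", "a", "2".
  tokens <- regmatches(x, gregexpr("[0-9]+|[^0-9]+", x))
  # Keep text runs at odd positions and digit runs at even positions by
  # prepending an empty text run to names that start with a digit.
  tokens <- lapply(tokens, function(tk) {
    if (length(tk) > 0 && grepl("^[0-9]", tk[1])) c("", tk) else tk
  })
  n <- max(lengths(tokens), 0L)
  if (n == 0L) return(seq_along(x))
  # Build one sort key per token position.
  keys <- lapply(seq_len(n), function(i) {
    col <- vapply(tokens, function(tk) tk[i], character(1))
    if (i %% 2 == 1) {
      col[is.na(col)] <- ""      # a missing text run sorts first
      col
    } else {
      num <- as.numeric(col)     # "02" and "2" compare as equal numbers
      num[is.na(num)] <- -Inf    # a missing number sorts first
      num
    }
  })
  # method = "radix" compares text runs in the C locale (an assumption).
  do.call(order, c(keys, list(method = "radix")))
}

natural_sort <- function(x) x[natural_order(x)]

s <- c("x1b", "x1a", "x02b", "x02a", "x02", "y1a1", "y10a2",
       "y10a10", "y10a1", "y2", "var10a2", "var2", "y10")
t <- c("q10.1.1", "q10.2.1", "q2.1.1", "q10.10.2")

natural_sort(s)
natural_sort(t)
```

On these two vectors the sketch should give the same order as the mysort(s) and mysort(t) output quoted above. Returning the permutation from natural_order() rather than the sorted names also lets it reorder the columns of a data frame.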

--------------------------------
John Fox, Professor
Department of Sociology
McMaster University
Hamilton, Ontario, Canada
http://socserv.mcmaster.ca/jfox/