[R] sorting variable names containing digits

John Fox jfox at mcmaster.ca
Mon Dec 22 02:33:59 CET 2008


Dear r-helpers,

I'm looking for a way of sorting variable names in a "natural" order, when
the names are composed of digits and other characters. I know that this is a
vague idea, and that sorting character strings is a complex topic, but
perhaps a couple of examples will clarify what I mean:

> s <- c("x1b", "x1a", "x02b", "x02a", "x02", "y1a1", "y10a2", 
+   "y10a10", "y10a1", "y2", "var10a2", "var2", "y10")

> sort(s)
 [1] "var10a2" "var2"    "x02"     "x02a"    "x02b"    "x1a"    
 [7] "x1b"     "y10"     "y10a1"   "y10a10"  "y10a2"   "y1a1"   
[13] "y2"
     
> mysort(s)
 [1] "var2"    "var10a2" "x1a"     "x1b"     "x02"     "x02a"   
 [7] "x02b"    "y1a1"    "y2"      "y10"     "y10a1"   "y10a2"  
[13] "y10a10" 
   
> t <- c("q10.1.1", "q10.2.1", "q2.1.1", "q10.10.2")

> sort(t)
[1] "q10.1.1"  "q10.10.2" "q10.2.1"  "q2.1.1" 
 
> mysort(t)
[1] "q2.1.1"   "q10.1.1"  "q10.2.1"  "q10.10.2"

Here, sort() is the standard R function and mysort() is a replacement, which
sorts the names into the order that seems natural to me, at least in the
cases that I've tried:

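mysort <- function(x){
  # Build a list of sort keys: (prefix, number) pairs, taken from left to right
  sort.helper <- function(x){
    # leading run of non-digits ("" if there is none)
    prefix <- strsplit(x, "[0-9]")
    prefix <- sapply(prefix, "[", 1)
    prefix[is.na(prefix)] <- ""
    # first run of digits as a number; the "+" keeps a multi-character
    # prefix such as "var" from yielding an empty second piece
    suffix <- strsplit(x, "[^0-9]+")
    suffix <- as.numeric(sapply(suffix, "[", 2))
    suffix[is.na(suffix)] <- -Inf
    # strip the leading prefix and number, then recurse on what is left
    remainder <- sub("[^0-9]+", "", x)
    remainder <- sub("[0-9]+", "", remainder)
    if (all(remainder == "")) list(prefix, suffix)
    else c(list(prefix, suffix), Recall(remainder))
  }
  ord <- do.call("order", sort.helper(x))
  x[ord]
}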

I have a couple of applications in mind, one of which is recognizing
repeated-measures variables in "wide" longitudinal datasets, which often are
named in the form x1, x2, ... , xn.
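As a rough illustration of that first application, here is a minimal sketch that groups names of the form stem-plus-digits by their stem and orders each group by its numeric index. The variable names in vars are made up for the example, and is_rm, stem, occasion and groups are illustrative names only; real datasets may well need a looser pattern.

```r
# Hypothetical variable names from a "wide" longitudinal dataset
vars <- c("id", "x10", "x2", "x1", "y2", "y1", "age")

# Keep only names made of a non-digit stem followed by trailing digits
is_rm <- grepl("^[^0-9]+[0-9]+$", vars)
rm_vars <- vars[is_rm]

# Separate the stem from the numeric index (the measurement occasion)
stem <- sub("[0-9]+$", "", rm_vars)
occasion <- as.integer(sub("^[^0-9]+", "", rm_vars))

# Group by stem, ordering each group numerically rather than alphabetically
groups <- lapply(split(seq_along(stem), stem),
                 function(i) rm_vars[i[order(occasion[i])]])

groups$x  # "x1" "x2" "x10"
groups$y  # "y1" "y2"
```

This handles only a single trailing index; names with several embedded numbers, like those in the examples above, need the more general sort.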
   
mysort(), which works by recursively slicing off pairs of non-digit and
digit strings, seems more complicated than it should have to be, and I
wonder whether anyone has a more elegant solution. I don't think that
efficiency is a serious issue for the applications I'm considering, but of
course a more efficient solution would be of interest.
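One non-recursive possibility, offered only as a sketch: split each name into runs of digits and non-digits, left-pad every digit run with zeros to a common width, and sort on the padded keys. The name natural_sort is made up for illustration, and the sketch assumes R >= 3.3.0 (for strrep() and order(method = "radix")). The radix method compares strings in the C locale, so the result need not match mysort() in every case, for example when punctuation and digits compete at the same position.

```r
# Sketch of a natural sort via zero-padded keys (natural_sort is a
# hypothetical name, not an existing R function)
natural_sort <- function(x) {
  # Split each name into alternating runs of digits and non-digits
  tokens <- regmatches(x, gregexpr("[0-9]+|[^0-9]+", x))
  # No digit run can be longer than the longest name
  width <- max(nchar(x), 0L)
  keys <- vapply(tokens, function(tk) {
    is_num <- grepl("^[0-9]+$", tk)
    # Left-pad digit runs with zeros so they compare numerically as text
    tk[is_num] <- paste0(strrep("0", width - nchar(tk[is_num])), tk[is_num])
    paste(tk, collapse = "")
  }, character(1))
  # Radix ordering is stable and uses C-locale collation
  x[order(keys, method = "radix")]
}

natural_sort(c("q10.1.1", "q10.2.1", "q2.1.1", "q10.10.2"))
# [1] "q2.1.1"   "q10.1.1"  "q10.2.1"  "q10.10.2"
```

On the vectors s and t above, this reproduces the mysort() orderings shown. Names that differ only in leading zeros (say, "x2" and "x02") get identical keys and keep their original relative order.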

Thanks,
 John

------------------------------
John Fox, Professor
Department of Sociology
McMaster University
Hamilton, Ontario, Canada
web: socserv.mcmaster.ca/jfox


