[R] Adding SORT to UNIQUE

Avi Gross @v|gro@@ @end|ng |rom ver|zon@net
Mon Dec 20 22:37:53 CET 2021


Duncan and Martin and ...,

There are multiple groups where people discuss R and this seems to be a help group. The topic keeps coming up as to whether you should teach anything other than base R and I claim it depends. 

Many packages are indeed written mostly using base R or using other packages that ultimately use only base R constructions. So, obviously, those packages can be gotten around by re-writing your own complicated code to do the things you want. Some have parts written using external libraries such as parts written in C and that makes it harder.

I think the debate is not one-sided and the whole purpose of creating and sharing libraries is to make life easier for those who see tools that already exist and perhaps may even be tested by others and deemed worthy. For people with serious programming needs, I suggest that often 80% of your code can be made simpler, more powerful AND easier to read by using packages, compared with writing many times as much fairly complex code in the base language.

BUT, if someone is teaching a language like R one step at a time and a student asks a question, it is very wrong to tell them of a package to do all the work when clearly they need practice using a small subset of base functionality. If they are asked to implement a sort algorithm using say a recursive merge sort, then the assignment involves using functions that call themselves and so on.

So I was thinking of how I might have dealt with finding the unique members of a vector (called "temp" below) that contains character text such as "2.2", "99.6", "10.1" and "1.1", where the entries are all POTENTIALLY numeric and you want them sorted as if they were numeric but returned as character data. My first attempt at it was this:

as.character(sort(as.numeric(unique(temp))))
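
As a small illustration of that one-liner, here is a sketch with a made-up temp (the values are assumptions for the example, not data from the original question):

```r
# Hypothetical sample data: numbers stored as character strings, with duplicates
temp <- c("2.2", "99.6", "10.1", "1.1", "2.2", "10.1")

# A plain character sort compares the text, not the numeric value
sort(unique(temp))

# Convert to numeric, sort, then convert back to character
result <- as.character(sort(as.numeric(unique(temp))))
result   # "1.1" "2.2" "10.1" "99.6"

# Alternative that returns the original strings untouched, in numeric order
u <- unique(temp)
u[order(as.numeric(u))]
```

The round trip through as.numeric() can change the text (for instance "1.10" would come back as "1.1"), which is why the order() variant may be preferable when the exact strings matter.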

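Then I considered: if there is no guarantee that everything in temp can be coerced by as.numeric(), how do I handle that? My attempt to extend temp with a non-numeric entry failed and temp2 was not created, but only because I typed a period where the comma belongs, so R could not even parse the line:

  > as.numeric(c(temp. "blab")) -> temp2
  Error: unexpected string constant in "as.numeric(c(temp. "blab""

With the comma in place, as.numeric(c(temp, "blab")) does not stop with an error. It returns NA for "blab" and warns that NAs were introduced by coercion, and sort() then drops the NA by default, so the one-liner silently loses that entry.

So to handle that I might change the one-liner into multiple lines and check for failed conversions in one of many ways, such as testing the result with is.na() or catching the warning with tryCatch(), but then what is the requirement if a conversion fails? One possibility is to remove such entries and another is to add them back after sorting the rest, but add them back where? If there are many, they all need to be sorted too. Or should they simply become NA, which is tolerated?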
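
To make those choices concrete, here is a minimal base R sketch. The function name sort_unique_numeric and its non_numeric argument are hypothetical, invented for this illustration, and the sample data is made up. It relies on the fact that as.numeric() returns NA (with a warning) for text it cannot convert.

```r
# Hypothetical helper (not from any package): numeric-aware sorted unique values.
# non_numeric says what to do with entries that as.numeric() cannot convert:
#   "drop" removes them, "last" appends them in character order,
#   "na"   replaces all of them with a single NA at the end.
sort_unique_numeric <- function(x, non_numeric = c("drop", "last", "na")) {
  non_numeric <- match.arg(non_numeric)
  u <- unique(x[!is.na(x)])                # ignore values that are already missing
  num <- suppressWarnings(as.numeric(u))   # failed conversions become NA
  ok <- !is.na(num)
  sorted <- u[ok][order(num[ok])]          # original strings, in numeric order
  switch(non_numeric,
         drop = sorted,
         last = c(sorted, sort(u[!ok])),
         na   = if (any(!ok)) c(sorted, NA_character_) else sorted)
}

temp <- c("2.2", "99.6", "blab", "10.1", "1.1", "2.2", "abc")
sort_unique_numeric(temp)           # "1.1" "2.2" "10.1" "99.6"
sort_unique_numeric(temp, "last")   # "1.1" "2.2" "10.1" "99.6" "abc" "blab"
sort_unique_numeric(temp, "na")     # "1.1" "2.2" "10.1" "99.6" NA
```

Each option is only one possible answer to the questions above. Which one is right depends on what the caller needs.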

Get the idea?

It can take non-trivial work. Now if the package was made carefully, perhaps with options letting you specify among such choices as above, and it is trusted, why not use that if your needs may be complex?

One final note. In some places the functionality of unique() is done by a method that first sorts the data and then simply removes adjacent duplicates. In those cases, the data would come out sorted or in reverse sort order. But that is not what the built-in unique() does: it basically keeps track of what it has encountered and keeps each new item only if it has not been seen already, so the output is in the order of first appearance. Factors in R are related but differ on ordering. By default factor() takes its levels from the sorted unique values, so the first level in sort order is stored as 1, the next as 2 and so on, duplicates all share the same integer value, and the levels need be stored only once. If you want the levels in order of first appearance or some other order, there are fairly easy ways to do that, such as factor(x, levels = unique(x)), but there are packages like forcats that do such things often more consistently and easier. In a sense, you can do a sorted unique by simply making your data into a factor and asking for levels(whatever). Effectively, you are using it as a kind of set, and there are ways to also use sets in R.
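
A short base R check of these orderings, using made-up data. Note that factor() sorts its levels by default, so levels() already returns the sorted unique values, while unique() keeps the order of first appearance:

```r
x <- c("pear", "apple", "pear", "fig", "apple")

first_seen <- unique(x)            # "pear" "apple" "fig": order of first appearance
sorted     <- sort(unique(x))      # "apple" "fig" "pear"
lev        <- levels(factor(x))    # "apple" "fig" "pear": levels are sorted by default
codes      <- as.integer(factor(x))  # 3 1 3 2 1: duplicates share one integer code

# Levels in order of first appearance instead of sorted order
lev_first  <- levels(factor(x, levels = unique(x)))  # "pear" "apple" "fig"
```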

-----Original Message-----
From: R-help <r-help-bounces using r-project.org> On Behalf Of Duncan Murdoch
Sent: Monday, December 20, 2021 12:51 PM
To: Martin Maechler <maechler using stat.math.ethz.ch>; Rui Barradas <ruipbarradas using sapo.pt>
Cc: Stephen H. Dawson, DSL via R-help <r-help using r-project.org>
Subject: Re: [R] Adding SORT to UNIQUE

On 20/12/2021 12:32 p.m., Martin Maechler wrote:
>>>>>> Rui Barradas
>>>>>>      on Mon, 20 Dec 2021 17:05:33 +0000 writes:
> 
>      > Hello,
>      > Package stringr has functions str_sort and str_order, both with an
>      > argument 'numeric' that will sort the numbers correctly.
>      > Maybe that's what you are looking for, see the example below.
> 
> 
>      > x <- sample(sprintf("ab%d", 1:20))     # shuffle the vector
>      > stringr::str_sort(x, numeric = TRUE)   # sort considering the numbers
> 
> Again:
> There's really no need to use non-base R here (and in almost all such 
> questions about string handling!) as Avi Gross' answer shows.

That gives a different sort order:

  stringr::str_sort(x, numeric = TRUE)

gives

   [1] "ab1"  "ab2"  "ab3"  "ab4"  "ab5"  "ab6"  "ab7"  "ab8"  "ab9"  "ab10" "ab11" "ab12" "ab13" "ab14" "ab15" "ab16" "ab17"
  [18] "ab18" "ab19" "ab20"

(with the numbers in order), while sort(x) gives

   [1] "ab1"  "ab10" "ab11" "ab12" "ab13" "ab14" "ab15" "ab16" "ab17" "ab18" "ab19" "ab2"  "ab20" "ab3"  "ab4"  "ab5"  "ab6"
  [18] "ab7"  "ab8"  "ab9"

with the characters in order.  I don't think the "numeric" option is available in base R (though of course you could write a function to do it, so there's no *need*, but it's certainly more convenient to use the stringr function if that's the order you want).
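
A minimal sketch of such a function in base R, assuming every string is a non-digit prefix followed only by digits. The name sort_by_number is hypothetical and is not part of base R or stringr:

```r
# Hypothetical helper: sort strings like "ab10" by prefix, then by trailing number
sort_by_number <- function(x) {
  prefix <- sub("[0-9]+$", "", x)               # text before the trailing digits
  number <- as.numeric(sub("^[^0-9]*", "", x))  # digits after the leading non-digits
  x[order(prefix, number)]
}

x <- sprintf("ab%d", 20:1)   # "ab20" "ab19" ... "ab1"
sort_by_number(x)            # "ab1" "ab2" ... "ab20"
```

Strings that do not fit the assumed pattern (no digits, or digits in the middle) would need extra handling, which is part of what the stringr option takes care of.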

Duncan Murdoch


> 
> 
>      > Hope this helps,
> 
>      > Rui Barradas
> 
> 
>      > At 16:58 on 20/12/21, Stephen H. Dawson, DSL via R-help wrote:
>      >> Hi,
>      >>
>      >>
>      >> Running a simple syntax set to review entries in dataframe columns. Here
>      >> is the working code.
>      >>
>      >> Data <- read.csv("./input/Source.csv", header=T)
>      >> describe(Data)
>      >> summary(Data)
>      >> unique(Data[1])
>      >> unique(Data[2])
>      >> unique(Data[3])
>      >> unique(Data[4])
>      >>
>      >> I would like to sort the unique entries. The data in the various
>      >> columns are not all defined as numbers; some are text. I realize 1 and 10
>      >> will not sort properly, as the column is not defined as a number, but
>      >> want to see what I have in the columns viewed as sorted.
>      >>
>      >> QUESTION
>      >> What is the best process to sort unique output, please?
>      >>
>      >>
>      >> Thanks.
> 
>      > ______________________________________________
>      > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>      > https://stat.ethz.ch/mailman/listinfo/r-help
>      > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>      > and provide commented, minimal, self-contained, reproducible code.
> 
