[R] Adding SORT to UNIQUE

Tue Dec 21 20:44:42 CET 2021

Dear Jeff,

I haven't investigated your claim systematically, but out of curiosity, I did try extending my previous example, admittedly arbitrarily. In doing so, I assumed that you intended col in the first case to be the column subscript, not the row subscript. Here's what I got (on a newish M1 MacBook Pro):

> system.time(
+   for ( col in colnames( D ) ) {
+     idx <- sample(1e6, 1000)
+     D[ idx, col ] <- idx
+   }
+ )
   user  system elapsed 
  0.913   6.545  43.737 

> system.time(
+   for ( col in colnames( D ) ) {
+     idx <- sample(1e6, 1000)
+     D[[ col ]][ idx ] <- idx
+   }
+ )
   user  system elapsed 
  0.876   6.828  52.033
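
For anyone who wants to try this without allocating a data frame of that size, here is a minimal sketch at a much smaller scale. The dimensions, the seed, and the helper names fill_matrix_style() and fill_list_style() are assumptions made for illustration and are not part of the runs above. The row indices are drawn once so that both forms assign exactly the same values and the results can be compared. Timings depend on the machine, so none are quoted.

```r
# Smaller stand-in for D (assumed size: 1e4 rows x 50 columns).
set.seed(123)
n_row <- 1e4
n_col <- 50
D <- data.frame(matrix(rnorm(n_row * n_col), n_row, n_col))

# Draw the row indices once so both versions assign the same values.
idx_list <- lapply(seq_len(n_col), function(j) sample(n_row, 100))
names(idx_list) <- colnames(D)

# Matrix-style assignment: D[rows, column] <- value
fill_matrix_style <- function(D, idx_list) {
  for (col in colnames(D)) {
    idx <- idx_list[[col]]
    D[idx, col] <- idx
  }
  D
}

# List-style assignment: D[[column]][rows] <- value
fill_list_style <- function(D, idx_list) {
  for (col in colnames(D)) {
    idx <- idx_list[[col]]
    D[[col]][idx] <- idx
  }
  D
}

D1 <- fill_matrix_style(D, idx_list)
D2 <- fill_list_style(D, idx_list)

# Check that both forms leave the data frame in the same state.
same_result <- isTRUE(all.equal(D1, D2))
print(same_result)

# Time each form (results are machine-dependent).
print(system.time(fill_matrix_style(D, idx_list)))
print(system.time(fill_list_style(D, idx_list)))
```

Wrapping each loop in a function keeps the original D untouched, so the two forms start from the same data each time they are run.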

Best,
 John

On 2021-12-21, 1:04 PM, "R-help on behalf of Jeff Newmiller" <r-help-bounces using r-project.org on behalf of jdnewmil using dcn.davis.ca.us> wrote:

    When your brain is wired to treat a data frame like a matrix, then you think things like

    for ( col in colnames( D ) ) {
      idx <- expr
      D[ col, idx ] <- otherexpr
    }

    are reasonable, when

    for ( col in colnames( D ) ) {
      idx <- expr
      D[[ col ]][ idx ] <- otherexpr
    }

    does actually run significantly faster.

    On December 21, 2021 9:28:52 AM PST, "Fox, John" <jfox using mcmaster.ca> wrote:
    >Dear Jeff,
    >
    >On 2021-12-21, 11:59 AM, "R-help on behalf of Jeff Newmiller" <r-help-bounces using r-project.org on behalf of jdnewmil using dcn.davis.ca.us> wrote:
    >
    >    Intuitive, perhaps, but noticeably slower.
    >
    >I think that in most applications, one wouldn't notice the difference; for example:
    >
    >> D <- data.frame(matrix(rnorm(1000*1e6), 1e6, 1000))
    >
    >> microbenchmark(D[, 1])
    >Unit: microseconds
    >   expr   min    lq    mean median     uq    max neval
    > D[, 1] 3.321 3.362 3.98561  3.444 3.5875 51.291   100
    >
    >> microbenchmark(D[[1]])
    >Unit: microseconds
    >   expr   min    lq    mean median     uq    max neval
    > D[[1]] 1.722 1.763 1.99137  1.804 1.8655 17.876   100
    >
    >Best,
    > John
    >
    >
    >    And it doesn't work on tibbles by design. Data frames are lists of columns.
    >
    >
    >    On December 21, 2021 8:38:35 AM PST, Duncan Murdoch <murdoch.duncan using gmail.com> wrote:
    >    >On 21/12/2021 11:31 a.m., Duncan Murdoch wrote:
    >    >> On 21/12/2021 11:20 a.m., Stephen H. Dawson, DSL wrote:
    >    >>> Thanks for the reply.
    >    >>>
    >    >>> sort(unique(Data[1]))
    >    >>> Error in `[.data.frame`(x, order(x, na.last = na.last, decreasing =
    >    >>> decreasing)) :
    >    >>>      undefined columns selected
    >    >> 
    >    >> That's the wrong syntax:  Data[1] is not "column one of Data".  Use
    >    >> Data[[1]] for that, so
    >    >> 
    >    >>     sort(unique(Data[[1]]))
    >    >
    >    >Actually, I'd probably recommend
    >    >
    >    >   sort(unique(Data[, 1]))
    >    >
    >    >instead.  This treats Data as a matrix rather than as a list. 
    >    >Dataframes are lists that look like matrices, but to me the matrix 
    >    >aspect is usually more intuitive.
    >    >
    >    >Duncan Murdoch
    >    >
    >    >> 
    >    >> I think Rui already pointed out the typo in the quoted text below...
    >    >> 
    >    >> Duncan Murdoch
    >    >> 
    >    >>>
    >    >>> The recommended syntax did not work, as listed above.
    >    >>>
    >    >>> What I want is the sort of distinct column output. Again, the column may
    >    >>> be text or numbers. This is a huge analysis effort with data coming at
    >    >>> me from many different sources.
    >    >>>
    >    >>>
    >    >>> *Stephen Dawson, DSL*
    >    >>> /Executive Strategy Consultant/
    >    >>> Business & Technology
    >    >>> +1 (865) 804-3454
    >    >>> http://www.shdawson.com <http://www.shdawson.com>
    >    >>>
    >    >>>
    >    >>> On 12/21/21 11:07 AM, Duncan Murdoch wrote:
    >    >>>> On 21/12/2021 10:16 a.m., Stephen H. Dawson, DSL via R-help wrote:
    >    >>>>> Thanks everyone for the replies.
    >    >>>>>
    >    >>>>> It is clear one either needs to write a function or put the unique
    >    >>>>> entries into another dataframe.
    >    >>>>>
    >    >>>>> It seems odd R cannot sort a list of unique column entries with ease.
    >    >>>>> Python and SQL can do it with ease.
    >    >>>>
    >    >>>> I've seen several responses that looked pretty simple.  It's hard to
    >    >>>> beat sort(unique(x)), though there's a fair bit of confusion about
    >    >>>> what you actually want.  Maybe you should post an example of the code
    >    >>>> you'd use in Python?
    >    >>>>
    >    >>>> Duncan Murdoch
    >    >>>>
    >    >>>>>
    >    >>>>> QUESTION
    >    >>>>> Is there a simpler means other than the unique function to capture
    >    >>>>> distinct column entries, then sort that list?
    >    >>>>>
    >    >>>>>
    >    >>>>>
    >    >>>>>
    >    >>>>> On 12/20/21 5:53 PM, Rui Barradas wrote:
    >    >>>>>> Hello,
    >    >>>>>>
    >    >>>>>> Inline.
    >    >>>>>>
    >    >>>>>> At 21:18 on 20/12/21, Stephen H. Dawson, DSL via R-help wrote:
    >    >>>>>>> Thanks.
    >    >>>>>>>
    >    >>>>>>> sort(unique(Data[[1]]))
    >    >>>>>>>
    >    >>>>>>> This syntax provides row numbers, not column values.
    >    >>>>>>
    >    >>>>>> This is not right.
    >    >>>>>> The syntax Data[1] extracts a sub-data.frame, the syntax Data[[1]]
    >    >>>>>> extracts the column vector.
    >    >>>>>>
    >    >>>>>> As for my previous answer, it was not addressing the question, I
    >    >>>>>> misinterpreted it as being a question on how to sort by numeric order
    >    >>>>>> when the data is not numeric. Here is a, hopefully, complete answer.
    >    >>>>>> Still with package stringr.
    >    >>>>>>
    >    >>>>>>
    >    >>>>>> cols_to_sort <- 1:4
    >    >>>>>>
    >    >>>>>> Data2 <- lapply(Data[cols_to_sort], \(x){
    >    >>>>>>      stringr::str_sort(unique(x), numeric = TRUE)
    >    >>>>>> })
    >    >>>>>>
    >    >>>>>>
    >    >>>>>> Or using Avi's suggestion of writing a function to do all the work and
    >    >>>>>> simplify the lapply loop later,
    >    >>>>>>
    >    >>>>>>
    >    >>>>>> unisort2 <- function(vec, ...) stringr::str_sort(unique(vec), ...)
    >    >>>>>> Data2 <- lapply(Data[cols_to_sort], unisort2, numeric = TRUE)
    >    >>>>>>
    >    >>>>>>
    >    >>>>>> Hope this helps,
    >    >>>>>>
    >    >>>>>> Rui Barradas
    >    >>>>>>
    >    >>>>>>
    >    >>>>>>>
    >    >>>>>>>
    >    >>>>>>>
    >    >>>>>>> On 12/20/21 11:58 AM, Stephen H. Dawson, DSL via R-help wrote:
    >    >>>>>>>> Hi,
    >    >>>>>>>>
    >    >>>>>>>>
    >    >>>>>>>> Running a simple syntax set to review entries in dataframe columns.
    >    >>>>>>>> Here is the working code.
    >    >>>>>>>>
    >    >>>>>>>> Data <- read.csv("./input/Source.csv", header=T)
    >    >>>>>>>> describe(Data)
    >    >>>>>>>> summary(Data)
    >    >>>>>>>> unique(Data[1])
    >    >>>>>>>> unique(Data[2])
    >    >>>>>>>> unique(Data[3])
    >    >>>>>>>> unique(Data[4])
    >    >>>>>>>>
    >    >>>>>>>> I would like to sort the unique entries. The data in the various
    >    >>>>>>>> columns are not all defined as numbers; some are text. I realize 1 and
    >    >>>>>>>> 10 will not sort properly when the column is not defined as a number,
    >    >>>>>>>> but I want to see what I have in the columns in sorted order.
    >    >>>>>>>>
    >    >>>>>>>> QUESTION
    >    >>>>>>>> What is the best process to sort unique output, please?
    >    >>>>>>>>
    >    >>>>>>>>
    >    >>>>>>>> Thanks.
    >    >>>>>>>
    >    >>>>>>> ______________________________________________
    >    >>>>>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
    >    >>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
    >    >>>>>>> PLEASE do read the posting guide
    >    >>>>>>> http://www.R-project.org/posting-guide.html
    >    >>>>>>> and provide commented, minimal, self-contained, reproducible code.
    >    >>>>>>
    >    >>>>>
    >    >>>>
    >    >>>>
    >    >>>
    >    >>>
    >    >>
    >    >
    >
    >    -- 
    >    Sent from my phone. Please excuse my brevity.
    >
    >
