[R] Question about Rfast colMins and colMaxs

Bert Gunter bgunter.4567 using gmail.com
Wed Dec 1 18:23:06 CET 2021


"As to obtaining min or max of columns, such a question is nonsensical
when the column contains character data,"

That is false.

First, plain (unclassed) character data is sorted lexicographically,
in an order that depends on the locale, as Petr already pointed out.
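
As a minimal sketch of that first point (the vector `fruit` is made up
for illustration, and it deliberately uses only lowercase ASCII words,
since mixed-case or accented strings may order differently from one
locale to another):

```r
# Hypothetical example data: lowercase ASCII words only
fruit <- c("banana", "apple", "cherry")

# min(), max() and range() use the same collation order as sort()
min(fruit)
max(fruit)
range(fruit)

# method = "radix" collates as in the C locale regardless of the
# session locale; there, upper-case letters come before lower-case ones
sort(c("b", "A", "a", "B"), method = "radix")
```

Whether the default collation order is the one you want is a separate
question, but min and max are certainly defined for such data.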

Perhaps more useful, however, is that if the character data has been
classed as an *ordered* factor, then max, min, and range are all
meaningfully interpreted for it according to the ordering:

> z <- ordered(sample(letters[1:3], 10,rep = TRUE), levels = c('c','a','b'))
> z
 [1] a a c b b c a a b c
Levels: c < a < b
> max(z)
[1] b
Levels: c < a < b
> min(z)
[1] c
Levels: c < a < b
> range(z)
[1] c b
Levels: c < a < b
> sort(z)
 [1] c c c a a a a b b b
Levels: c < a < b
### etc.
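
To tie this back to the column question: a data frame is a list of
columns, so the same functions can be applied per column with lapply().
The following is only a sketch with made-up data (the data frame `dta`
and its column names are hypothetical, not from the original post):

```r
# Hypothetical data: one character, one ordered-factor, one numeric column
dta <- data.frame(
  name  = c("banana", "apple", "cherry"),
  size  = ordered(c("small", "large", "medium"),
                  levels = c("small", "medium", "large")),
  count = c(3, 1, 2),
  stringsAsFactors = FALSE
)

# lapply() returns a list, so each result keeps its own class:
# character for name, ordered factor for size, numeric for count
col_mins <- lapply(dta, min)
col_maxs <- lapply(dta, max)
str(col_mins)
```

Using lapply() rather than apply() matters here, because apply() would
first coerce the whole data frame to a character matrix and the factor
ordering would be lost.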

See ?ordered for details, and particularly note:
"There are "factor" and "ordered" methods for the group generic Ops
which provide methods for the Comparison operators, and for the min,
max, and range generics in Summary of "ordered". (The rest of the
groups and the Math group generate an error as they are not meaningful
for factors.)"

(You would have to go down a rabbit hole to learn about group generics
for S3 classes, but I think you can ignore those details for the
purposes here).
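
For anyone who does want a quick look, here is a small sketch of what
the quoted help text means in practice. The vectors `f` and `o` and the
helper `fails()` are made up for illustration:

```r
f <- factor(c("lo", "hi", "mid"))                                  # unordered
o <- ordered(c("lo", "hi", "mid"), levels = c("lo", "mid", "hi"))  # ordered

# Hypothetical helper: TRUE if evaluating expr signals an error
fails <- function(expr) {
  inherits(tryCatch(expr, error = function(e) e), "error")
}

fails(max(f))    # TRUE: max is not meaningful for an unordered factor
fails(max(o))    # FALSE: max is defined for an ordered factor
o < "hi"         # comparison follows the level order, not the alphabet
fails(sqrt(o))   # TRUE: Math group functions are errors for factors
```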

Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Wed, Dec 1, 2021 at 8:41 AM Jeff Newmiller <jdnewmil using dcn.davis.ca.us> wrote:
>
> Avi: As I understand it, the definition regarding what is on-topic is not targeted at tidyverse... it addresses the futility of trying to support the thousands of problem domains represented by contributed packages in one mailing list without drowning the subscribers in discussions about problem domains they are not interested in. This mailing list happens to be about the R language, and extends as far as the packages included with the R language. Tidyverse just happens to be outside this scope (though some here are more opposed to using it than others and express that sentiment in their responses).
>
> Stephen: As to obtaining min or max of columns, such a question is nonsensical when the column contains character data, and mostly nonsensical when the data are binary regardless of what package you are using. From this perspective, it only makes sense to look at extremes on numeric columns selected from the data after they are loaded into a data frame.
>
>     dta <- read.csv( "yourdata.csv" )
>     dta_num <- dta[ , c("Elapsed_Time", "Age" ) ]
>
> I don't know this package Rfast or its functions, but I would imagine that you could then use
>
>     dta_maxs <- colMaxs( as.matrix( dta_num ) )
>
> which might be a few microseconds faster than a typical base R solution:
>
>     dta_maxs <- apply( dta_num, 2, max )
>
> which works fine without loading a special package for this rather basic functionality.
>
> Sometimes packages get developed to meet unusual needs (thousands of columns) or while chasing an optimization that turns out to be marginal (the real time sink may be in converting a data frame into a matrix so the compiled C code can then process it a smidge quicker). None of this should interfere with your learning of R, but you can end up following biased blogs or package vignettes that may or may not deliver on their promise to simplify your tasks.
>
> Some aspects of tidyverse are quite effective and I use them regularly... but there are a lot of these less valuable contributions also that are distractions or are best for corner cases. You can only figure out the difference by learning the basics first and comparing approaches (sometimes a package that requires you to manually convert to matrix is a good idea, but sometimes it is just a pain) for usefulness and/or speed (e.g. using the microbenchmark contributed package).
>
> On November 30, 2021 7:42:35 PM PST, Avi Gross via R-help <r-help using r-project.org> wrote:
> >Stephen,
> >
> >Although what is in the STANDARD R distribution can vary several ways, in
> >general, if you need to add a line like:
> >
> >library(something)
> >or
> >require(something)
> >
> >and your code does not work unless you have done that, then you can assume
> >it is not built in to R as it starts.
> >
> >Having said that, tons of exceptions may exist that cause R to load in
> >things on your machine for everyone or just for you without you having to
> >notice.
> >
> >I think this forum lately has been deluged with questions about all kinds of
> >add-on packages and in particular, lots of the ones in the tidyverse.
> >Clearly the purpose here is not that broad.
> >
> >But since I use some packages like the tidyverse extensively, and I am far
> >from alone, I wonder if someday the powers that be realize it is a losing
> >battle to exclude at least some of it. It would be so nice not having to
> >include a long list of packages for some programs or somehow arrange that
> >people using something you shared had installed and loaded them. But there
> >are too many packages out there of varying quality and usefulness and
> >popularity with more every day being added. Worse, many are somewhat
> >incompatible such as having functions with the same names that hide earlier
> >ones loaded.
> >
> >Base R does come with functions like colSums and colMeans and similar row
> >functions. But as mentioned, a data.frame is a list of vectors and R
> >supports certain functional programming constructs over lists using things
> >like:
> >
> >lapply(df, min)
> >sapply(df, min)
> >
> >And actually quite a few ways depending on what info you want back and
> >whether you insist it be returned as a list or vector or other things. You
> >can even supply additional arguments that might be needed such as if you
> >want to ignore any NA values:
> >
> >lapply(df, min, na.rm=TRUE)
> >
> >The package you looked at is trying to be fast and uses what looks like
> >compiled external code but so does lapply.
> >
> >If this is too bothersome for you, consider making a one-liner function like
> >this:
> >
> >mycolMins <- function(df, ...) lapply(df, min, ...)
> >
> >Once defined, you can use that just fine and not think about it again and I
> >note this answer (like others) is offering you something in base R that
> >works fine on data.frames and the like.
> >
> >You can extend to many similar ideas like this one that calculates the min
> >unless you override it with max or mean or sd or a bizarre function like
> >`[`:
> >
> >mycolCalc <- function(df, FUN = min, ...) lapply(df, FUN, ...)
> >
> >so a call to:
> >
> >mycolCalc(df, `[`, 3)
> >
> >will return exactly the third item in each column!
> >
> >I find it to be very common for someone these days to do a quick search for
> >a way to do something in a language like R and not really look to see if it
> >is a standard way or something special. Matrices in R are not quite the same
> >as some other objects like a data.frame or tibble and a package written to
> >be used on one may (or may not) happen to work with another. Some packages
> >are carefully written to try to detect what kind of object it gets and when
> >possible convert it to another. The "apply" function is meant for matrices
> >but if it sees something else it looks at the dimensionality and tries to
> >coerce it with as.matrix or as.array first. As others have noted, this means
> >a data.frame containing non-numeric parts may fail or should have any other
> >columns hidden/removed as in this df that has some non-numeric fields:
> >
> >> df
> >  i       s   f     b i2
> >1 1   hello 1.2  TRUE  5
> >2 2   there 2.3 FALSE  4
> >3 3 goodbye 3.4  TRUE  3
> >
> >So a bit more complex one-liner removes any non-numeric columns like this:
> >
> >> mycolMins(df[, sapply(df, is.numeric)])
> >$i
> >[1] 1
> >
> >$f
> >[1] 1.2
> >
> >$i2
> >[1] 3
> >
> >Clearly converting that to a matrix while whole would result in everything
> >being converted to character and a minimum may be elusive.
> >
> >-----Original Message-----
> >From: R-help <r-help-bounces using r-project.org> On Behalf Of Stephen H. Dawson,
> >DSL via R-help
> >Sent: Tuesday, November 30, 2021 5:37 PM
> >To: Bert Gunter <bgunter.4567 using gmail.com>
> >Cc: r-help using r-project.org
> >Subject: Re: [R] Question about Rfast colMins and colMaxs
> >
> >Oh, you are segmenting standard R from the rest of R.
> >
> >Well, that part did not come across to me in your original reply. I am not
> >clear on a standard versus non-standard list. I will look into this aspect
> >and see what I can learn going forward.
> >
> >
> >Thanks,
> >*Stephen Dawson, DSL*
> >/Executive Strategy Consultant/
> >Business & Technology
> >+1 (865) 804-3454
> >http://www.shdawson.com <http://www.shdawson.com>
> >
> >
> >On 11/30/21 5:26 PM, Bert Gunter wrote:
> >> ... but Rfast is *not* a "standard" package, as the rest of the PG
> >> excerpt says. So contact the maintainer and ask him/her what they
> >> think the best practice should be for their package. As has been
> >> pointed out already, it appears to differ from the usual "read it in
> >> as a data frame" procedure.
> >>
> >> Bert
> >>
> >> On Tue, Nov 30, 2021 at 2:11 PM Stephen H. Dawson, DSL
> >> <service using shdawson.com> wrote:
> >>> Right, R Studio is not R.
> >>>
> >>> However, the Rfast package is part of R.
> >>>
> >>> https://cran.r-project.org/web/packages/Rfast/index.html
> >>>
> >>> So, rephrasing my question...
> >>> What is the best practice to bring a csv file into R so it can be
> >>> accessed by colMaxs and colMins, please?
> >>>
> >>>
> >>>
> >>> On 11/30/21 3:19 PM, Bert Gunter wrote:
> >>>> RStudio is **not** R. In particular, the so-called TidyVerse
> >>>> consists of all *non*-standard contributed packages, about which the PG
> >>>> says:
> >>>>
> >>>> "For questions about functions in standard packages distributed with
> >>>> R (see the FAQ Add-on packages in R), ask questions on R-help.
> >>>> [The link is:
> >>>> https://cran.r-project.org/doc/FAQ/R-FAQ.html#Add-on-packages-in-R
> >>>> This gives the list of current _standard_ packages]
> >>>>
> >>>> If the question relates to a contributed package , e.g., one
> >>>> downloaded from CRAN, try contacting the package maintainer first.
> >>>> You can also use find("functionname") and
> >>>> packageDescription("packagename") to find this information. Only
> >>>> send such questions to R-help or R-devel if you get no reply or need
> >>>> further assistance. This applies to both requests for help and to
> >>>> bug reports."
> >>>>
> >>>> Note that RStudio maintains its own help resources at:
> >>>> https://community.rstudio.com/
> >>>> This is where questions about the TidyVerse, ggplot, etc. should be
> >>>> posted.
> >>>>
> >>>>
> >>>>
> >>>> Bert Gunter
> >>>>
> >>>>
> >>>> On Tue, Nov 30, 2021 at 10:55 AM Stephen H. Dawson, DSL via R-help
> >>>> <r-help using r-project.org> wrote:
> >>>>> Hi,
> >>>>>
> >>>>>
> >>>>> I am working to understand the Rfast functions of colMins and
> >>>>> colMaxs. I worked through the example listed on page 54 of the PDF.
> >>>>>
> >>>>> https://cran.r-project.org/web/packages/Rfast/index.html
> >>>>>
> >>>>> https://cran.r-project.org/web/packages/Rfast/Rfast.pdf
> >>>>>
> >>>>> My data is in a CSV file. So, I bring it into R Studio using:
> >>>>> Data <- read.csv("./input/DataSet05.csv", header=T)
> >>>>>
> >>>>> However, I read the instructions listed on page 54 of the PDF
> >>>>> saying I need to bring data into R using a matrix. I think read.csv
> >>>>> brings the data in as a dataframe. I think colMins is failing
> >>>>> because it is looking for a matrix but finds a dataframe.
> >>>>>
> >>>>>    > colMaxs(Data)
> >>>>> Error in colMaxs(Data) :
> >>>>>      Not compatible with requested type: [type=list; target=double].
> >>>>>    > colMins(Data, na.rm = TRUE)
> >>>>> Error in colMins(Data, na.rm = TRUE) :
> >>>>>      unused argument (na.rm = TRUE)
> >>>>>    > colMins(Data, value = FALSE, parallel = FALSE)
> >>>>> Error in colMins(Data, value = FALSE, parallel = FALSE) :
> >>>>>      Not compatible with requested type: [type=list; target=double].
> >>>>>
> >>>>> QUESTION
> >>>>> What is the best practice to bring a csv file into R Studio so it
> >>>>> can be accessed by colMaxs and colMins, please?
> >>>>>
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> ______________________________________________
> >>>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>>> PLEASE do read the posting guide
> >>>>> http://www.R-project.org/posting-guide.html
> >>>>> and provide commented, minimal, self-contained, reproducible code.
> >>>
> >
>
> --
> Sent from my phone. Please excuse my brevity.


