[R] Row exclude

Sat Jan 29 22:34:51 CET 2022

Rui has indeed improved my first attempt in several ways so my comments are now focused on another level. There is seemingly endless discussion here about what is base R. Questions as well as Answers that go beyond base R are often challenged and I understand why, even if I personally don't worry about it.

As I see it, R has many levels, like many modern programming languages, and some are built-in by default, while others are add-ons of various kinds and some are now seen as more commonly used than others. Some here, and NOT ME, seem particularly annoyed by the concept of the tidyverse existing or the corporate nature of RSTUDIO. I say, the more the better as long as they are well-designed and robust and efficient enough.

There are many ways you can use R in a simple mode, to the point where you do not even use vectors as intended but instead use loops to, say, add corresponding entries in two vectors one item at a time using an index, as you might in earlier languages. That is perfectly valid in R, albeit not using the language as intended, since A+B does it for you fairly trivially while hiding a kind of loop done behind the scenes. But if the two vectors are not the same length, R recycles (broadcasts) the shorter one as needed, which can lead to subtle errors UNLESS that was intended.
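
A minimal sketch of the loop-versus-vector point, using made-up vectors (the names a, b and short are illustrative only and appear nowhere else in this thread):

```r
a <- c(1, 2, 3, 4)
b <- c(10, 20, 30, 40)

# Loop style: add one pair of entries at a time through an index
loop_sum <- numeric(length(a))
for (k in seq_along(a)) {
  loop_sum[k] <- a[k] + b[k]
}

# Vectorized style: same result, with the loop hidden inside `+`
vec_sum <- a + b
identical(loop_sum, vec_sum)

# Recycling: the length-2 vector is silently reused to cover all four
# entries, i.e. 1 + 100, 2 + 200, 3 + 100, 4 + 200
short <- c(100, 200)
recycled <- a + short
```

When the longer length is not a multiple of the shorter one, R still recycles but issues a warning, which is usually the only hint that something went wrong.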

Like many languages, R has additional modes of a sort. It is very loosely object-oriented, and some solutions to problems may make use of that or of other features not always found in other languages, such as being able to attach attributes of an arbitrary nature to things. But someone taking a beginner course in R, or just using it in simple ways, generally does not know or care, and being given a possible solution like that may not be very helpful.

R is at heart a functional programming language, and experienced users, as Rui clearly is, can make serious use of paradigms like map/reduce to create what are often quite abstract solutions that can be tailored to do all kinds of things by simply changing the functions invoked or, in this case, also the data supplied. I was tempted to use a variant of his solution using the pmap() function that I am familiar with, but it is not base R. It is part of the "purrr" package, which is in the not-appreciated-here package of packages called the tidyverse, LOL!

pmap() can take an arbitrary data.frame, look at it one row at a time, and apply a function that sees all the columns. That function can be written so it applies your logic to whichever column entries of that row you wish and combines the calculations to return something like TRUE/FALSE. In this case, it could be code that applies a regular expression to each column entry and combines the results with the usual logical connectives like AND and NOT (using R notation) to return a TRUE or FALSE. pmap() then collects those into a vector, and you use that to index the data.frame to keep only valid rows. BUT, I reconsidered using it here as it is a tad advanced and not base R. Nor do I claim it is better than what Rui and others could come up with. It is just not as simple as the case we are looking at.
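
To make the row-at-a-time idea concrete without leaving base R, here is a rough sketch that uses mapply() as a stand-in for pmap(). The data frame toy and the function row_ok are invented for illustration; they are not the original poster's data or anyone's posted code.

```r
# Hypothetical data, invented here only for illustration
toy <- data.frame(
  Name   = c("Alex", "Bob", "Carol", "Dan", "Katy", "Jack3"),
  Age    = c("20", "25", "24", "3y", "35", "50"),
  Weight = c("13x", "142", "120", "175", "160", "180"),
  stringsAsFactors = FALSE
)

# Sees every column of ONE row and returns a single TRUE/FALSE
row_ok <- function(Name, Age, Weight) {
  !grepl("[[:digit:]]", Name) &&
    !grepl("[[:alpha:]]", Age) &&
    !grepl("[[:alpha:]]", Weight)
}

# Base R stand-in for the row-wise walk; with purrr installed,
# purrr::pmap_lgl(toy, row_ok) should give the same logical vector
keep <- unname(mapply(row_ok, toy$Name, toy$Age, toy$Weight))
toy[keep, ]
```

The argument names of row_ok match the column names on purpose, since that is how a pmap-style call would line them up.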

R has another facet that needs to be used carefully and that significantly alters some approaches compared to a language like Python. Python has a much nicer object-oriented set of tools but lacks some of the delayed evaluation R supports, which sometimes gets in the way because people expect expressions to be evaluated sooner, or at all. I see strengths and weaknesses in each and try to use a language suited to my needs, and to use it mostly as intended.
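
A small sketch of what delayed evaluation means in practice (the function names ignores_arg and uses_arg are made up for this example):

```r
# An argument is a promise: it is evaluated only if the body uses it
ignores_arg <- function(x) "never looked at x"
ignores_arg(stop("this error is never raised"))

# Evaluation happens at first use inside the body, not at the call,
# so "body started" is printed before "argument evaluated"
uses_arg <- function(x) {
  message("body started")
  x
}
uses_arg({message("argument evaluated"); 1})
```

In an eagerly evaluated language such as Python, the first call would fail before the function body ever ran.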

I also ask if we have met the needs of the person who asked this question. If they do not reply and merely REPOST the same question with a shorter subject line, then I suggest we all wasted our time trying. Proper etiquette, I might think, is to reply IN PUBLIC to some work shown by others, and especially to explain anything being asked by us and to let us know what worked for them or met their needs, or to show a portion of what code they finally implemented. Some of that may yet happen, but can anyone blame me for being a tad suspicious this time?

I tend to be interested in deeper discussions and many are outside the scope of this forum. So I acknowledge that discussing alternate methods including more abstract ones using functional programming or other tricks, is a bit outside what is expected here.

I want, though, to add one more idea. Can we agree that the user may have a more general concept to be considered here? That is the concept of having a data.frame where each column is either purely numeric, consisting of just 0 through 9 with perhaps no spaces, periods, commas or anything extraneous, OR purely alphabetic with no numerals allowed, and alphabetic in the same sense as Rui uses. Mind you, I do not see any reason for always using the current locale for something like names of people that may well be written with characters from another locale. I would think any string with all non-numeric characters might be allowed for the purpose.

Can you write a function that accepts only pure cells that have either no numerals or no alphabetics, but not a mixture? The same function can initially be applied to all columns of a data.frame that is only supposed to contain columns of one kind or the other but not combinations. You might begin by reading in all data as character mode, with perhaps extraneous white space stripped. You then apply the above function to identify any rows that contain a mixed alphanumeric item and eliminate all such rows. For consistency, you might examine the resulting data.frame and try to convert all columns to numeric. Any that fail conversion attempts are left as character but may possibly have anomalies like one or more alphabetic items mixed into an otherwise numeric set of entries. That might require another filter run per column to identify those and either remove more rows or replace the bad ones with NA or a default like 0 in what is then convertible to a numeric column.
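
One possible sketch of such a purity filter, under the assumption that everything was read in as character. The function is_mixed and the data frame raw are hypothetical names and invented data, not anything posted earlier in the thread.

```r
# TRUE when a cell mixes letters and digits; pure (or empty) cells give FALSE
is_mixed <- function(x) {
  x <- trimws(as.character(x))
  grepl("[[:digit:]]", x) & grepl("[[:alpha:]]", x)
}

# Hypothetical all-character data, invented for illustration
raw <- data.frame(
  Name   = c("Alex", "Bob", "Carol", "Dan", "Katy", "Jack3"),
  Age    = c("20", "25", "24", "3y", "35", "50"),
  Weight = c("13x", "142", "120", "175", "160", "180"),
  stringsAsFactors = FALSE
)

# Drop every row that holds at least one mixed cell
mixed <- vapply(raw, is_mixed, logical(nrow(raw)))
clean <- raw[rowSums(mixed) == 0L, , drop = FALSE]

# Convert a column to numeric only if every remaining entry is all digits
all_digits <- vapply(clean, function(col) all(grepl("^[[:digit:]]+$", col)),
                     logical(1))
clean[all_digits] <- lapply(clean[all_digits], as.numeric)
str(clean)
```

This sketch does not handle the anomaly mentioned above, a purely alphabetic entry sitting in an otherwise numeric column. Such a column would simply stay character and would need the extra per-column pass.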

But my thought was that it is more complex to design something (as Rui did) that takes a list of intended column types, or a function that knows how to deal with each, as compared to an all-purpose function that just insists on purity at a local level and is a simpler program to write.

Avious

-----Original Message-----
From: Rui Barradas <ruipbarradas using sapo.pt>
To: Avi Gross <avigross using verizon.net>; dcarlson using tamu.edu <dcarlson using tamu.edu>; bgunter.4567 using gmail.com <bgunter.4567 using gmail.com>
Cc: r-help using r-project.org <r-help using r-project.org>
Sent: Sat, Jan 29, 2022 1:33 pm
Subject: Re: [R] Row exclude

Hello,

Thanks for the comments, a few others inline.

Às 18:04 de 29/01/2022, Avi Gross escreveu:
> There are many creative ways to solve problems and some may get you in 
> trouble if you present them in class while even in some work situations, 
> they may be hard for most to understand, let alone maintain and make 
> changes.
> 
> This group is amorphous enough that we have people who want "help" who 
> are new to the language, but also people who know plenty and encounter a 
> new kind of problem, and of course people who want to make use of what 
> they see as free labor.
> 
> Rui presented a very interesting idea and I like some aspects. But if 
> presented to most people, they might have to start looking up things.
> 
> But I admit I liked some of the ideas he uses and am adding them to my 
> bag of tricks. Some were overkill for this particular requirement but 
> that also makes them more general and useful.
> 
> First, was the use of locale-independent regular expressions like 
> [[:alpha:]] that match any combination of [:lower:] and [:upper:] and 
> thus are not restricted to ASCII characters. Since I do lots of my 
> activities in languages other than English and well might include names 
> with characters not normally found in English, or not even using an 
> overlapping  alphabet, I can easily encounter items in the Name column 
> that might not match [A-Za-z] but will match with [:alpha:].
> 
> I don't know if using [:digit:] has benefits over [0-9] and I do note 
> there was no requirement to match more complex numbers than integers so 
> no need to allow periods or scientific notation and so on.

Yes, I used locale-independent regular expressions. It's a habit I 
acquired a while ago. It took some time to stop using character ranges 
but once gone I'm more comfortable with the use of classes like 
[:alpha:] and [:digit:].
[After all, my native language (Portuguese) has 
cedillas(ES)/cedilhas(PT) and accented letters.]

> 
> Then there is the use of mapply. The more general version of the problem 
> presented would include a data.frame with any number of columns, where a 
> subset of the columns might need to be checked for conditions that vary 
> across the columns but may include some broad categories of conditions 
> that might be re-used. If all the conditions are regular expression 
> matches you can build, then you can extend the list Rui used to have 
> more items and also include expressions that always match so that some 
> columns are effectively ignored:
> 
> 
> regex <- list("[[:digit:]]", "[[:alpha:]]", "[[:alpha:]]", "[.*]")
> 
> 
> So this generalizes to N columns as long as you supply exactly N 
> patterns in the list, albeit mapply does recycle arguments if needed as 
> in the simplest case where you want all columns checked the same way.
> 
> Rui then uses an anonymous function to pass to mapply() and that is a 
> newish feature added recently to R, I think. It was perhaps meant 
> specifically to be used with the new pipe symbol, but can be used 
> anywhere but perhaps not in older versions of R.
> 
> 
> \(x, r) grepl(r, x)
> 

No, the new anonymous function wasn't specifically meant to be used with 
the new pipe operator, it was meant to be a short-hand notation for 
anonymous functions and used interchangeably with the old notation.

mapply(\(x, r), etc)
mapply(function(x, r) etc)

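A quick sketch showing that the two notations define the same function (f_short and f_long are names made up for this example; the backslash form needs R 4.1.0 or later):

```r
# Short-hand and classic notation for the same anonymous function
f_short <- \(x, r) grepl(r, x)
f_long  <- function(x, r) grepl(r, x)

# Both flag "Jack3" as containing a digit and "Bob" as clean
f_short(c("Jack3", "Bob"), "[[:digit:]]")
f_long(c("Jack3", "Bob"), "[[:digit:]]")
```
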
> 
> I note Rui also uses grepl() which returns a logical vector. I will show 
> my first attempt at the end where I used grep() to return index numbers 
> of matches instead. For this context, though, he made use of the fact 
> that mapply in this case returns a matrix of type logical:
> 
> i <- mapply(\(x, r) grepl(r, x), dat1, regex)
> 
>> i
> 
>        Name   Age Weight
> [1,] FALSE FALSE   TRUE
> [2,] FALSE FALSE  FALSE
> [3,] FALSE FALSE  FALSE
> [4,] FALSE  TRUE  FALSE
> [5,] FALSE FALSE  FALSE
> [6,]  TRUE FALSE  FALSE
> 
> And since R treats TRUE as 1 and FALSE as 0, then summing the rows gives 
> you a small integer between 0 and the number of columns, inclusive, and 
> only rows with no TRUE in them are wanted for this purpose:

And rowSums is a fast function.
> 
> 
> dat1[rowSums(i) == 0L, ]
> 
> All in all, nicely done, but not trivial to read without comments, LOL!
> 
> And, yes, it could be made even more obscure as a one-liner.
> 
> My first attempt was a bit more focused on the specific needs described. 
> I am not sure how the HTML destroyer in this mailing list might wreck 
> it, but I made it a two-statement version that is formatted on multiple 
> lines. An explanation first.
> 
> I looked at using grep() on one column at a time to look for what should 
> NOT be there and ask it to invert the answer so it effectively tells me 
> which rows to keep. So it tests column 1 ($Name) to see if it has digits 
> in it and returns FALSE if it finds them which later means toss this 
> row. It returns TRUE if that entry, so far, makes the row valid. But 
> note since I am not using grepl() it does not return TRUE/FALSE at all. 
> Rather it returns index numbers of the ones that now inverted are TRUE. 
> What goes in is a vector of individual items from a column of the data. 
> What goes out is the indices of which ones I want to keep that can be 
> used to index the entire data.frame. Based on the sample data, it returns 
> 1:5 as row 6 has a digit in "Jack3".
> 
> 
>    grep("[0-9]", dat1$Name, invert = TRUE)
> 
> 
> Similarly, two other grep() statements test if the second and third 
> columns contain any characters in "[a-zA-Z]" and return a similar index 
> vector if they are OK.
> 
> What I would then have are three numeric vectors, not a matrix. Each 
> contains a subset of all the indices:
> 
> 
>> grep("[0-9]", dat1$Name, invert = TRUE)
> [1] 1 2 3 4 5
>> grep("[a-zA-Z]", dat1$Age, invert = TRUE)
> [1] 1 2 3 5 6
>> grep("[a-zA-Z]", dat1$Weight, invert = TRUE)
> [1] 2 3 4 5 6
> 
> This set of data was designed to toss out one of each column so they all 
> are of the same length but need not be. Like Rui, my condition for 
> deciding which rows to keep is that all three of the index vectors have 
> a particular entry. He summed them as logicals, but my choice has small 
> integers so the way I combine them to exclude any not in all three is to 
> use a sort of set intersect method. The one built-in to R only handles 
> two at a time so I nested two calls to intersect but in a more general 
> case, I would use some package (or build my own function) that handles 
> intersecting any number of such items.
> 
> Here is the full code, minus the initialization.
> 
> 
> rows.keep <-
> intersect(intersect(grep("[0-9]", dat1$Name, invert = TRUE),
>                      grep("[a-zA-Z]", dat1$Age, invert = TRUE)),
>            grep("[a-zA-Z]", dat1$Weight, invert = TRUE))
> result <- dat1[rows.keep,]
> 
> 

Using the same idea, another two options, both with Reduce.

The 1st uses Avi's grep and regex's, the latter could be the character 
classes "[[:alpha:]]" and "[[:digit:]]" but this code is inspired by 
his. The results are put in a list and Reduce intersects the list 
members. Then subsetting is as usual.

The 2nd uses the fact that Map is a wrapper for mapply that defaults to 
not simplifying its output. grep/invert will find the non-matches and 
Reduce intersects the result list, as above.
From ?Map:

Map is a simple wrapper to mapply which does not attempt to simplify the 
result, similar to Common Lisp's mapcar (with arguments being recycled, 
however). Future versions may allow some control of the result type.

# 1st
grep_list <- list(
   grep("[0-9]", dat1$Name, invert = TRUE),
   grep("[a-zA-Z]", dat1$Age, invert = TRUE),
   grep("[a-zA-Z]", dat1$Weight, invert = TRUE)
)
keep1 <- Reduce(intersect, grep_list)
dat1[keep1,]

# 2nd
keep2 <- Map(\(x, r) grep(r, x, invert = TRUE), dat1, regex)
keep2 <- Reduce(intersect, keep2)

identical(keep1, keep2)
#[1] TRUE
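
For anyone who wants to run the comparison without the original data, here is a self-contained sketch. The data frame toy is invented for illustration and is not the poster's dat1. The classic function() notation is used so it also runs on R versions before 4.1.0.

```r
# Hypothetical data, invented here only for illustration
toy <- data.frame(
  Name   = c("Alex", "Bob", "Carol", "Dan", "Katy", "Jack3"),
  Age    = c("20", "25", "24", "3y", "35", "50"),
  Weight = c("13x", "142", "120", "175", "160", "180"),
  stringsAsFactors = FALSE
)
regex <- list("[[:digit:]]", "[[:alpha:]]", "[[:alpha:]]")

# mapply simplifies the equal-length logical results into a matrix
m <- mapply(function(x, r) grepl(r, x), toy, regex)
is.matrix(m)

# Map never simplifies: a list with one index vector per column
l <- Map(function(x, r) grep(r, x, invert = TRUE), toy, regex)
is.list(l)

# Two routes to the same rows: no TRUE in the matrix row, or
# an index present in every element of the list
keep_a <- which(rowSums(m) == 0L)
keep_b <- Reduce(intersect, l)
toy[keep_b, ]
```

Reduce(intersect, ...) works for any number of columns, which removes the need to nest intersect() calls by hand.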

Hope this helps,

Rui Barradas

> 
> -----Original Message-----
> From: Rui Barradas <ruipbarradas using sapo.pt>
> To: David Carlson <dcarlson using tamu.edu>; Bert Gunter <bgunter.4567 using gmail.com>
> Cc: r-help using R-project.org (r-help using r-project.org) <r-help using r-project.org>
> Sent: Sat, Jan 29, 2022 3:46 am
> Subject: Re: [R] Row exclude
> 
> Hello,
> 
> Getting creative, here is another way with mapply.
> 
> 
> regex <- list("[[:digit:]]", "[[:alpha:]]", "[[:alpha:]]")
> 
> i <- mapply(\(x, r) grepl(r, x), dat1, regex)
> dat1[rowSums(i) == 0L, ]
> 
> #    Name Age Weight
> # 2   Bob  25    142
> # 3 Carol  24    120
> # 5  Katy  35    160
> 
> 
> Hope this helps,
> 
> Rui Barradas