[R] assumptions about how things are done

Sat Oct 9 21:35:55 CEST 2021

This is supposed to be a forum for help so general and philosophical
discussions belong elsewhere, or nowhere.

Having said that, I want to make a brief point. Both new and experienced
people make implicit assumptions about the code they use. Often nobody looks
at how the sausage is made. The recent discussion of ifelse() made me take a
look and I was not thrilled.

My naive view was that ifelse() was implemented as a sort of loop construct.
I mean if I have a vector of length N and perhaps a few other vectors of the
same length, I might say:

result <- ifelse(condition-on-vector-A, result-if-true-using-vectors,
result-if-false-using-vectors)

So say I want to take a vector of integers from 1 to N and produce a second
vector in which each element is either the number itself, if it is prime, or
NA. If I have a function called is.prime() that checks a single number and
returns TRUE/FALSE, it might look like this:

primed <- ifelse(is.prime(A), A, NA)

So A[2] will be mapped to 2 and A[3] to 3, but A[1] (1 is not a prime) and
A[4], being composite, become NA and so on.
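
As a minimal sketch of the above: is.prime() is not part of base R, so the
version here is a hypothetical helper written only for illustration. Because
it checks one number at a time, as described, it is applied element by
element with vapply() before the result is handed to ifelse().

```r
# Hypothetical helper (not in base R): TRUE if n is prime, by trial division.
is.prime <- function(n) {
  if (n < 2) return(FALSE)
  if (n < 4) return(TRUE)            # 2 and 3 are prime
  if (n %% 2 == 0) return(FALSE)     # other even numbers are composite
  d <- 3
  while (d * d <= n) {
    if (n %% d == 0) return(FALSE)
    d <- d + 2
  }
  TRUE
}

A <- 1:10
# is.prime() takes a single number, so build the logical vector first
is_p <- vapply(A, is.prime, logical(1))
primed <- ifelse(is_p, A, NA)
primed   # primes are kept, everything else becomes NA
```

If is.prime() were written to accept a whole vector, the vapply() step would
not be needed and the one-line form would work as is.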

If you wrote the above using loops, you would iterate over the indices 1 to
N and apply the test to each element in turn. There are many complications,
as R allows the vectors to differ in length and recycles the shorter ones as
needed.

What I found the implementation of ifelse() actually does is roughly this:

Make a vector of the right length for the results, initially empty.

Make a vector evaluating the condition so it is effectively a Boolean
result.

Calculate which indices are TRUE. Secondarily, calculate another set of
indices that are false.

Calculate ALL the THEN values and likewise ALL the ELSE values.

Now copy into the result all the THEN values indexed by the TRUE above and
then all the ELSE values indicated by the FALSE above.

In plain English, make a result from two other results based on picking
either one from menu A or one from menu B.
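
The steps above can be sketched in a few lines of R. This is a simplified
illustration, not the actual source of ifelse(); the name my_ifelse() is
made up, and details such as attribute handling and missing values in the
condition are glossed over.

```r
# my_ifelse() is a hypothetical name; a simplified sketch of the steps above.
my_ifelse <- function(test, yes, no) {
  test <- as.logical(test)               # condition as a logical vector
  len <- length(test)
  result <- rep(NA, len)                 # result vector of the right length
  true_idx  <- which(test)               # indices where the condition is TRUE
  false_idx <- which(!test)              # indices where it is FALSE
  yes_all <- rep(yes, length.out = len)  # ALL the "then" values, recycled
  no_all  <- rep(no,  length.out = len)  # ALL the "else" values, recycled
  result[true_idx]  <- yes_all[true_idx]   # pick from menu A
  result[false_idx] <- no_all[false_idx]   # pick from menu B
  result
}

x <- c(4, -1, 9, -16)
my_ifelse(x >= 0, x, 0)   # negative entries are replaced by 0
```

Note that both yes and no are complete vectors by the time any element is
picked, which is where the extra work comes from.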

That is not a bad algorithm and in a vectorized language like R, maybe even
quite effective and efficient. But it does a lot of extra work, since it
computes both alternatives for every element and then throws roughly half of
those values away.

I suspect the implementation could be made much faster by doing some of the
work internally in a language like C.

But now that I know what this implementation does, I might have some qualms
about using it in some situations. The original complaint led to other
observations and needs, and blindly using a supplied function like ifelse()
may not be a good solution for some of them.

I note how I had to reorient my own work elsewhere, using a group of packages
called the tidyverse, when they added a function to allow rowwise
manipulation of the data, as compared to an ifelse-like method that works on
whole columns at once. There is room for many approaches, and if a function
is not doing quite what you want, something else may better meet your needs
OR you may want to see if you can copy the existing function and modify it
for your own personal needs.

In the case we mentioned, the goal was to avoid printing selected warnings.
Since the function is readable, it can easily be modified in a copy to find
what is causing the warnings and either rewrite a bit to avoid them or start
over with perhaps your own function that tests before doing things and
avoids tripping the condition (generating a NaN) entirely.
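
As an illustration of testing before doing things, here is a small sketch.
The use of sqrt() on negative numbers is my assumption, chosen only because
it is a simple way to generate a NaN warning; the original case may have
involved a different function.

```r
x <- c(4, -1, 9, -16)

# ifelse() evaluates sqrt(x) for EVERY element, so the negative entries
# raise a "NaNs produced" warning even though those results are discarded.
warned_ifelse <- tryCatch({
  ifelse(x >= 0, sqrt(x), NA)
  FALSE
}, warning = function(w) TRUE)

# Test first, then compute only where it is safe: no NaN and no warning.
warned_safe <- tryCatch({
  ok <- x >= 0
  safe <- rep(NA_real_, length(x))
  safe[ok] <- sqrt(x[ok])
  FALSE
}, warning = function(w) TRUE)

warned_ifelse   # a warning was raised
warned_safe     # no warning was raised
safe            # square roots where defined, NA elsewhere
```

The second form never calls sqrt() on a negative number, so there is nothing
to suppress.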

Like many languages, R is a bit too rich. You can piggyback on the work of
others, but with some caution, as they did not necessarily have you in mind
with what they created.
