[R] How to create a new column based on the values from multiple columns which are matching a particular string?

Bert Gunter bgunter@4567 @end|ng |rom gm@||@com
Tue Jul 30 01:12:25 CEST 2019


While Eric's solution is correct( mod "corner" cases like all NA's in
a row), it can be made considerably more efficient.

One minor improvement can be made by using the idiom
any(x == "A")
instead of matching via %in% for the simple case of matching just a
single value.

However, a considerable improvement can be made by getting fancy,
taking advantage of do.call() and the pmax() function to mostly
vectorize the calculation. Here are the details and timing on a large
data frame.

(Note: I removed the names in the %in% approach for simplicity. It has
almost no effect on timings.
I also moved the as.integer() call out of the function so that it is
called only once at the end, which improves efficiency a bit)

1. Eric's original:
fun1 <-function(df,what)
{
  as.integer(unname(apply(df,MARGIN = 1,function(v) { what %in% v })))
}

2. Using any( x == "A") instead:
fun2 <- function(df,what)
{
   as.integer(unname(apply(df,MARGIN =1, function(x)any(x == what,
na.rm=TRUE))))
}

3. Getting fancy to use pmax()
fun3 <- function(df,what)
{
   z <- lapply(df,function(x)as.integer((x==what)))
   do.call(pmax,c(z,na.rm=TRUE))
}

Here are the timings:

> bigdf <- df[rep(1:10,1e4), rep(1:5, 50)]
> dim(bigdf)
[1] 100000    250

> system.time(res1 <- fun1(bigdf, "A"))
   user  system elapsed
  2.204   0.432   2.637
>
> system.time(res2 <- fun2(bigdf, "A"))
   user  system elapsed
  1.898   0.403   2.302
>
> system.time(res3 <- fun3(bigdf, "A"))
   user  system elapsed
  0.187   0.048   0.235

## 10 times faster!

>
> all.equal(res1,res2)
[1] TRUE
> all.equal(res1,res3)
[1] TRUE


NB: I freely admit that Eric's original solution may well be perfectly
adequate, and the speed improvement is pointless. In that case, maybe
this is at least somewhat instructive for someone.

Nevertheless, I would welcome further suggestions for improvement, as
I suspect my "fancy" approach is still a ways from what one can do (in
R code, without resorting to C++).

Cheers,
Bert


Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Mon, Jul 29, 2019 at 12:38 PM Eric Berger <ericjberger using gmail.com> wrote:
>
> Read the help for apply and %in%
>
> ?apply
> ?%in%
>
>
> Sent from my iPhone
>
> > On 29 Jul 2019, at 22:23, Ana Marija <sokovic.anamarija using gmail.com> wrote:
> >
> > Thank you so much! Just to confirm here MARGIN=1 indicates that "A" should appear at least once per row?
> >
> >> On Mon, Jul 29, 2019 at 1:53 PM Eric Berger <ericjberger using gmail.com> wrote:
> >> df$case <- apply(df,MARGIN = 1,function(v) { as.integer("A" %in% v) })
> >>
> >>
> >>> On Mon, Jul 29, 2019 at 9:02 PM Ana Marija <sokovic.anamarija using gmail.com> wrote:
> >>> sorry my bad, here is the edited version:
> >>>
> >>> so the data frame is this:
> >>>
> >>> df=data.frame(
> >>>   eye_problemsdisorders_f6148_0_1=c("A","C","D",NA,"D","A","C",NA,"B","A"),
> >>>   eye_problemsdisorders_f6148_0_2=c("B","C",NA,"A","C","B",NA,NA,"A","D"),
> >>>   eye_problemsdisorders_f6148_0_3=c("C","A","D","D","B","A",NA,NA,"A","B"),
> >>>   eye_problemsdisorders_f6148_0_4=c("D","D",NA,"B","A","C",NA,"C","A","B"),
> >>>   eye_problemsdisorders_f6148_0_5=c("C","C",NA,"D","B","C",NA,"D","D","B")
> >>>
> >>> and I would need to put inside the column which would be named "case" and values inside would be: 1,1,0,1,1,1,0,0,1,1
> >>>
> >>> so "case" column is where value "A" can be found in any column.
> >>>
> >>>> On Mon, Jul 29, 2019 at 12:53 PM Eric Berger <ericjberger using gmail.com> wrote:
> >>>> You may have a typo/misstatement in your question.
> >>>> You define a data frame with 5 columns, each of which has 10 elements, so your data frame has dimensions 10 x 5.
> >>>> Then you request a new COLUMN which will have only 5 elements, which is not allowed. All columns of a data frame
> >>>> must have the same length.
> >>>>
> >>>>> On Mon, Jul 29, 2019 at 8:42 PM Ana Marija <sokovic.anamarija using gmail.com> wrote:
> >>>>> I have data frame which looks like this:
> >>>>>
> >>>>> df=data.frame(
> >>>>>   eye_problemsdisorders_f6148_0_1=c(A,C,D,NA,D,A,C,NA,B,A),
> >>>>>   eye_problemsdisorders_f6148_0_2=c(B,C,NA,A,C,B,NA,NA,A,D),
> >>>>>   eye_problemsdisorders_f6148_0_3=c(C,A,D,D,B,A,NA,NA,A,B),
> >>>>>   eye_problemsdisorders_f6148_0_4=c(D,D,NA,B,A,C,NA,C,A,B),
> >>>>>   eye_problemsdisorders_f6148_0_5=c(C,C,NA,D,B,C,NA,D,D,B))
> >>>>>
> >>>>> In reality I have much more columns and they don't always match
> >>>>> "eye_problemsdisorders_f6148" this string, and there is much more rows.
> >>>>>
> >>>>> What I would like to do is create a new column, say named "case" where I
> >>>>> would have value "1" for every row where string "A" appears at least once
> >>>>> in any column, if not the value would be "0". So in the above example
> >>>>> column "case" would have these values: 1,1,1,1,0
> >>>>> Thanks
> >>>>> Ana
> >>>>>
> >>>>>         [[alternative HTML version deleted]]
> >>>>>
> >>>>> ______________________________________________
> >>>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >>>>> and provide commented, minimal, self-contained, reproducible code.
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



More information about the R-help mailing list