[R] How to create a new column based on the values from multiple columns which are matching a particular string?

Bert Gunter bgunter@4567 @end|ng |rom gm@||@com
Tue Jul 30 03:12:36 CEST 2019


Thanks, Dénes. Good points.

Bert


On Mon, Jul 29, 2019 at 4:33 PM Dénes Tóth <toth.denes using kogentum.hu> wrote:
>
> Hi Bert,
>
> see inline.
>
> On 7/30/19 1:12 AM, Bert Gunter wrote:
> > While Eric's solution is correct( mod "corner" cases like all NA's in
> > a row), it can be made considerably more efficient.
> >
> > One minor improvement can be made by using the idiom
> > any(x == "A")
> > instead of matching via %in% for the simple case of matching just a
> > single value.
> >
> > However, a considerable improvement can be made by getting fancy,
> > taking advantage of do.call() and the pmax() function to mostly
> > vectorize the calculation. Here are the details and timing on a large
> > data frame.
> >
> > (Note: I removed the names in the %in% approach for simplicity. It has
> > almost no effect on timings.
> > I also moved the as.integer() call out of the function so that it is
> > called only once at the end, which improves efficiency a bit)
> >
> > 1. Eric's original:
> > fun1 <-function(df,what)
> > {
> >    as.integer(unname(apply(df,MARGIN = 1,function(v) { what %in% v })))
> > }
> >
> > 2. Using any( x == "A") instead:
> > fun2 <- function(df,what)
> > {
> >     as.integer(unname(apply(df,MARGIN =1, function(x)any(x == what,
> > na.rm=TRUE))))
> > }
> >
> > 3. Getting fancy to use pmax()
> > fun3 <- function(df,what)
> > {
> >     z <- lapply(df,function(x)as.integer((x==what)))
> >     do.call(pmax,c(z,na.rm=TRUE))
> > }
> >
> > Here are the timings:
> >
> >> bigdf <- df[rep(1:10,1e4), rep(1:5, 50)]
> >> dim(bigdf)
> > [1] 100000    250
> >
> >> system.time(res1 <- fun1(bigdf, "A"))
> >     user  system elapsed
> >    2.204   0.432   2.637
> >>
> >> system.time(res2 <- fun2(bigdf, "A"))
> >     user  system elapsed
> >    1.898   0.403   2.302
> >>
> >> system.time(res3 <- fun3(bigdf, "A"))
> >     user  system elapsed
> >    0.187   0.048   0.235
> >
> > ## 10 times faster!
> >
> >>
> >> all.equal(res1,res2)
> > [1] TRUE
> >> all.equal(res1,res3)
> > [1] TRUE
> >
> >
> > NB: I freely admit that Eric's original solution may well be perfectly
> > adequate, and the speed improvement is pointless. In that case, maybe
> > this is at least somewhat instructive for someone.
> >
> > Nevertheless, I would welcome further suggestions for improvement, as
> > I suspect my "fancy" approach is still a ways from what one can do (in
> > R code, without resorting to C++).
>
> fun4 <- function(df, what)
> {
>    as.integer(rowSums(df == what, na.rm = TRUE) > 0)
> }
>
> The function above works for data.frame and matrix inputs as well. It is
> slower than fun3() if 'df' is a data.frame, but is faster if 'df' is a
> matrix (which is a more efficient representation of the data if it
> contains only character columns).
>
> A note to Ana: 'df' is the name of a function in R (see ?stats::df); not
> a perfect choice for a variable name.
>
> Cheers,
> Denes
>
>
> >
> > Cheers,
> > Bert
> >
> >
> > Bert Gunter
> >
> > "The trouble with having an open mind is that people keep coming along
> > and sticking things into it."
> > -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> >
> > Bert Gunter
> >
> > "The trouble with having an open mind is that people keep coming along
> > and sticking things into it."
> > -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> >
> >
> > On Mon, Jul 29, 2019 at 12:38 PM Eric Berger <ericjberger using gmail.com> wrote:
> >>
> >> Read the help for apply and %in%
> >>
> >> ?apply
> >> ?%in%
> >>
> >>
> >> Sent from my iPhone
> >>
> >>> On 29 Jul 2019, at 22:23, Ana Marija <sokovic.anamarija using gmail.com> wrote:
> >>>
> >>> Thank you so much! Just to confirm here MARGIN=1 indicates that "A" should appear at least once per row?
> >>>
> >>>> On Mon, Jul 29, 2019 at 1:53 PM Eric Berger <ericjberger using gmail.com> wrote:
> >>>> df$case <- apply(df,MARGIN = 1,function(v) { as.integer("A" %in% v) })
> >>>>
> >>>>
> >>>>> On Mon, Jul 29, 2019 at 9:02 PM Ana Marija <sokovic.anamarija using gmail.com> wrote:
> >>>>> sorry my bad, here is the edited version:
> >>>>>
> >>>>> so the data frame is this:
> >>>>>
> >>>>> df=data.frame(
> >>>>>    eye_problemsdisorders_f6148_0_1=c("A","C","D",NA,"D","A","C",NA,"B","A"),
> >>>>>    eye_problemsdisorders_f6148_0_2=c("B","C",NA,"A","C","B",NA,NA,"A","D"),
> >>>>>    eye_problemsdisorders_f6148_0_3=c("C","A","D","D","B","A",NA,NA,"A","B"),
> >>>>>    eye_problemsdisorders_f6148_0_4=c("D","D",NA,"B","A","C",NA,"C","A","B"),
> >>>>>    eye_problemsdisorders_f6148_0_5=c("C","C",NA,"D","B","C",NA,"D","D","B")
> >>>>>
> >>>>> and I would need to put inside the column which would be named "case" and values inside would be: 1,1,0,1,1,1,0,0,1,1
> >>>>>
> >>>>> so "case" column is where value "A" can be found in any column.
> >>>>>
> >>>>>> On Mon, Jul 29, 2019 at 12:53 PM Eric Berger <ericjberger using gmail.com> wrote:
> >>>>>> You may have a typo/misstatement in your question.
> >>>>>> You define a data frame with 5 columns, each of which has 10 elements, so your data frame has dimensions 10 x 5.
> >>>>>> Then you request a new COLUMN which will have only 5 elements, which is not allowed. All columns of a data frame
> >>>>>> must have the same length.
> >>>>>>
> >>>>>>> On Mon, Jul 29, 2019 at 8:42 PM Ana Marija <sokovic.anamarija using gmail.com> wrote:
> >>>>>>> I have data frame which looks like this:
> >>>>>>>
> >>>>>>> df=data.frame(
> >>>>>>>    eye_problemsdisorders_f6148_0_1=c(A,C,D,NA,D,A,C,NA,B,A),
> >>>>>>>    eye_problemsdisorders_f6148_0_2=c(B,C,NA,A,C,B,NA,NA,A,D),
> >>>>>>>    eye_problemsdisorders_f6148_0_3=c(C,A,D,D,B,A,NA,NA,A,B),
> >>>>>>>    eye_problemsdisorders_f6148_0_4=c(D,D,NA,B,A,C,NA,C,A,B),
> >>>>>>>    eye_problemsdisorders_f6148_0_5=c(C,C,NA,D,B,C,NA,D,D,B))
> >>>>>>>
> >>>>>>> In reality I have much more columns and they don't always match
> >>>>>>> "eye_problemsdisorders_f6148" this string, and there is much more rows.
> >>>>>>>
> >>>>>>> What I would like to do is create a new column, say named "case" where I
> >>>>>>> would have value "1" for every row where string "A" appears at least once
> >>>>>>> in any column, if not the value would be "0". So in the above example
> >>>>>>> column "case" would have these values: 1,1,1,1,0
> >>>>>>> Thanks
> >>>>>>> Ana
> >>>>>>>
> >>>>>>>          [[alternative HTML version deleted]]
> >>>>>>>
> >>>>>>> ______________________________________________
> >>>>>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >>>>>>> and provide commented, minimal, self-contained, reproducible code.
> >>
> >>          [[alternative HTML version deleted]]
> >>
> >> ______________________________________________
> >> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> >
> > ______________________________________________
> > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >



More information about the R-help mailing list