[R] How to create a new column based on the values from multiple columns which are matching a particular string?

Dénes Tóth toth.denes at kogentum.hu
Tue Jul 30 01:33:05 CEST 2019


Hi Bert,

see inline.

On 7/30/19 1:12 AM, Bert Gunter wrote:
> While Eric's solution is correct (mod "corner" cases like all NA's in
> a row), it can be made considerably more efficient.
> 
> One minor improvement can be made by using the idiom
> any(x == "A")
> instead of matching via %in% for the simple case of matching just a
> single value.
> 
> However, a considerable improvement can be made by getting fancy,
> taking advantage of do.call() and the pmax() function to mostly
> vectorize the calculation. Here are the details and timing on a large
> data frame.
> 
> (Note: I removed the names in the %in% approach for simplicity; this has
> almost no effect on timings.
> I also moved the as.integer() call out of the anonymous function so that it
> is called only once at the end, which improves efficiency a bit.)
> 
> 1. Eric's original:
> fun1 <-function(df,what)
> {
>    as.integer(unname(apply(df,MARGIN = 1,function(v) { what %in% v })))
> }
> 
> 2. Using any( x == "A") instead:
> fun2 <- function(df,what)
> {
>     as.integer(unname(apply(df, MARGIN = 1, function(x) any(x == what, na.rm = TRUE))))
> }
> 
> 3. Getting fancy to use pmax()
> fun3 <- function(df,what)
> {
>     z <- lapply(df,function(x)as.integer((x==what)))
>     do.call(pmax,c(z,na.rm=TRUE))
> }
> 
> Here are the timings:
> 
>> bigdf <- df[rep(1:10,1e4), rep(1:5, 50)]
>> dim(bigdf)
> [1] 100000    250
> 
>> system.time(res1 <- fun1(bigdf, "A"))
>     user  system elapsed
>    2.204   0.432   2.637
>>
>> system.time(res2 <- fun2(bigdf, "A"))
>     user  system elapsed
>    1.898   0.403   2.302
>>
>> system.time(res3 <- fun3(bigdf, "A"))
>     user  system elapsed
>    0.187   0.048   0.235
> 
> ## 10 times faster!
> 
>>
>> all.equal(res1,res2)
> [1] TRUE
>> all.equal(res1,res3)
> [1] TRUE
> 
> 
> NB: I freely admit that Eric's original solution may well be perfectly
> adequate, and the speed improvement is pointless. In that case, maybe
> this is at least somewhat instructive for someone.
> 
> Nevertheless, I would welcome further suggestions for improvement, as
> I suspect my "fancy" approach is still a ways from what one can do (in
> R code, without resorting to C++).

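# Flag rows in which 'what' occurs in at least one column (1 = yes, 0 = no).
fun4 <- function(df, what)
{
   # df == what gives a logical matrix; NA comparisons are ignored via na.rm = TRUE
   as.integer(rowSums(df == what, na.rm = TRUE) > 0)
}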

The function above works for both data.frame and matrix inputs. It is 
slower than fun3() if 'df' is a data.frame, but faster if 'df' is a 
matrix (which is a more efficient representation of the data if it 
contains only character columns).
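
As a quick check of the above, here is a minimal sketch that runs fun4() on 
Ana's example data, once as a data.frame and once as a character matrix. The 
object names 'eye', 'eye_mat', 'big_df' and 'big_mat', the shortened column 
names and the size of the enlarged data are my own assumptions for 
illustration. Timings depend on the machine, so none are shown.

```r
# Ana's example data; 'eye' is a hypothetical name (column names shortened)
eye <- data.frame(
  f6148_0_1 = c("A","C","D",NA,"D","A","C",NA,"B","A"),
  f6148_0_2 = c("B","C",NA,"A","C","B",NA,NA,"A","D"),
  f6148_0_3 = c("C","A","D","D","B","A",NA,NA,"A","B"),
  f6148_0_4 = c("D","D",NA,"B","A","C",NA,"C","A","B"),
  f6148_0_5 = c("C","C",NA,"D","B","C",NA,"D","D","B"),
  stringsAsFactors = FALSE
)

fun4 <- function(df, what)
{
  as.integer(rowSums(df == what, na.rm = TRUE) > 0)
}

# Same data as a character matrix
eye_mat <- as.matrix(eye)

case_df  <- fun4(eye, "A")
case_mat <- fun4(eye_mat, "A")
print(case_df)   # 1 1 0 1 1 1 0 0 1 1, the values Ana asked for
stopifnot(identical(case_df, case_mat))

# Enlarged copies for a rough timing comparison (sizes chosen arbitrarily)
big_df  <- eye[rep(1:10, 1e3), rep(1:5, 10)]
big_mat <- as.matrix(big_df)
print(system.time(fun4(big_df, "A")))
print(system.time(fun4(big_mat, "A")))
```

If the data are needed as a matrix anyway, converting once with as.matrix() 
and reusing the result avoids paying the conversion cost on every call.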

A note to Ana: 'df' is the name of a function in R (see ?stats::df); not 
a perfect choice for a variable name.
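
On the corner case you mention (rows consisting only of NA): the functions 
in this thread do not all agree there. A small sketch, using a made-up 'toy' 
data frame that is purely my own assumption, suggests that fun3() returns NA 
for such a row, because pmax(..., na.rm = TRUE) still gives NA when every 
input is NA, whereas fun4() returns 0.

```r
# fun3() as posted by Bert, fun4() as posted above
fun3 <- function(df, what)
{
  z <- lapply(df, function(x) as.integer(x == what))
  do.call(pmax, c(z, na.rm = TRUE))
}

fun4 <- function(df, what)
{
  as.integer(rowSums(df == what, na.rm = TRUE) > 0)
}

# Hypothetical toy data: the second row is all NA
toy <- data.frame(
  x = c("A", NA, "B"),
  y = c("B", NA, NA),
  stringsAsFactors = FALSE
)

res3 <- fun3(toy, "A")
res4 <- fun4(toy, "A")
print(res3)   # 1 NA 0: the all-NA row stays NA
print(res4)   # 1 0 0:  the all-NA row becomes 0
```

If 0 is preferred for such rows with fun3() as well, one option is to 
replace the NA values in its result with 0L afterwards.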

Cheers,
Denes


> 
> Cheers,
> Bert
> 
> 
> Bert Gunter
> 
> "The trouble with having an open mind is that people keep coming along
> and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> 
> 
> On Mon, Jul 29, 2019 at 12:38 PM Eric Berger <ericjberger using gmail.com> wrote:
>>
>> Read the help for apply and %in%
>>
>> ?apply
>> ?"%in%"
>>
>>
>> Sent from my iPhone
>>
>>> On 29 Jul 2019, at 22:23, Ana Marija <sokovic.anamarija using gmail.com> wrote:
>>>
>>> Thank you so much! Just to confirm here MARGIN=1 indicates that "A" should appear at least once per row?
>>>
>>>> On Mon, Jul 29, 2019 at 1:53 PM Eric Berger <ericjberger using gmail.com> wrote:
>>>> df$case <- apply(df,MARGIN = 1,function(v) { as.integer("A" %in% v) })
>>>>
>>>>
>>>>> On Mon, Jul 29, 2019 at 9:02 PM Ana Marija <sokovic.anamarija using gmail.com> wrote:
>>>>> sorry my bad, here is the edited version:
>>>>>
>>>>> so the data frame is this:
>>>>>
>>>>> df=data.frame(
>>>>>    eye_problemsdisorders_f6148_0_1=c("A","C","D",NA,"D","A","C",NA,"B","A"),
>>>>>    eye_problemsdisorders_f6148_0_2=c("B","C",NA,"A","C","B",NA,NA,"A","D"),
>>>>>    eye_problemsdisorders_f6148_0_3=c("C","A","D","D","B","A",NA,NA,"A","B"),
>>>>>    eye_problemsdisorders_f6148_0_4=c("D","D",NA,"B","A","C",NA,"C","A","B"),
>>>>>    eye_problemsdisorders_f6148_0_5=c("C","C",NA,"D","B","C",NA,"D","D","B"))
>>>>>
>>>>> and I would need to add a column named "case" whose values would be: 1,1,0,1,1,1,0,0,1,1
>>>>>
>>>>> so the "case" column is 1 for every row where the value "A" can be found in any column.
>>>>>
>>>>>> On Mon, Jul 29, 2019 at 12:53 PM Eric Berger <ericjberger using gmail.com> wrote:
>>>>>> You may have a typo/misstatement in your question.
>>>>>> You define a data frame with 5 columns, each of which has 10 elements, so your data frame has dimensions 10 x 5.
>>>>>> Then you request a new COLUMN which will have only 5 elements, which is not allowed. All columns of a data frame
>>>>>> must have the same length.
>>>>>>
>>>>>>> On Mon, Jul 29, 2019 at 8:42 PM Ana Marija <sokovic.anamarija using gmail.com> wrote:
>>>>>>> I have data frame which looks like this:
>>>>>>>
>>>>>>> df=data.frame(
>>>>>>>    eye_problemsdisorders_f6148_0_1=c(A,C,D,NA,D,A,C,NA,B,A),
>>>>>>>    eye_problemsdisorders_f6148_0_2=c(B,C,NA,A,C,B,NA,NA,A,D),
>>>>>>>    eye_problemsdisorders_f6148_0_3=c(C,A,D,D,B,A,NA,NA,A,B),
>>>>>>>    eye_problemsdisorders_f6148_0_4=c(D,D,NA,B,A,C,NA,C,A,B),
>>>>>>>    eye_problemsdisorders_f6148_0_5=c(C,C,NA,D,B,C,NA,D,D,B))
>>>>>>>
>>>>>>> In reality I have many more columns, and they don't always match
>>>>>>> the string "eye_problemsdisorders_f6148", and there are many more rows.
>>>>>>>
>>>>>>> What I would like to do is create a new column, say named "case" where I
>>>>>>> would have value "1" for every row where string "A" appears at least once
>>>>>>> in any column, if not the value would be "0". So in the above example
>>>>>>> column "case" would have these values: 1,1,1,1,0
>>>>>>> Thanks
>>>>>>> Ana
>>>>>>>
>>>>>>>          [[alternative HTML version deleted]]
>>>>>>>
>>>>>>> ______________________________________________
>>>>>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>


