[R] dplyr - counting a number of specific values in each column - for all columns at once

David Winsemius dwinsemius at comcast.net
Tue Jun 16 22:42:41 CEST 2015


On Jun 16, 2015, at 11:18 AM, Clint Bowman wrote:

> Thanks, Dimitri.  Bert is the real wizard here--I'll bet he can conjure up an elegant solution.

This would be a base method:

> by( md[-4]==5, md[4], colSums)
device: 1
a b c 
1 2 0 
----------------------------------------------------- 
device: 2
a b c 
1 1 0 
----------------------------------------------------- 
device: 3
a b c 
1 0 2 
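
One caveat, offered as a sketch rather than a definitive fix: the counts printed above look like they came from md before the two NA assignments in Dimitri's example. With the NAs in place, colSums returns NA for the affected columns unless na.rm = TRUE is supplied. Because by() forwards extra arguments to FUN, something like the following should handle it (md is rebuilt here exactly as posted in the thread; the name res is only for illustration):

```r
# Rebuild the example data from the thread, including the two NAs
md <- data.frame(a = c(3, 5, 4, 5, 3, 5),
                 b = c(5, 5, 5, 4, 4, 1),
                 c = c(1, 3, 4, 3, 5, 5),
                 device = c(1, 1, 2, 2, 3, 3))
md[2, 3] <- NA
md[4, 1] <- NA

# Extra arguments given to by() are passed on to FUN,
# so na.rm = TRUE reaches colSums and NAs are skipped
res <- by(md[-4] == 5, md[4], colSums, na.rm = TRUE)
res
```

With the NAs in place, device 2 then reports 0 for column a instead of 1.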

You could adapt that to use myvars:

> by(md[myvars]==5, md[!names(md) %in% myvars],colSums)
device: 1
a b c 
1 2 0 
----------------------------------------------------- 
device: 2
a b c 
1 1 0 
----------------------------------------------------- 
device: 3
a b c 
1 0 2 

And if you want them smushed into a matrix then use rbind:

> do.call( rbind, by(md[myvars]==5, md[!names(md) %in% myvars],colSums))
  a b c
1 1 2 0
2 1 1 0
3 1 0 2
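
As for the original dplyr question: in dplyr 1.0.0 and later (which postdates this thread, so this assumes a newer installed version and is not something anyone above proposed), across() together with all_of() applies one summary to every column named in myvars. A minimal sketch, with the counts.* names chosen to match Dimitri's hand-written version:

```r
library(dplyr)

# Example data as posted in the thread, including the two NAs
md <- data.frame(a = c(3, 5, 4, 5, 3, 5),
                 b = c(5, 5, 5, 4, 4, 1),
                 c = c(1, 3, 4, 3, 5, 5),
                 device = c(1, 1, 2, 2, 3, 3))
md[2, 3] <- NA
md[4, 1] <- NA
myvars <- c("a", "b", "c")

# Count the 5s in every column listed in myvars, separately per device;
# .names builds the output column names from the input column names
counts <- md %>%
  group_by(device) %>%
  summarise(across(all_of(myvars),
                   ~ sum(.x == 5, na.rm = TRUE),
                   .names = "counts.{.col}"))
counts
```

Because the columns are selected through myvars, the same call works unchanged however long that vector gets.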

> 
> For me, just reaching a desired endpoint is enough<g>.
> 
> Clint
> 
> Clint Bowman			INTERNET:	clint at ecy.wa.gov
> Air Quality Modeler		INTERNET:	clint at math.utah.edu
> Department of Ecology		VOICE:		(360) 407-6815
> PO Box 47600			FAX:		(360) 407-7534
> Olympia, WA 98504-7600
> 
>        USPS:           PO Box 47600, Olympia, WA 98504-7600
>        Parcels:        300 Desmond Drive, Lacey, WA 98503-1274
> 
> On Tue, 16 Jun 2015, Dimitri Liakhovitski wrote:
> 
>> Thank you, Clint.
>> That's the thing: it's relatively easy to do it in base, but the
>> resulting code is not THAT simple.
>> I thought dplyr would make it easy...
>> 
>> On Tue, Jun 16, 2015 at 2:06 PM, Clint Bowman <clint at ecy.wa.gov> wrote:
>>> May want to add headers, but the following prints the device number after
>>> each set of sums:
>>> 
>>> for (dev in unique(md$device))
>>> {cat(colSums(md[md$device == dev, myvars] == 5, na.rm = TRUE), dev, "\n")}
>>> 
>>> On Tue, 16 Jun 2015, Dimitri Liakhovitski wrote:
>>> 
>>>> Except, of course, Bert, that you forgot that it had to be done by
>>>> device. Your solution ignores the device.
>>>> 
>>>> md <- data.frame(a = c(3,5,4,5,3,5), b = c(5,5,5,4,4,1),
>>>>     c = c(1,3,4,3,5,5), device = c(1,1,2,2,3,3))
>>>> myvars = c("a", "b", "c")
>>>> md[2,3] <- NA
>>>> md[4,1] <- NA
>>>> md
>>>> vapply(md[myvars], function(x) sum(x==5,na.rm=TRUE),1L)
>>>> 
>>>> But the result should be by device.
>>>> 
>>>> On Tue, Jun 16, 2015 at 1:56 PM, Dimitri Liakhovitski
>>>> <dimitri.liakhovitski at gmail.com> wrote:
>>>>> 
>>>>> Thank you, Bert.
>>>>> I'll be honest - I am just learning dplyr and was wondering if one
>>>>> could do it in dplyr.
>>>>> But of course your solution is perfect...
>>>>> 
>>>>> On Tue, Jun 16, 2015 at 1:50 PM, Bert Gunter <bgunter.4567 at gmail.com>
>>>>> wrote:
>>>>>> 
>>>>>> Well, dplyr seems a bit of overkill as it's so simple with plain old
>>>>>> vapply() in base R :
>>>>>> 
>>>>>> 
>>>>>> > dat <- data.frame(a=sample(1:5,10,rep=TRUE),
>>>>>> +                   b=sample(3:7,10,rep=TRUE),
>>>>>> +                   g = sample(7:9,10,rep=TRUE))
>>>>>> > vapply(dat,function(x)sum(x==5,na.rm=TRUE),1L)
>>>>>> a b g
>>>>>> 5 4 0
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Cheers,
>>>>>> Bert
>>>>>> 
>>>>>> Bert Gunter
>>>>>> 
>>>>>> "Data is not information. Information is not knowledge. And knowledge is
>>>>>> certainly not wisdom."
>>>>>>   -- Clifford Stoll
>>>>>> 
>>>>>> On Tue, Jun 16, 2015 at 10:24 AM, Dimitri Liakhovitski
>>>>>> <dimitri.liakhovitski at gmail.com> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> Hello!
>>>>>>> 
>>>>>>> I have a data frame:
>>>>>>> 
>>>>>>> md <- data.frame(a = c(3,5,4,5,3,5), b = c(5,5,5,4,4,1),
>>>>>>>      c = c(1,3,4,3,5,5), device = c(1,1,2,2,3,3))
>>>>>>> myvars = c("a", "b", "c")
>>>>>>> md[2,3] <- NA
>>>>>>> md[4,1] <- NA
>>>>>>> md
>>>>>>> 
>>>>>>> I want to count the number of 5s in each column, by device. I can do
>>>>>>> it like this:
>>>>>>> 
>>>>>>> library(dplyr)
>>>>>>> group_by(md, device) %>%
>>>>>>> summarise(counts.a = sum(a==5, na.rm = TRUE),
>>>>>>>          counts.b = sum(b==5, na.rm = TRUE),
>>>>>>>          counts.c = sum(c==5, na.rm = TRUE))
>>>>>>> 
>>>>>>> However, in real life I'll have tons of variables (the length of
>>>>>>> 'myvars' can be very large) - so that I can't specify those counts.a,
>>>>>>> counts.b, etc. manually - dozens of times.
>>>>>>> 
>>>>>>> Does dplyr allow running the count of 5s on all 'myvars' columns at
>>>>>>> once?
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> Dimitri Liakhovitski
>>>>>>> 
>>>>>>> ______________________________________________
>>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>> PLEASE do read the posting guide
>>>>>>> http://www.R-project.org/posting-guide.html
>>>>>>> and provide commented, minimal, self-contained, reproducible code.

David Winsemius
Alameda, CA, USA


