[R] Create new data frame with conditional sums

Sun Oct 15 16:39:59 CEST 2023

Dear Bert,

On 2023-10-15 10:29 a.m., Bert Gunter wrote:

> 
> 
> Under the hood, sapply() is also a loop (at the interpreted level). As
> is lapply(), etc.

Indeed. I think that there's a neurotic aversion to using loops in R in 
favour of functions in the apply family. Sometimes one approach leads to 
a more transparent solution, sometimes the other.
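
As an illustration of a version with no loop at the R level, not even a hidden one, here is a minimal sketch. The name f4 is hypothetical and is not one of the functions posted earlier in the thread. It assumes that Pct and Totpop have the same length, and it builds a length(Pct) by length(Cutoff) logical matrix with outer(), so its memory use grows with the product of the two lengths.

```r
# Sketch of a loop-free alternative; f4 is a hypothetical name.
f4 <- function(Cutoff, Pct, Totpop){
    # keep[i, j] is TRUE when Pct[i] >= Cutoff[j]
    keep <- outer(Pct, Cutoff, ">=")
    # Totpop is recycled down each column, so column j sums Totpop
    # over the rows with Pct >= Cutoff[j]
    Pop <- colSums(keep * Totpop)
    cbind(Cutoff, Pop)
}

# The toy data from the original question
dummydata <- data.frame(
    Pct    = c(0.05, 0.03, 0.01, 0.12, 0.21, 0.04, 0.07, 0.09, 0.06, 0.03),
    Totpop = c(4000, 3500, 4500, 4100, 3900, 4250, 5100, 4700, 4950, 4800)
)

res <- with(dummydata, f4(seq(0, 0.15, by=0.01), Pct, Totpop))
```

On dummydata this should give the same two-column matrix as f3() in the quoted message. Whether it is faster than the loops for a given problem size is something I'd check with microbenchmark rather than assume.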

Many years ago, Uwe Ligges and I wrote an R News (the predecessor of the 
R Journal) "Help Desk" article about loops in R, available at 
<https://cran.r-project.org/doc/Rnews/Rnews_2008-1.pdf>. Some of the 
advice there is probably outdated due to internal changes in R, but much 
of it is probably still valid.

BTW, I don't think that my original posting ever made it to r-help. It 
was intercepted by r-help's filters and I eventually deleted it because 
so many other people had by then responded. Let's hope that this message 
makes it to the list.

Best,
  John

> 
> -- Bert
> 
> On Sun, Oct 15, 2023 at 2:34 AM Jason Stout, M.D. <jason.stout using duke.edu> wrote:
>>
>> That's very helpful and instructive, thank you!
>>
>> Jason Stout, MD, MHS
>> Box 102359-DUMC
>> Durham, NC 27710
>> FAX 919-681-7494
>> ________________________________
>> From: John Fox <jfox using mcmaster.ca>
>> Sent: Saturday, October 14, 2023 10:13 AM
>> To: Jason Stout, M.D. <jason.stout using duke.edu>
>> Cc: r-help using r-project.org <r-help using r-project.org>
>> Subject: Re: [R] Create new data frame with conditional sums
>>
>> Dear Jason,
>>
>> I don't think that there's anything wrong with using a loop to solve
>> this problem, but it's generally a good idea to pre-allocate space for
>> the result rather than build it up one value at a time, which may cause
>> unnecessary copying of the object.
>>
>> Here are three solutions:
>>
>> f1 <- function(Cutoff, Pct, Totpop){
>>     Pop <- numeric(0)
>>     for (i in seq_along(Cutoff))
>>       Pop[i] <- sum(Totpop[Pct >= Cutoff[i]])
>>     cbind(Cutoff, Pop)
>> }
>>
>> f2 <- function(Cutoff, Pct, Totpop){
>>     Pop <- numeric(length(Cutoff))
>>     for (i in seq_along(Cutoff))
>>       Pop[i] <- sum(Totpop[Pct >= Cutoff[i]])
>>     cbind(Cutoff, Pop)
>> }
>>
>> f3 <- function(Cutoff, Pct, Totpop){
>>     Pop <- sapply(Cutoff, function(c) sum(Totpop[Pct >= c]))
>>     cbind(Cutoff, Pop)
>> }
>>
>> The first is similar to yours; the second pre-allocates space for the
>> result but still uses a loop; and the third avoids the loop. All produce
>> the same result, for example,
>>
>>   > with(dummydata, f3(seq(0, 0.15, by=0.01), Pct, Totpop))
>>         Cutoff   Pop
>>    [1,]   0.00 43800
>>    [2,]   0.01 43800
>>    [3,]   0.02 39300
>>    [4,]   0.03 39300
>>    [5,]   0.04 31000
>>    [6,]   0.05 26750
>>    [7,]   0.06 22750
>>    [8,]   0.07 17800
>>    [9,]   0.08 12700
>>   [10,]   0.09 12700
>>   [11,]   0.10  8000
>>   [12,]   0.11  8000
>>   [13,]   0.12  8000
>>   [14,]   0.13  3900
>>   [15,]   0.14  3900
>>   [16,]   0.15  3900
>>
>> Here are some timings:
>>
>>   > microbenchmark::microbenchmark(
>> +   preallocate=with(dummydata, f2(seq(0, 0.15, by=0.01),
>> +                                  Pct, Totpop)),
>> +   yourloop=with(dummydata, f1(seq(0, 0.15, by=0.01),
>> +                               Pct, Totpop)),
>> +   sapply=with(dummydata, f3(seq(0, 0.15, by=0.01),
>> +                             Pct, Totpop)),
>> +   times=1000
>> + )
>> Unit: microseconds
>>           expr    min      lq     mean  median     uq    max neval cld
>>    preallocate 13.776 14.3910 15.74195 14.9240 16.318 56.908  1000 a
>>       yourloop 15.129 15.7645 17.26809 16.3795 18.368 73.964  1000  b
>>         sapply 22.304 23.2060 25.19868 24.1080 26.814 48.544  1000   c
>>
>> So, for this very small problem, there are small but reliable
>> differences in timing among the three solutions, and the version that
>> avoids the loop is slowest. I suspect, but haven't verified, that for a
>> much larger problem, your solution would be slowest.
>>
>> I hope this helps,
>>    John
>>
>> --
>> John Fox, Professor Emeritus
>> McMaster University
>> Hamilton, Ontario, Canada
>> web: https://www.john-fox.ca/
>> On 2023-10-13 4:13 p.m., Jason Stout, M.D. wrote:
>>>
>>> This seems like it should be simple but I can't get it to work properly.  I'm starting with a data frame like this:
>>>
>>> Tract   Pct    Totpop
>>> 1       0.05   4000
>>> 2       0.03   3500
>>> 3       0.01   4500
>>> 4       0.12   4100
>>> 5       0.21   3900
>>> 6       0.04   4250
>>> 7       0.07   5100
>>> 8       0.09   4700
>>> 9       0.06   4950
>>> 10      0.03   4800
>>> And I want to end up with a data frame with two columns, a "Cutoff" column that is a simple sequence of equally spaced cutoffs (let's say in this case from 0-0.15 by 0.01) and a "Pop" column which equals the sum of "Totpop" in the prior data frame in which "Pct" is greater than or equal to "cutoff."  So in this toy example, this is what I want for a result:
>>>
>>>      Cutoff   Pop
>>> 1    0.00 43800
>>> 2    0.01 43800
>>> 3    0.02 39300
>>> 4    0.03 39300
>>> 5    0.04 31000
>>> 6    0.05 26750
>>> 7    0.06 22750
>>> 8    0.07 17800
>>> 9    0.08 12700
>>> 10   0.09 12700
>>> 11   0.10  8000
>>> 12   0.11  8000
>>> 13   0.12  8000
>>> 14   0.13  3900
>>> 15   0.14  3900
>>> 16   0.15  3900
>>>
>>> I can do this with a for loop but it seems there should be an easier, vectorized way that would be more efficient.  Here is a reproducible example:
>>>
>>> dummydata <- data.frame(
>>>   Tract  = seq(1, 10, by = 1),
>>>   Pct    = c(0.05, 0.03, 0.01, 0.12, 0.21, 0.04, 0.07, 0.09, 0.06, 0.03),
>>>   Totpop = c(4000, 3500, 4500, 4100, 3900, 4250, 5100, 4700, 4950, 4800)
>>> )
>>> dfrm <- data.frame(matrix(ncol = 2, nrow = 0, dimnames = list(NULL, c("Cutoff", "Pop"))))
>>> for (i in seq(0, 0.15, by = 0.01)) {
>>>   temp <- sum(dummydata[dummydata$Pct >= i, "Totpop"])
>>>   dfrm[nrow(dfrm) + 1, ] <- c(i, temp)
>>> }
>>>
>>> Jason Stout, MD, MHS
>>> Division of Infectious Diseases
>>> Dept of Medicine
>>> Duke University
>>> Box 102359-DUMC
>>> Durham, NC 27710
>>> FAX 919-681-7494
>>>
>>>
>>>           [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.