[R] Efficiently sum 3d table

David Winsemius dwinsemius at comcast.net
Tue Apr 17 00:01:43 CEST 2012


On Apr 16, 2012, at 4:32 PM, Bert Gunter wrote:

> David:
>
> Here is a comparison of the gains to be made by vectorization (again,
> assuming I have interpreted your query correctly)
>
> ## create a list of arrays
>> z <- lapply(seq_len(10000),function(i)array(runif(24),dim=2:4))
> ## Using an apply type approach
>> system.time(ans1 <- array(do.call(mapply,c(sum,z)),dim=2:4))
>   user  system elapsed
>   0.62    0.00    0.62
> ## vectorizing via rowSums and cbind
>> system.time(ans2 <-array(rowSums(do.call(cbind,z)),dim=2:4))
>   user  system elapsed
>   0.02    0.00    0.02
>> identical(ans1,ans2)
> [1] TRUE
>

It's also an example of how different OSes may perform differently.
My Mac (an early 2008 model) is nowhere near as efficient with the
second solution, despite being in the same ballpark with the first:

 > system.time(ans1 <- array(do.call(mapply,c(sum,z)),dim=2:4))
    user  system elapsed
   0.841   0.007   0.851
 > system.time(ans2 <-array(rowSums(do.call(cbind,z)),dim=2:4))
    user  system elapsed
   0.132   0.003   0.145

And on my system the Reduce strategy is fastest, if only by a small margin:

 > system.time(ans3 <- Reduce("+", z) )
    user  system elapsed
   0.129   0.001   0.134
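
For anyone who wants to check that the three approaches agree before
comparing timings, here is a minimal sketch. It assumes a smaller list
than the one used above and a fixed seed (both are my choices for
illustration, not part of the thread). Reduce("+", z) is a left fold,
so it should match an explicit loop exactly; the rowSums() version
accumulates differently, so it is compared with all.equal() rather
than identical().

```r
# Sketch only: smaller list and fixed seed chosen for illustration
set.seed(42)
z <- lapply(seq_len(1000), function(i) array(runif(24), dim = 2:4))

# Reduce("+", z) folds the list from the left: ((z1 + z2) + z3) + ...
ans_reduce <- Reduce("+", z)

# The same fold written as an explicit loop
ans_loop <- z[[1]]
for (i in seq_along(z)[-1]) ans_loop <- ans_loop + z[[i]]

# Vectorized alternative from the thread: one column per array, then row sums
ans_rowsums <- array(rowSums(do.call(cbind, z)), dim = 2:4)

# Same operations in the same order, so an exact comparison is reasonable
identical(ans_reduce, ans_loop)

# Different accumulation, so compare within numerical tolerance
all.equal(ans_reduce, ans_rowsums)
```

Wrapping each call in system.time(), as above, then gives timings for
results that are known to be the same.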

The Reduce() strategy would also preserve other object attributes
(class and dimnames in the example below), something I'm quite sure
the re-dimensioning of rowSums(cbind(.)) could not preserve.

  a <- letters[1:3]  # assumed; 'a' was not defined in the original post
  L <- list( table(a, sample(a)),
             table(a, sample(a)),
             table(a, sample(a)),
             table(a, sample(a)),
             table(a, sample(a)) )

  str(Reduce("+", L) )
  'table' int [1:3, 1:3] 1 1 3 4 0 1 0 4 1
  - attr(*, "dimnames")=List of 2
   ..$ a: chr [1:3] "a" "b" "c"
   ..$  : chr [1:3] "a" "b" "c"

  str( array(rowSums(do.call(cbind,L)),dim=c(3,3))  )
  num [1:3, 1:3] 5 5 5 5 5 5 5 5 5
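
A note on the all-5 result, with a hedged sketch. For 2-d tables,
cbind() binds them side by side (here into a 3 x 15 matrix) instead of
treating each one as a single column, so rowSums() returns only three
row totals and array() recycles them. If one wanted to keep the
rowSums() approach for 2-d input, flattening each table first and
re-attaching the attributes by hand seems to work. The definition of
a and the seed below are assumptions for illustration.

```r
set.seed(1)
a <- letters[1:3]  # assumed stand-in for the undefined 'a' in the post
L <- lapply(1:5, function(i) table(a, sample(a)))

# Reference result: elementwise sum, attributes carried along by "+"
by_reduce <- Reduce("+", L)

# What happens in the post: 3 x 15 matrix, three row totals, recycled
wrong <- array(rowSums(do.call(cbind, L)), dim = c(3, 3))

# Flatten each table so that every table becomes one column (9 x 5)
flat <- sapply(L, as.vector)
by_rowsums <- rowSums(flat)

# rowSums() returns a bare numeric vector, so restore dim, dimnames
# and class from one of the inputs
attributes(by_rowsums) <- attributes(L[[1]])

str(by_rowsums)
all.equal(by_reduce, by_rowsums)
```

Reduce() still needs no such bookkeeping, which is the point made
above.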


-- David.


> Cheers,
> Bert
>
> On Mon, Apr 16, 2012 at 1:19 PM, David A Vavra <davavra at verizon.net> wrote:
>> Thanks Bill,
>>
>> For reasons that aren't important here, I must start from a list.
>> Computing the sum while generating the tables may be a solution but it
>> means doing something in one piece of code that is unrelated to the
>> surrounding code. Bad practice where I'm from. If it's needed it's
>> needed but if I can avoid doing so, I will.
>>
>> I haven't done any timing but because of the extra operations of get
>> and assign, the non-loop implementation will likely suffer. It seems
>> you have shown this to be true.
>>
>> DAV
>>
>> -----Original Message-----
>> From: William Dunlap [mailto:wdunlap at tibco.com]
>> Sent: Monday, April 16, 2012 3:26 PM
>> To: David A Vavra; 'Bert Gunter'
>> Cc: r-help at r-project.org
>> Subject: RE: [R] Effeciently sum 3d table
>>
>>
>>> Example in partial code:
>>>
>>> Env <- CreatEnv() # my own function
>>> Assign('final',T1-T1,envir=env)
>>> L<-listOfTables
>>>
>>> lapply(L,function(t) {
>>>     final <- get('final',envir=env) + t
>>>     assign('final',final,envir=env)
>>>     NULL
>>> })
>>
>> First, finish writing that code so it runs and you can make sure its
>> output is ok:
>>
>> # list of 50,000 2x2 matrices
>> L <- lapply(1:50000, function(i) array(i:(i+3), c(2,2)))
>> env <- new.env()
>> assign('final', L[[1]] - L[[1]], envir=env)
>> junk <- lapply(L, function(t) {
>>     final <- get('final', envir=env) + t
>>     assign('final', final, envir=env)
>>     NULL
>> })
>> get('final', envir=env)
>> #            [,1]       [,2]
>> # [1,] 1250025000 1250125000
>> # [2,] 1250075000 1250175000
>> > sum( (2:50001) ) # should be final[2,1]
>> # [1] 1250075000
>>
>> You asked for something less "clunky".
>> You are fighting the system by using get() and assign(), just use
>> ordinary expression syntax to get and set variables:
>>
>> final <- L[[1]]
>> for(i in seq_along(L)[-1]) final <- final + L[[i]]
>> final
>> #           [,1]       [,2]
>> # [1,] 1250025000 1250125000
>> # [2,] 1250075000 1250175000
>>
>> The former took 0.22 seconds on my machine, the latter 0.06.
>>
>> You don't have to compute the whole list of matrices before
>> doing the sum, just add to the current sum when you have
>> computed one matrix and then forget about it.
>>
>> Bill Dunlap
>> Spotfire, TIBCO Software
>> wdunlap tibco.com
>>
>>> -----Original Message-----
>>> From: r-help-bounces at r-project.org
>>> [mailto:r-help-bounces at r-project.org] On Behalf Of David A Vavra
>>> Sent: Monday, April 16, 2012 11:35 AM
>>> To: 'Bert Gunter'
>>> Cc: r-help at r-project.org
>>> Subject: Re: [R] Effeciently sum 3d table
>>>
>>> Thanks Gunter,
>>>
>>> I mean what I think is the normal definition of 'sum' as in:
>>>    T1 + T2 + T3 + ...
>>> It never occurred to me that there would be a question.
>>>
>>> I have gotten the impression that a for loop is very inefficient.
>>> Whenever I change them to lapply calls there is a noticeable
>>> improvement in run time for whatever reason. The problem with lapply
>>> here is that I effectively need a global table to hold the final sum.
>>> lapply also wants to return a value.
>>>
>>> You may be correct that in the long run, the loop is the best. There's
>>> a lot of extraneous memory wastage holding all of the tables in a list
>>> as well as the return 'values'.
>>>
>>> As an alternate and given a pre-existing list of tables, I was thinking
>>> of creating a temporary environment to hold the final result so it
>>> could be passed globally to each lapply execution level but that seems
>>> clunky and wasteful as well.
>>>
>>> Example in partial code:
>>>
>>> Env <- CreatEnv() # my own function
>>> Assign('final',T1-T1,envir=env)
>>> L<-listOfTables
>>>
>>> lapply(L,function(t) {
>>>     final <- get('final',envir=env) + t
>>>     assign('final',final,envir=env)
>>>     NULL
>>> })
>>>
>>> But I was hoping for a more elegant and hopefully more efficient
>>> solution. Greg's suggestion for using reduce seems in order but as yet
>>> I'm unfamiliar with the function.
>>>
>>> DAV
>>>
>>> -----Original Message-----
>>> From: Bert Gunter [mailto:gunter.berton at gene.com]
>>> Sent: Monday, April 16, 2012 12:42 PM
>>> To: Greg Snow
>>> Cc: David A Vavra; r-help at r-project.org
>>> Subject: Re: [R] Effeciently sum 3d table
>>>
>>> Define "sum". Do you mean you want to get a single sum for each
>>> array? -- get marginal sums for each array? -- get a single array in
>>> which each value is the sum of all the individual values at the
>>> position?
>>>
>>> Due thought and consideration for those trying to help by formulating
>>> your query carefully and concisely vastly increases the chance of
>>> getting a useful answer. See the posting guide -- this is a skill that
>>> needs to be learned and the guide is quite helpful. And I must
>>> acknowledge that it is a skill that I also have not yet mastered.
>>>
>>> Concerning your query, I would only note that the two responses from
>>> Greg and Petr that you received are unlikely to be significantly
>>> faster than just using loops, since both are still essentially looping
>>> at the interpreted level. Whether either give you what you want, I do
>>> not know.
>>>
>>> -- Bert
>>>
>>> On Mon, Apr 16, 2012 at 8:53 AM, Greg Snow <538280 at gmail.com> wrote:
>>>> Look at the Reduce function.
>>>>
>>>> On Mon, Apr 16, 2012 at 8:28 AM, David A Vavra <davavra at verizon.net> wrote:
>>>>> I have a large number of 3d tables that I wish to sum
>>>>> Is there an efficient way to do this? Or perhaps a function I can call?
>>>>>
>>>>> I tried using do.call("sum",listoftables) but that returns a single
>>>>> value.
>>>>>
>>>>> So far, it seems only a loop will do the job.
>>>>>
>>>>> TIA,
>>>>> DAV
>>>
>>> --
>>> Bert Gunter
>>> Genentech Nonclinical Biostatistics
>>>
>>> Internal Contact Info:
>>> Phone: 467-7374
>>> Website:
>>> http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>

David Winsemius, MD
West Hartford, CT


