[R] generalization of tabulate()

Sat Oct 17 18:00:50 CEST 2009

On 16-Oct-09 11:27:06, Gabor Grothendieck wrote:
> Using the generalized inner product defined in this post:
> 
>    https://www.stat.math.ethz.ch/pipermail/r-help/2006-July/109311.html
> 
> try this:
> 
>    cbind(S, d = rowSums(inner(S, obs, identical)))

The function inner() is defined at the above URL as:

# generalized crossproduct
inner <- function(a,b=a,f=crossprod)
        apply(b,2,function(x)apply(a,2,function(y)f(x,y)))

In order for this approach to work with Robin's 'observation' = 'obs'
and 'S' matrices as illustrated by Gabor above, the function inner()
should instead be defined as

# generalized crossproduct
inner <- function(a,b=a,f=crossprod)
        apply(b,1,function(x)apply(a,1,function(y)f(x,y)))

i.e. it crosses all rows (dim=1) of 'obs' with all rows (dim=1) of S;
using "2" instead of "1" does it by columns. Indeed, a possibly useful
more general definition might be

# generalized crossproduct
inner <- function(a,b=a,f=crossprod,dim1=1,dim2=1)
        apply(b,dim2,function(x)apply(a,dim1,function(y)f(x,y)))

allowing either dimension of a to be crossed with either dimension of b.

Ted.

> On Fri, Oct 16, 2009 at 4:29 AM, Robin Hankin <rksh1 at cam.ac.uk> wrote:
>> Hi
>>
>> I want a generalization of tabulate() which works on rows of a matrix.
>> Suppose I have an integer matrix 'observation':
>>
>>> observation
>>
>> y1 y2 y3
>> 1 4 0
>> 1 4 0
>> 2 0 3
>> 4 1 0
>> 0 5 0
>> 0 1 4
>> 2 0 3
>>
>> Each row corresponds to a (multivariate) observation. _Note that the
>> first two rows are identical: this means that data "c(1,4,0)" was
>> observed twice.
>>
>> Now suppose I can list the sample space:
>>
>>> S
>> _ _ _ _ [1,] 5 0 0
>> [2,] 4 1 0
>> [3,] 3 2 0
>> [4,] 2 3 0
>> [5,] 1 4 0
>> [6,] 0 5 0
>> [7,] 4 0 1
>> [8,] 3 1 1
>> [9,] 2 2 1
>> [10,] 1 3 1
>> [11,] 0 4 1
>> [12,] 3 0 2
>> [13,] 2 1 2
>> [14,] 1 2 2
>> [15,] 0 3 2
>> [16,] 2 0 3
>> [17,] 1 1 3
>> [18,] 0 2 3
>> [19,] 1 0 4
>> [20,] 0 1 4
>> [21,] 0 0 5
>>
>> (thus each row corresponds to a point in my sample space).
>>
>> Now what I need to do is to construct a new matrix, which uses the
>> 'observation' matrix above, which is a sort of table:
>>
>>> desired
>>
>> _ _ y1 y2 y3 d
>> [1,] 5 0 0 0
>> [2,] 4 1 0 1
>> [3,] 3 2 0 0
>> [4,] 2 3 0 0
>> [5,] 1 4 0 2
>> [6,] 0 5 0 1
>> [7,] 4 0 1 0
>> [8,] 3 1 1 0
>> [9,] 2 2 1 0
>> [10,] 1 3 1 0
>> [11,] 0 4 1 0
>> [12,] 3 0 2 0
>> [13,] 2 1 2 0
>> [14,] 1 2 2 0
>> [15,] 0 3 2 0
>> [16,] 2 0 3 2
>> [17,] 1 1 3 0
>> [18,] 0 2 3 0
>> [19,] 1 0 4 0
>> [20,] 0 1 4 1
>> [21,] 0 0 5 0
>>
>>
>> Thus the 'd' column counts the number of times that each row occurs in
>> variable 'observation'. _So desired[5,4]=2 because the observation
>> corresponding to desired[5,1:3] (viz c(1,4,0)) occurred twice. _And
>> desired[1,4]=0 because the observation corresponding to desired[1,1:3]
>> (viz c(5,0,0)) did not occur once (it was not observed).
>>
>> In my application I have dim(S) ~= c(5,4e6).
>>
>> I've tried merge(), stack(), _reshape(), but the best I can do
>> is the (derisory):
>>
>> require(partitions)
>>
>>
>> obs <- matrix(as.integer(c(
>> _ _ _ _ _ _ _ 1, 4, 0,
>> _ _ _ _ _ _ _ 1, 4, 0,
>> _ _ _ _ _ _ _ 2, 0, 3,
>> _ _ _ _ _ _ _ 4, 1, 0,
>> _ _ _ _ _ _ _ 0, 5, 0,
>> _ _ _ _ _ _ _ 0, 1, 4,
>> _ _ _ _ _ _ _ 2, 0, 3
>> _ _ _ _ _ _ _ )),ncol=3,byrow=TRUE)
>>
>> S <- t(compositions(5,3))
>> d <- rep(0,nrow(S))
>>
>>
>> for(i in seq_len(nrow(obs))){
>> _for(j in seq_len(nrow(S))){
>> _ if(all(obs[i,,drop=TRUE] == S[j,,drop=TRUE])){
>> _ _ d[j] <- d[j]+1
>> _ }
>> _}
>> }
>>
>> S <- cbind(S,d)
>>
>>
>> Anyone got anything better before I try C?
>>
>>
>> --
>> Robin K. S. Hankin
>> Uncertainty Analyst
>> University of Cambridge
>> 19 Silver Street
>> Cambridge CB3 9EP
>> 01223-764877
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 17-Oct-09                                       Time: 17:00:48
------------------------------ XFMail ------------------------------