[R] generalization of tabulate()

Bert Gunter gunter.berton at gene.com
Fri Oct 16 18:24:44 CEST 2009


If I have correctly understood what Robin wants, I don't think Gabor's
elaborate solution is necessary (though I grant that it may be more
general). It's also slow due to the "apply's". A more straightforward and
faster approach is to convert the rows to individual character strings via
paste() and then use table() and match() to get your counts.

## Example data
tbl <- matrix(sample(0:9,24,replace=TRUE),ncol=3)  ## sample space (8 rows)
obs <- tbl[sample(1:6,12,replace=TRUE),]   ## sample, drawn from the first 6 rows only

## Now the code
## Use paste to convert each row into a character string
tblRow <- do.call(paste,c(data.frame(tbl),sep="."))
obsRow <-  do.call(paste,c(data.frame(obs),sep="."))

d <- rep(0,nrow(tbl)) ## initialize vector of counts
counts <- table(obsRow) ## Let (the fast) table() do the work
d[match(names(counts),tblRow)] <- counts ## vector of counts

## here are results from a sample run:

> tbl
     [,1] [,2] [,3]
[1,]    6    3    4
[2,]    0    6    0
[3,]    4    2    7
[4,]    0    3    3
[5,]    9    0    2
[6,]    7    8    9
[7,]    7    5    3
[8,]    7    1    8
> obs
      [,1] [,2] [,3]
 [1,]    9    0    2
 [2,]    0    3    3
 [3,]    6    3    4
 [4,]    9    0    2
 [5,]    7    8    9
 [6,]    0    6    0
 [7,]    4    2    7
 [8,]    0    6    0
 [9,]    9    0    2
[10,]    9    0    2
[11,]    7    8    9
[12,]    7    8    9
> d
[1] 1 2 1 1 4 3 0 0
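
As a rough sanity check (a sketch only, not part of the run shown above), the
paste/table/match counts can be compared with a brute-force row-by-row
comparison on a fresh random draw. The seed and the name dCheck are my own
choices for illustration. unique() is added on the assumption that the sample
space has no repeated rows, since match() reports only the first matching
position.

```r
set.seed(1)  ## arbitrary seed, used only to make the sketch reproducible
tbl <- matrix(sample(0:9, 24, replace = TRUE), ncol = 3)
tbl <- unique(tbl)  ## assumption: each sample-space row occurs only once
obs <- tbl[sample(seq_len(nrow(tbl)), 12, replace = TRUE), , drop = FALSE]

## Same approach as above: one character key per row
tblRow <- do.call(paste, c(data.frame(tbl), sep = "."))
obsRow <- do.call(paste, c(data.frame(obs), sep = "."))

d <- rep(0, nrow(tbl))
counts <- table(obsRow)
d[match(names(counts), tblRow)] <- counts

## Brute force: for each row of tbl, count the rows of obs equal to it
dCheck <- sapply(seq_len(nrow(tbl)), function(j)
  sum(apply(obs, 1, function(r) all(r == tbl[j, ]))))

stopifnot(all(d == dCheck))  ## the two methods should agree
```

The brute-force version costs nrow(tbl) * nrow(obs) comparisons, so it is
only useful as a check on small examples.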

There may well be more elegant ways to do this, too. 
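
For instance, here is a sketch of one variant applied to the data in Robin's
post: match() gives the row of the sample space for each observation, and
tabulate() then counts those row indices directly. The helper rowKey() is a
hypothetical name of mine, and S is rebuilt with expand.grid() (an assumption
made so the sketch does not need the partitions package) in the same row
order as printed in the original message. It assumes every observed row
appears in S.

```r
## Observations copied from Robin's post
obs <- matrix(as.integer(c(
              1, 4, 0,
              1, 4, 0,
              2, 0, 3,
              4, 1, 0,
              0, 5, 0,
              0, 1, 4,
              2, 0, 3
              )), ncol = 3, byrow = TRUE)

## Sample space: all non-negative integer triples summing to 5
## (stand-in for t(compositions(5,3)); y2 varies fastest, then y3)
g <- expand.grid(y2 = 0:5, y3 = 0:5)
g <- g[g$y2 + g$y3 <= 5, ]
S <- cbind(y1 = 5 - g$y2 - g$y3, y2 = g$y2, y3 = g$y3)

## Hypothetical helper: collapse each row of a matrix into one string
rowKey <- function(m) do.call(paste, c(as.data.frame(m), sep = "."))

## Row index in S of each observation, then count the indices
d <- tabulate(match(rowKey(obs), rowKey(S)), nbins = nrow(S))
desired <- cbind(S, d)
```

Here nbins = nrow(S) makes tabulate() return a count (possibly zero) for
every row of S, so no separate initialization of d is needed.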

Cheers,


Bert Gunter
Genentech Nonclinical Biostatistics 



-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
Behalf Of Gabor Grothendieck
Sent: Friday, October 16, 2009 4:27 AM
To: Robin Hankin
Cc: r-help at r-project.org
Subject: Re: [R] generalization of tabulate()

Using the generalized inner product defined in this post:

   https://www.stat.math.ethz.ch/pipermail/r-help/2006-July/109311.html

try this:

   cbind(S, d = rowSums(inner(S, obs, identical)))


On Fri, Oct 16, 2009 at 4:29 AM, Robin Hankin <rksh1 at cam.ac.uk> wrote:
> Hi
>
> I want a generalization of tabulate() which works on rows of a matrix.
> Suppose I have an integer matrix 'observation':
>
>> observation
>
> y1 y2 y3
> 1 4 0
> 1 4 0
> 2 0 3
> 4 1 0
> 0 5 0
> 0 1 4
> 2 0 3
>
> Each row corresponds to a (multivariate) observation.  Note that the
> first two rows are identical: this means that data "c(1,4,0)" was
> observed twice.
>
> Now suppose I can list the sample space:
>
>> S
> [1,] 5 0 0
> [2,] 4 1 0
> [3,] 3 2 0
> [4,] 2 3 0
> [5,] 1 4 0
> [6,] 0 5 0
> [7,] 4 0 1
> [8,] 3 1 1
> [9,] 2 2 1
> [10,] 1 3 1
> [11,] 0 4 1
> [12,] 3 0 2
> [13,] 2 1 2
> [14,] 1 2 2
> [15,] 0 3 2
> [16,] 2 0 3
> [17,] 1 1 3
> [18,] 0 2 3
> [19,] 1 0 4
> [20,] 0 1 4
> [21,] 0 0 5
>
> (thus each row corresponds to a point in my sample space).
>
> Now what I need to do is to construct a new matrix, which uses the
> 'observation' matrix above, which is a sort of table:
>
>> desired
>
>     y1 y2 y3 d
> [1,] 5 0 0 0
> [2,] 4 1 0 1
> [3,] 3 2 0 0
> [4,] 2 3 0 0
> [5,] 1 4 0 2
> [6,] 0 5 0 1
> [7,] 4 0 1 0
> [8,] 3 1 1 0
> [9,] 2 2 1 0
> [10,] 1 3 1 0
> [11,] 0 4 1 0
> [12,] 3 0 2 0
> [13,] 2 1 2 0
> [14,] 1 2 2 0
> [15,] 0 3 2 0
> [16,] 2 0 3 2
> [17,] 1 1 3 0
> [18,] 0 2 3 0
> [19,] 1 0 4 0
> [20,] 0 1 4 1
> [21,] 0 0 5 0
>
>
> Thus the 'd' column counts the number of times that each row occurs in
> variable 'observation'.  So desired[5,4]=2 because the observation
> corresponding to desired[5,1:3] (viz c(1,4,0)) occurred twice.  And
> desired[1,4]=0 because the observation corresponding to desired[1,1:3]
> (viz c(5,0,0)) did not occur once (it was not observed).
>
> In my application I have dim(S) ~= c(5,4e6).
>
> I've tried merge(), stack(),  reshape(), but the best I can do
> is the (derisory):
>
> require(partitions)
>
>
> obs <- matrix(as.integer(c(
>               1, 4, 0,
>               1, 4, 0,
>               2, 0, 3,
>               4, 1, 0,
>               0, 5, 0,
>               0, 1, 4,
>               2, 0, 3
>               )),ncol=3,byrow=TRUE)
>
> S <- t(compositions(5,3))
> d <- rep(0,nrow(S))
>
>
> for(i in seq_len(nrow(obs))){
>  for(j in seq_len(nrow(S))){
>   if(all(obs[i,,drop=TRUE] == S[j,,drop=TRUE])){
>     d[j] <- d[j]+1
>   }
>  }
> }
>
> S <- cbind(S,d)
>
>
> Anyone got anything better before I try C?
>
>
> --
> Robin K. S. Hankin
> Uncertainty Analyst
> University of Cambridge
> 19 Silver Street
> Cambridge CB3 9EP
> 01223-764877
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
