[R] pairwise cross tabulation tables

Charles C. Berry cberry at tajo.ucsd.edu
Thu Jan 10 02:45:54 CET 2008


On Wed, 9 Jan 2008, AndyZon wrote:

>
> Hi,
>
> I have a huge number of categorical variables, say at least 10000, and I put
> them into a matrix, each column is one variable. The question is: how can I
> make all of the pairwise cross tabulation tables efficiently? The
> straightforward solution is to use for-loops by looping two indexes on the
> table() function, but it was just too slow. Is there a more efficient way to
> do that? Any guidance will be greatly appreciated.

The cell totals are merely the crossproducts of a suitably constructed
binary (zero-one) matrix that encodes the categories. See
'?contr.treatment' if you cannot grok 'suitably constructed'.

If the categories are all dichotomies coded as 0:1, you can use

 	res <- crossprod( dat )

to find the totals for the (1,1) cells of every pairwise table.
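
As a minimal sketch of the dichotomous case (the toy matrix 'dat' and its
column names "a", "b", "c" are made up for illustration), the crossproduct
can be checked against table() for one pair:

```r
set.seed(1)
# Hypothetical toy data: 6 rows, 3 dichotomous variables coded 0/1
dat <- matrix(rbinom(18, 1, 0.5), nrow = 6,
              dimnames = list(NULL, c("a", "b", "c")))

# t(dat) %*% dat: res[i, j] counts rows where column i and column j are both 1
res <- crossprod(dat)

# Compare one off-diagonal entry with the (1,1) cell from table()
tab_ab <- table(factor(dat[, "a"], levels = 0:1),
                factor(dat[, "b"], levels = 0:1))
stopifnot(res["a", "b"] == tab_ab["1", "1"])
```

One matrix product replaces the double loop over table(), which is where
the speedup should come from.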

If you need the full tables, you can get them from the marginal totals 
using

 	diag( res )

to get the number in each '1' margin and

 	nrow(dat)

to get the table total, from which the number in each '0' margin is
obtained by subtracting the corresponding '1' margin.
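
The reconstruction from the margins can be sketched as follows. The helper
name 'full_table' and the toy data are assumptions for illustration only;
rows of the returned table refer to variable i and columns to variable j.

```r
# Hypothetical helper: rebuild the 2x2 table for columns i and j
# from the crossproduct matrix 'res' and the row count 'n' alone.
full_table <- function(res, n, i, j) {
  n11 <- res[i, j]            # both variables equal 1
  ni1 <- diag(res)[i]         # '1' margin of variable i
  nj1 <- diag(res)[j]         # '1' margin of variable j
  n10 <- ni1 - n11            # i is 1, j is 0
  n01 <- nj1 - n11            # i is 0, j is 1
  n00 <- n - ni1 - nj1 + n11  # both are 0
  matrix(c(n00, n10, n01, n11), nrow = 2,
         dimnames = list(c("0", "1"), c("0", "1")))
}

set.seed(2)
dat <- matrix(rbinom(40, 1, 0.4), nrow = 20)   # toy data: 20 rows, 2 variables
res <- crossprod(dat)
tab <- full_table(res, nrow(dat), 1, 2)

# Agrees with table() on the same pair
ref <- table(factor(dat[, 1], levels = 0:1),
             factor(dat[, 2], levels = 0:1))
stopifnot(all(as.vector(tab) == as.vector(ref)))
```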

With 10000 dichotomous variables, dat has 10000 columns and you will only
need 10000^2 numbers, or about 0.75 Gigabytes as 8-byte doubles, to store
'res'. It takes about 20 seconds to run 1000 rows on my MacBook. Of
course, 'res' is symmetric, so one triangle is redundant.

This approach generalizes to any number of categories. For each column
with more than two categories, you will need to do what
model.matrix(~factor( dat[,i] ) ) does by default ( using
'contr.treatment' ): construct zero-one codes for all but one (reference)
category.
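
A minimal sketch of that extension, assuming every column takes values in
1:3 (the toy data and the "v<i>_<level>" column names are made up here).
The intercept column that model.matrix() adds is dropped, so only the
zero-one codes remain:

```r
set.seed(3)
# Hypothetical toy data: 10 rows, 3 variables with categories 1, 2, 3
dat <- matrix(sample(1:3, 30, replace = TRUE), nrow = 10)

# Treatment-coded indicators for each column; level 1 is the reference
codes <- lapply(seq_len(ncol(dat)), function(i) {
  f <- factor(dat[, i], levels = 1:3)
  mm <- model.matrix(~ f)[, -1, drop = FALSE]   # drop the intercept column
  colnames(mm) <- paste0("v", i, "_", levels(f)[-1])
  mm
})
X <- do.call(cbind, codes)   # 10 x 6 zero-one matrix

# res["v1_2", "v2_3"] counts rows with variable 1 == 2 and variable 2 == 3
res <- crossprod(X)
stopifnot(res["v1_2", "v2_3"] == sum(dat[, 1] == 2 & dat[, 2] == 3))
```

Cells that involve a reference category are not stored; as in the
dichotomous case, they follow by subtraction from diag(res) and nrow(dat).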

Note that with 10000 trichotomies, you will have a result with

 	10000^2 * ( 3-1 )^2

numbers needing about 3 Gigabytes as 8-byte doubles, and so on.
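
The arithmetic behind these estimates can be sketched as below. The helper
name 'res_gib' is hypothetical, and the 8 bytes per entry is an assumption
(crossprod() returns doubles) that happens to reproduce the figures quoted
here:

```r
# Hypothetical helper: size of the full crossproduct matrix in GiB,
# assuming p variables with k categories each and 8-byte entries
res_gib <- function(p, k, bytes = 8) {
  ncol_x <- p * (k - 1)        # indicator columns after dropping the reference
  ncol_x^2 * bytes / 2^30
}

res_gib(10000, 2)   # dichotomies:  about 0.75
res_gib(10000, 3)   # trichotomies: about 2.98
```

Keeping only one triangle of the symmetric result would roughly halve
these numbers.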

HTH,

Chuck

p.s. Why on Earth are you doing this????


>
> Andy

Charles C. Berry                            (858) 534-2098
                                             Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901



