[R] Creating a cross table out of a large dataset

Marc Schwartz marc_schwartz at comcast.net
Thu Jul 26 23:06:25 CEST 2007


On Thu, 2007-07-26 at 13:32 -0700, celine wrote:
> Dear all,
> 
> I want to make a cross table out of a data set which is 2 columns wide and
> more than 150000 rows long. When I use the table() function I get an error
> message
> 
> This is the code I have used:
> 
> >Dataset <- read.table("test.txt", header=TRUE, sep=",", na.strings="NA",
> dec=".", strip.white=TRUE) 
> 
> > .T <-table(Dataset$K1,Dataset$K2) 
> 
> This is the error message I have received
> 
> >Error in vector("integer", length) : vector size specified is too large 
> >In addition: Warning messages: 
> >1: NAs introduced by coercion 
> >2: NAs introduced by coercion 
> 
> Is it possible to make a cross table with the table() function on a large
> dataset or should I consider using another function? I have had a look at
> the ?table help file but I could not find any information on the size of the
> dataset.
> 
> Thanks very much in advance for any help:-)
> 
> Kind regards,
> Céline.

A wild guess here, but it sounds like your data do not consist of a
relatively small set of repeated discrete values.

If so, your cross-tabulation has a very large number of cells (the
number of unique values in K1 times the number of unique values in K2),
and that number exceeds the largest representable integer in R, which
is:

> .Machine$integer.max
[1] 2147483647

or

> 2^31 - 1
[1] 2147483647


An R table is a two- (or possibly higher-) dimensional array with an
additional class attribute.  A matrix is, in turn, a vector with a 'dim'
attribute.  A vector is indexed using integers, so its length, and hence
the number of cells in the table, is limited to the above number.
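
As a small illustration of that structure (the toy vectors 'a' and 'b'
below are made up for the example and are not taken from your data),
stripping the class from a table leaves a plain integer matrix whose
cell count is the product of its dimensions:

```r
# Toy data, invented for illustration only
a <- c("x", "y", "x", "z")
b <- c("p", "q", "p", "p")

tab <- table(a, b)

class(tab)            # "table"
dim(tab)              # 3 rows (x, y, z) by 2 columns (p, q)
length(tab)           # 6 cells, i.e. prod(dim(tab)), empty cells included

m <- unclass(tab)     # drop the class: a plain integer matrix remains
is.matrix(m)
typeof(m)             # "integer"
```

Note that the empty combinations (for example y with p) still occupy a
cell, which is why the size of the table depends on the number of
unique values and not on the number of rows.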

If the above assumptions are correct, I am struggling to think of a
scenario where the visual representation of a cross-tabulation of your
data will be of value, but that may be just due to a severe lack of sleep
of late.

You might want to run:

  length(unique(Dataset$K1))

and

  length(unique(Dataset$K2))

which will tell you how many unique values are in each of the two
vectors. That will begin to give you some idea as to what you are
dealing with.
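
To put those two counts in context, one rough check is to compare their
product, which is the number of cells table() would need, against the
integer limit, and to count how many combinations actually occur. This
is only a sketch: the small data frame 'toy' below is invented as a
stand-in for your Dataset, and only the column names K1 and K2 are
taken from your post.

```r
# Invented stand-in for Dataset; only the column names come from the post
toy <- data.frame(K1 = c(1, 2, 3, 1, 2),
                  K2 = c("a", "b", "c", "a", "c"))

n1 <- length(unique(toy$K1))
n2 <- length(unique(toy$K2))

# Number of cells in the full cross table; as.numeric() avoids integer overflow
cells <- as.numeric(n1) * as.numeric(n2)
cells > .Machine$integer.max      # TRUE would mean the table is too large

# Combinations that actually occur, which is at most nrow(toy)
observed <- nrow(unique(toy[, c("K1", "K2")]))

# Counts for the observed combinations only, as a one-dimensional table
pair_counts <- table(paste(toy$K1, toy$K2, sep = " / "))
```

For a sense of scale, two columns with 46341 unique values each would
already exceed the limit, since 46341^2 = 2147488281. Counting only the
observed pairs, as in the last step, can never produce more entries
than there are rows.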

HTH,

Marc Schwartz


