[R] Large Test Datasets in R

Joshua Wiley jwiley.psych at gmail.com
Mon Jun 25 05:45:15 CEST 2012


Hi Ravi,

My hunch would be "no", because it seems awfully inefficient.  Packages
are mirrored all over the world, and it seems rather silly to be
mirroring and updating large datasets along with them.

The good news is that if you just want a 10,000 x 100,000 matrix of
0/1s, it is trivial to generate:

X <- matrix(sample(0L:1L, 10^9, TRUE), nrow = 10^4)

Even stored as integers, this works out to roughly 4GB (10^9 elements
at 4 bytes each).  If you want arbitrary values to cut into 0/1 later:

X <- matrix(rnorm(10^9), nrow = 10^4)
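
For the cutting step, something along these lines should do (the 0
threshold is just for illustration; use whatever cutoff you like), plus
a quick way to check the in-memory size:

Xbin <- (X > 0) + 0L                    # logical comparison coerced back to an integer 0/1 matrix
print(object.size(Xbin), units = "Mb")  # confirm the memory footprint

The comparison keeps the dim attribute, so Xbin is still 10^4 x 10^5.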

Cheers,

Josh


On Sun, Jun 24, 2012 at 7:08 AM, vioravis <vioravis at gmail.com> wrote:
> I am looking for some large datasets (10,000 rows & 100,000 columns or vice
> versa) to create some test sets.  I am not concerned about the individual
> elements since I will be converting them to binary (0/1) by using arbitrary
> thresholds.
>
> Does any R package provide such big datasets?
>
> Also, what is the biggest text document collection available in R? The tm
> package seems to provide only 20 records from the Reuters dataset. Is there
> any package that has 10,000+ documents?
>
> Would appreciate any help on these.
>
> Thank you.
>
> Ravi
>
> --
> View this message in context: http://r.789695.n4.nabble.com/Large-Test-Datasets-in-R-tp4634330.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



-- 
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/


