[R] Optimize code to read text-file with digits

peter dalgaard pdalgd at gmail.com
Fri Sep 8 14:03:56 CEST 2017


Simplest version that I can think of:

x <- scan("~/Downloads/digits.txt")  # 20000 lines x 11 fields = 220000 numbers
x <- x[-seq(1,220000,11)]            # drop the line-index column (every 11th value)
length(x) # 200000
hist(x)
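
Note that this yields the 200000 five-digit groups as numbers (so leading zeros are lost), not the individual digits. If the single digits are wanted, as in the original question, something along these lines might do. It is only a sketch, shown on two of the sample lines quoted below rather than on the real file; the names txt, g and digits are just illustrative.

```r
# Two sample lines in the same layout as digits.txt (index column + 10 groups)
txt <- c("00000   10097 32533  76520 13586  34673 54876  80959 09117  39292 74945",
         "00001   37542 04805  64894 74296  24805 24037  20636 10402  00822 91665")

# Read the fields as character so that leading zeros survive
g <- scan(text = txt, what = "", quiet = TRUE)

# Drop the line-index column (every 11th field, starting with the first)
g <- g[-seq(1, length(g), 11)]

# Split all groups into single characters in one vectorised call
digits <- as.integer(unlist(strsplit(g, "")))
length(digits)
```

For the real file, replacing text = txt with the file name should give one million digits, assuming every line has the layout shown. Doing the split once on the whole vector avoids growing a vector inside a loop.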

Now, because it's Friday: 

How does one work out the theoretical distribution of the following table?

> table(table(factor(x,levels=0:99999)))

    0     1     2     3     4     5     6     7     8     9    10    11 
13497 27113 27010 18116  9122  3466  1186   366    99    22     1     1 
   12 
    1 

(I.e., out of 200000 random 5-digit numbers, 13497 of the 100000 possible values never occurred, 27113 occurred exactly once, and so on, up to 1 value that occurred 12 times.)
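
One way to get started, as a sketch rather than a full answer: assuming the numbers are independent and uniform, each of the 100000 possible values is hit by each of the 200000 draws with probability 1/100000, so its count is Binomial(200000, 1e-5), which is very nearly Poisson(2). The expected number of values occurring k times is then about 100000 * dpois(k, 2). This gives only the expected cell counts; the cells are dependent, so the joint distribution of the table is more involved.

```r
# Observed counts from the table above (values occurring 0, 1, ..., 12 times)
observed <- c(13497, 27113, 27010, 18116, 9122, 3466, 1186, 366, 99, 22, 1, 1, 1)
k <- 0:12

# Expected counts under the Poisson(2) approximation and under the exact binomial
expected_pois  <- 1e5 * dpois(k, lambda = 2)
expected_binom <- 1e5 * dbinom(k, size = 2e5, prob = 1e-5)

# Side-by-side comparison
round(cbind(k, observed, expected_pois, expected_binom), 1)
```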

-pd


> On 8 Sep 2017, at 10:48 , Martin Møller Skarbiniks Pedersen <traxplayer at gmail.com> wrote:
> 
> Hi,
> 
>  Every day I try to write some small R programs to improve my R skills.
>  Yesterday I wrote a small program to read the digits from "A Million
> Random Digits" from RAND.
>  My code works, but it is very slow, and I guess the code is not optimal.
> 
> The digits.txt file, downloaded from
> https://www.rand.org/pubs/monograph_reports/MR1418.html
> contains 20000 lines that look like this:
> 00000   10097 32533  76520 13586  34673 54876  80959 09117  39292 74945
> 00001   37542 04805  64894 74296  24805 24037  20636 10402  00822 91665
> 00002   08422 68953  19645 09303  23209 02560  15953 34764  35080 33606
> 00003   99019 02529  09376 70715  38311 31165  88676 74397  04436 27659
> 00004   12807 99970  80157 36147  64032 36653  98951 16877  12171 76833
> 
> My program, which is slow, looks like this:
> 
> filename <- "digits.txt"
> lines <- readLines(filename)
> 
> numbers <- vector('numeric')
> for (i in 1:length(lines)) {
> 
>    # remove first column
>    lines[i] <- sub("[^ ]+ +","",lines[i])
> 
>    # remove spaces
>    lines[i] <- gsub(" ","",lines[i])
> 
>    # split the characters and convert them into numbers
>    numbers <- c(numbers,as.numeric(unlist(strsplit(lines[i],""))))
> }
> 
> Thanks for any advice on how this program can be improved.
> 
> Regards
> Martin M. S. Pedersen

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com


