[R] string-to-number

Mike Nielsen mr.blacksheep at gmail.com
Mon Aug 21 16:16:06 CEST 2006


Marc,

Thanks very much for this.  I hadn't really looked at Rprof in the
past; now I have a new toy to play with!

I have formulated a hypothesis that the reason parse/eval is quicker
lies in the pattern-matching code:  strsplit() uses regular
expressions, whereas parse() presumably uses a more specialized (but
less general) tokenizer.  It will be interesting to inspect the
source code to get to the bottom of it.
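
In the meantime, a quick probe of that hypothesis (just a sketch;
timings will vary by machine and R version) is to bypass the regular
expression engine with strsplit()'s fixed = TRUE argument and see how
much of the gap closes:

repeated.measures.columns <- paste(1:100000, collapse = ",")

# regular expression path
system.time(res.re <- as.numeric(unlist(
    strsplit(repeated.measures.columns, ","))))

# literal-match path, skipping the regex engine
system.time(res.fx <- as.numeric(unlist(
    strsplit(repeated.measures.columns, ",", fixed = TRUE))))

identical(res.re, res.fx)

If the fixed = TRUE version closes most of the gap, that would point
to the pattern matching, rather than the splitting itself, as the
main cost.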

Thanks again for your interest and efforts in this, and for pointing out Rprof!

Regards,

Mike Nielsen

On 8/20/06, Marc Schwartz <MSchwartz at mn.rr.com> wrote:
> On Sat, 2006-08-19 at 10:25 -0600, Mike Nielsen wrote:
> > Wow.  New respect for parse/eval.
> >
> > Do you think this is a special case of a more general principle?  I
> > suppose the cost is memory, but from time to time a speedup like this
> > would be very beneficial.
> >
> > Any hints about how R programmers could recognize such cases would, I
> > am sure, be of value to the list in general.
> >
> > Many thanks for your efforts, Marc!
>
> Mike,
>
> I think that one needs to consider where the time is being spent and
> then adjust accordingly. Once you understand that, you can develop some
> insight into what may be a more efficient approach. R provides good
> profiling tools that facilitate this process.
>
> In this case, almost all of the time in the first two examples using
> strsplit() is spent in that function:
>
> > repeated.measures.columns <- paste(1:100000, collapse = ",")
>
> > library(utils)
> > Rprof(tmp <- tempfile())
> > res1 <- as.numeric(unlist(strsplit(repeated.measures.columns, ",")))
> > Rprof()
>
> > summaryRprof(tmp)
> $by.self
>                     self.time self.pct total.time total.pct
> "strsplit"              23.68     99.7      23.68      99.7
> "as.double.default"      0.06      0.3       0.06       0.3
> "as.numeric"             0.00      0.0      23.74     100.0
> "unlist"                 0.00      0.0      23.68      99.7
>
> $by.total
>                     total.time total.pct self.time self.pct
> "as.numeric"             23.74     100.0      0.00      0.0
> "strsplit"               23.68      99.7     23.68     99.7
> "unlist"                 23.68      99.7      0.00      0.0
> "as.double.default"       0.06       0.3      0.06      0.3
>
> $sampling.time
> [1] 23.74
>
>
> Contrast that with Prof. Ripley's approach:
>
> > Rprof(tmp <- tempfile())
> > res3 <- eval(parse(text=paste("c(", repeated.measures.columns, ")")))
> > Rprof()
>
> > summaryRprof(tmp)
> $by.self
>         self.time self.pct total.time total.pct
> "parse"      0.42     87.5       0.42      87.5
> "eval"       0.06     12.5       0.48     100.0
>
> $by.total
>         total.time total.pct self.time self.pct
> "eval"        0.48     100.0      0.06     12.5
> "parse"       0.42      87.5      0.42     87.5
>
> $sampling.time
> [1] 0.48
>
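> A third route worth profiling (a sketch only; res4 and the timing
> are untested here) is scan(), which avoids both the regex engine and
> the R parser by reading the numbers directly from a connection:
>
> con <- textConnection(repeated.measures.columns)
> res4 <- scan(con, sep = ",", quiet = TRUE)  # reads numerics directly
> close(con)
>
> all(res4 == res1)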
>
> To some extent, one could argue that my initial timing examples are
> contrived, in that they specifically demonstrate a worst-case scenario
> for strsplit().  Real-world examples may or may not show such gains.
>
> For example, in Charles' initial query, the vector was rather
> short:
>
>   > repeated.measures.columns
>   [1] "3,6,10"
>
> So if this was a one-time conversion, we would not see such significant
> gains.
>
> However, what if we had a long list of shorter entries:
>
> > repeated.measures.columns <- paste(1:10, collapse = ",")
> > repeated.measures.columns
> [1] "1,2,3,4,5,6,7,8,9,10"
>
> > big.list <- replicate(10000, list(repeated.measures.columns))
>
> > head(big.list)
> [[1]]
> [1] "1,2,3,4,5,6,7,8,9,10"
>
> [[2]]
> [1] "1,2,3,4,5,6,7,8,9,10"
>
> [[3]]
> [1] "1,2,3,4,5,6,7,8,9,10"
>
> [[4]]
> [1] "1,2,3,4,5,6,7,8,9,10"
>
> [[5]]
> [1] "1,2,3,4,5,6,7,8,9,10"
>
> [[6]]
> [1] "1,2,3,4,5,6,7,8,9,10"
>
>
> > system.time(res1 <- t(sapply(big.list, function(x)
> as.numeric(unlist(strsplit(x, ","))))))
> [1] 1.972 0.044 2.411 0.000 0.000
>
> > str(res1)
>  num [1:10000, 1:10] 1 1 1 1 1 1 1 1 1 1 ...
>
> > head(res1)
>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
> [1,]    1    2    3    4    5    6    7    8    9    10
> [2,]    1    2    3    4    5    6    7    8    9    10
> [3,]    1    2    3    4    5    6    7    8    9    10
> [4,]    1    2    3    4    5    6    7    8    9    10
> [5,]    1    2    3    4    5    6    7    8    9    10
> [6,]    1    2    3    4    5    6    7    8    9    10
>
>
>
> Now use Prof. Ripley's approach:
>
> > system.time(res3 <- t(sapply(big.list, function(x)
> eval(parse(text=paste("c(", x, ")"))))))
> [1] 1.676 0.012 1.877 0.000 0.000
>
> > str(res3)
>  num [1:10000, 1:10] 1 1 1 1 1 1 1 1 1 1 ...
>
> > head(res3)
>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
> [1,]    1    2    3    4    5    6    7    8    9    10
> [2,]    1    2    3    4    5    6    7    8    9    10
> [3,]    1    2    3    4    5    6    7    8    9    10
> [4,]    1    2    3    4    5    6    7    8    9    10
> [5,]    1    2    3    4    5    6    7    8    9    10
> [6,]    1    2    3    4    5    6    7    8    9    10
>
>
>
> > all(res1 == res3)
> [1] TRUE
>
>
> Relative to the single long vector above, we see a notable reduction
> in time with strsplit(), but a notable increase using eval(parse()),
> even though we are converting the same net number of values (100,000).
>
> Much of the increase with eval(parse()) is of course due to the overhead
> of sapply() and navigating the list.
>
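> One way to amortize that per-element overhead (just a sketch, untimed
> here) is to paste the whole list into a single expression, so that
> parse() is called only once:
>
> txt <- paste("list(",
>              paste("c(", unlist(big.list), ")", collapse = ", "),
>              ")")
> res <- do.call(rbind, eval(parse(text = txt)))
>
> This parses one long string instead of 10000 short ones, paying the
> fixed cost of parse() a single time.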
>
> Let's increase the size of the list components to 1000:
>
> > repeated.measures.columns <- paste(1:1000, collapse = ",")
> > big.list <- replicate(10000, list(repeated.measures.columns))
>
> > system.time(res1 <- t(sapply(big.list, function(x)
> as.numeric(unlist(strsplit(x, ","))))))
> [1] 33.270  0.744 37.163  0.000  0.000
>
> > system.time(res3 <- t(sapply(big.list, function(x)
> eval(parse(text=paste("c(", x, ")"))))))
> [1] 15.893  0.928 18.139  0.000  0.000
>
>
> So we see here that as the size of the list components increases, there
> continues to be an advantage to Prof. Ripley's approach over using
> strsplit().
>
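> The scan() idea above extends to the whole list as well, avoiding
> both sapply() and parse() entirely (again a sketch; each list element
> becomes one line of the connection's input):
>
> con <- textConnection(unlist(big.list))
> res5 <- matrix(scan(con, sep = ",", quiet = TRUE),
>                nrow = length(big.list), byrow = TRUE)
> close(con)
>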
> Again, one needs to develop an understanding of where the time is
> spent by profiling, and then consider how to introduce efficiencies;
> in some cases, if the times remain too long, that may mean moving to
> compiled C or FORTRAN code.
>
> HTH,
>
> Marc Schwartz
>
>
>


-- 
Regards,

Mike Nielsen


