[R] help: program efficiency

Mike Marchywka marchywka at hotmail.com
Sat Nov 27 16:12:01 CET 2010




>
> So in this example, it seems more efficient to sort first and use the
> algorithm assuming that the data is sorted.
>
> There is probably a way to be smarter in nodup_cpp where the bottleneck
> is likely to be related to map::find.

If you just use a hash table (std::map should work too), I don't
see what there is to sort; see my earlier post. You do, however, need
to be careful about sum-of-pieces timing, especially if you ever end up
in VM. Memory coherence can be a big deal: removing a sort can slow
other things down later in some cases.
I hate to ask, but those variables foo[i] are not maps, are they?
If you care about efficiency you should be using arrays here; IIRC
map has to handle these as sparse arrays, and that slows things down.
However, if you made a map of prior occurrences of each value,
foo[v[i]], that may be faster than doing a sort; hard to say.
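
To make the "map of prior occurrences" idea concrete, here is a rough
sketch, not the nodup_cpp code from this thread. The function names
prior_occurrences_hash and prior_occurrences_array are made up for
illustration, and I am assuming integer data; the array version further
assumes every value lies in a known small range [0, max_value].

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical helper: for each position i, count how many earlier
// elements are equal to v[i]. Uses a hash table, so no sort is needed
// and any int value is allowed.
std::vector<int> prior_occurrences_hash(const std::vector<int>& v) {
    std::unordered_map<int, int> seen;  // value -> occurrences so far
    seen.reserve(v.size());
    std::vector<int> out;
    out.reserve(v.size());
    for (int x : v) {
        // operator[] inserts a zero count the first time x is seen;
        // the post-increment yields the count before this element.
        out.push_back(seen[x]++);
    }
    return out;
}

// Hypothetical helper: same result using a dense array of counters.
// ASSUMPTION: every value is in [0, max_value]; checked with assert.
std::vector<int> prior_occurrences_array(const std::vector<int>& v,
                                         int max_value) {
    assert(max_value >= 0);
    std::vector<int> seen(static_cast<std::size_t>(max_value) + 1, 0);
    std::vector<int> out;
    out.reserve(v.size());
    for (int x : v) {
        assert(x >= 0 && x <= max_value);
        out.push_back(seen[static_cast<std::size_t>(x)]++);
    }
    return out;
}
```

Both make a single pass over the input in its original order. Which one
beats a sort on real data is something to measure rather than assume;
the array version only makes sense when the value range is small.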

>
>
> Profiling reveals this:
>
>> Rprof()
>> for(i in 1:100) { res6 <- ( nodup_cpp_hybrid( x, sort.list(x) ) ) }
>> Rprof(NULL)
>> summaryRprof()
> $by.self
>               self.time self.pct total.time total.pct
> "sort.list"        6.50    90.03       6.50     90.03
> ".Call"            0.42     5.82       0.42      5.82
> "file.exists"      0.30     4.16       0.30      4.16
>
> $by.total
>                    total.time total.pct self.time self.pct
> "nodup_cpp_hybrid"       7.22    100.00      0.00     0.00
> "sort.list"              6.50     90.03      6.50    90.03
> ".Call"                  0.42      5.82      0.42     5.82
> "file.exists"            0.30      4.16      0.30     4.16
>
> $sample.interval
> [1] 0.02
>
> $sampling.time
> [1] 7.22
>
>
> The 4.16 % taken by file.exists indicates that someone in the inline
> project has to do some work (on my TODO list).

I've never used the R profiler, but according to the docs, on 'dohs
(Windows) this is wall-clock time. Time spent blocking on IO may dominate,
depending on how the filesystem works. I often point out that IO can
dominate things that everyone expects to be CPU bound; this often comes up
with cygwin, where you have another layer of stuff over the OS, but it can
happen anywhere.
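
A small sketch of the wall-clock versus CPU-time distinction, in case it
helps. The names Timing, time_both and simulate_blocking_io are my own
inventions, and the sleep is only a stand-in for a call that blocks on
IO. Note that std::clock is specified as processor time, but some
platforms report elapsed time from it, so treat the CPU figure as
indicative only.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>
#include <functional>
#include <thread>

// Hypothetical result type: elapsed (wall-clock) and CPU seconds.
struct Timing {
    double wall_seconds;
    double cpu_seconds;
};

// Run `work` once and measure it with both clocks.
Timing time_both(const std::function<void()>& work) {
    assert(static_cast<bool>(work));  // an empty std::function would throw
    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();
    work();
    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();

    Timing t;
    t.wall_seconds =
        std::chrono::duration<double>(wall_end - wall_start).count();
    t.cpu_seconds =
        static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    return t;
}

// Stand-in for blocking IO: the thread waits without using the CPU.
void simulate_blocking_io(std::chrono::milliseconds d) {
    std::this_thread::sleep_for(d);
}
```

Calling time_both([] { simulate_blocking_io(std::chrono::milliseconds(100)); })
should report a wall time of about a tenth of a second with very little
CPU time on a system where std::clock counts processor time. A profiler
that samples wall-clock time would charge that whole wait to the
function.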


>
> But otherwise sort.list dominates the time.

