[R] Merge sort

Wed Apr 20 14:02:53 CEST 2016

On 20/04/2016 7:38 AM, Gaston wrote:
> I indeed used is.na() to check length, as I was not sure weather
> lenght() was a simple query or would go through the whole vector to
> count the elements.

length() is a simple query, and is very fast.  The other problem in your 
approach (which may not be a problem with your current data) is that NA 
is commonly used as an element of a vector to represent a missing value.

>
> So to sum up, function calls are expensive, therefore recursion should
> be avoided, and growing the size of a vector (which is probably
> reassigning and copying?) is also expensive.

"Avoided" may be too strong:  speed isn't always a concern, sometimes 
clarity is more important.  Growing vectors is definitely expensive.

Duncan Murdoch

>
> Thank you for your help!
>
>
>
> On 04/19/2016 11:51 PM, Duncan Murdoch wrote:
>> On 19/04/2016 3:39 PM, Gaston wrote:
>>> Hello everyone,
>>>
>>> I am learning R since recently, and as a small exercise I wanted to
>>> write a recursive mergesort. I was extremely surprised to discover that
>>> my sorting, although operational, is deeply inefficient in time. Here is
>>> my code :
>>>
>>>> merge <- function(x,y){
>>>>     if (is.na(x[1])) return(y)
>>>>     else if (is.na(y[1])) return(x)
>>>>     else if (x[1]<y[1]) return(c(x[1],merge(x[-1],y)))
>>>>     else return(c(y[1],merge(x,y[-1])))
>>>> }
>>>>
>>>> division <- function(x){
>>>>     if (is.na(x[3])) return(cbind(x[1],x[2]))
>>>>     else
>>>> return(cbind(c(x[1],division(x[-c(1,2)])[,1]),c(x[2],division(x[-c(1,2)])[,2])))
>>>>
>>>> }
>>>>
>>>> mergesort <- function(x){
>>>>     if (is.na(x[2])) return(x)
>>>>     else{
>>>>       print(x)
>>>>       t=division(x)
>>>>       return(merge(mergesort(t[,1]),mergesort(t[,2])))
>>>>     }
>>>> }
>>>
>>> I tried my best to write it "the R-way", but apparently I failed. I
>>> suppose some of the functions I used are quite heavy. I would be
>>> grateful if you could give a hint on how to change that!
>>>
>>> I hope I made myself clear and wish you a nice day,
>>
>> Your use of is.na() looks strange.  I don't understand why you are
>> testing element 2 in mergesort(), and element 1 in merge(), and
>> element 3 in division.  Are you using it to test the length?  It's
>> better to use the length() function for that.
>>
>> The division() function returns a matrix.  It would make more R-sense
>> to return a list containing the two parts, because they might not be
>> the same length.
>>
>> Generally speaking, function calls are expensive in R, so the
>> recursive merge you're using looks like it would be the bottleneck.
>> You'd almost certainly be better off to allocate something of
>> length(x) + length(y), and do the assignments in a loop.
>>
>> Here's a merge sort I wrote as an illustration in a class.  It's
>> designed for clarity rather than speed, but I'd guess it would be
>> faster than yours:
>>
>> mergesort <- function(x) {
>>
>>    n <- length(x)
>>    if (n < 2) return(x)
>>
>>    # split x into two pieces of approximately equal size, x1 and x2
>>
>>    x1 <- x[1:(n %/% 2)]
>>    x2 <- x[(n %/% 2 + 1):n]
>>
>>    # sort each of the pieces
>>    x1 <- mergesort(x1)
>>    x2 <- mergesort(x2)
>>
>>    # merge them back together
>>    result <- c()
>>    i <- 0
>>    while (length(x1) > 0 && length(x2) > 0) {
>>      # compare the first values
>>      if (x1[1] < x2[1]) {
>>        result[i + 1] <- x1[1]
>>        x1 <- x1[-1]
>>      } else {
>>        result[i + 1] <- x2[1]
>>        x2 <- x2[-1]
>>      }
>>      i <- i + 1
>>    }
>>
>>    # put the smaller one into the result
>>    # delete it from whichever vector it came from
>>    # repeat until one of x1 or x2 is empty
>>    # copy both vectors (one is empty!) onto the end of the results
>>    result <- c(result, x1, x2)
>>    result
>> }
>>
>> If I were going for speed, I wouldn't modify the x1 and x2 vectors,
>> and I'd pre-allocate result to the appropriate length, rather than
>> growing it in the while loop.  But that was a different class!
>>
>> Duncan Murdoch
>