[R] select rows with identical columns from a data frame

David Winsemius dwinsemius at comcast.net
Sun Jan 20 19:37:04 CET 2013


On Jan 20, 2013, at 9:27 AM, David Winsemius wrote:

>
> On Jan 20, 2013, at 8:26 AM, Sam Steingold wrote:
>
>>> * Bert Gunter <thagre.oregba at trar.pbz> [2013-01-19 22:26:46 -0800]:
>>>
>>> But David W. and Bill Dunlap gave you solutions that also work and are
>>> much faster, no?!
>>
>> Yes, indeed, and I am now using David's solution as it is fast
>> (enough), simple and concise.
>
> I am a bit surprised by that. I do agree that it was simple and  
> concise, two programming virtues that I occasionally achieve.  
> However, when I tested it against either of Bill Dunlap's  
> suggestions mine was 15-40 times slower. (So I saved Bill's code and  
> made a mental note to study its superiority.) I could see why the  
> f2 version was superior, since it progressively shrank the index  
> candidates for further comparison, but his first function used no  
> such logic and was still 15 times faster.
>
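The "progressively shrinking" idea can be sketched as follows. This is only an illustration of the technique, not Bill Dunlap's actual f2 (his code is not reproduced in this thread); the function name `same_rows_shrinking` and the small test data frame are made up for the example. It assumes the columns are atomic vectors of a comparable type.

```r
# Hypothetical helper, NOT Bill Dunlap's f2: returns the indices of rows
# whose columns are all equal and non-NA. The candidate set is narrowed
# after every column, so later comparisons touch fewer rows.
same_rows_shrinking <- function(df) {
  first <- df[[1]]
  cand <- which(!is.na(first))        # rows still in contention
  for (j in seq_len(ncol(df))[-1]) {
    ok <- df[[j]][cand] == first[cand]  # compare surviving rows only
    cand <- cand[!is.na(ok) & ok]       # drop mismatches and NA results
    if (length(cand) == 0L) break       # nothing left to check
  }
  cand
}

# Made-up example data
df <- data.frame(a = c(1, 2, NA, 3, 1),
                 b = c(1, 2, 1, 2, 1),
                 c = c(1, 3, 1, 3, NA))
idx <- same_rows_shrinking(df)
df[idx, ]
```

How much the shrinking helps will depend on how quickly rows drop out; if most rows have identical columns, each pass still compares nearly every row.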
> My test included the creation of the smaller data.frame, which his  
> did not, but when I modified mine to return only the index vector,  
> computing that vector still consumed nearly all the time. I wondered  
> whether `which` was responsible, but it appears that the inner step,  
> x==x[[1]], was the culprit.
>
> > x <- data.frame(lapply(structure(1:10,names=letters[1:10]),
> +    function(i) sample(c(NA,1,1,1,2,2,2,3), replace=TRUE, size=1e6)))
>
> > system.time({ keep <- x[[1]] == x[[2]]
> +    for (i in seq_len(ncol(x))[-(1:2)]) {
> +        keep <- keep & x[[i - 1]] == x[[i]]
> +    }
> +    z2 <- !is.na(keep) & keep})
>   user  system elapsed
>  0.179   0.056   0.240
>
> > system.time({z <- rowSums(x==x[[1]]) })
>   user  system elapsed
>  3.535   0.535   4.067
>
> > system.time({z <- x==x[[1]] })
>   user  system elapsed
>  3.540   0.524   4.061
>
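For readers following the pairwise loop quoted above, here is a minimal sketch of the same logic on a four-row data frame, showing why the final `!is.na(keep) & keep` step is there. The data are made up for illustration; only the loop itself comes from the timing code.

```r
# Made-up data: row 1 has identical columns, row 3 contains an NA
x <- data.frame(a = c(1, 2, NA, 3),
                b = c(1, 2, 1, 2),
                c = c(1, 3, 1, 3))

# Compare each column with its left-hand neighbour, one vector at a time
keep <- x[[1]] == x[[2]]
for (i in seq_len(ncol(x))[-(1:2)]) {
  keep <- keep & x[[i - 1]] == x[[i]]
}
keep   # still NA for row 3, because NA == 1 is NA

# Treat NA as "not identical" so the result is a clean logical index
z2 <- !is.na(keep) & keep
x[z2, ]
```

Each comparison works on plain vectors pulled out with `[[`, so the data frame method for `==` is never involved.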

A further note: I was able to recover most of the timing efficiency by  
coercing the data frame to a matrix before the "==" operation:

 > system.time({z <- as.matrix(x)==x[[1]] })
    user  system elapsed
   0.181   0.140   0.320

So it's really the data.frame method for `==` (dispatched through  
`Ops.data.frame`) that is the resource hog.
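To turn the matrix comparison into a row selection, something along these lines should work. It is a sketch under the assumption that all columns are numeric (so `as.matrix` yields a numeric matrix); the data are a scaled-down version of the test case above, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed, only so the example is repeatable
x <- data.frame(lapply(structure(1:4, names = letters[1:4]),
                       function(i) sample(c(NA, 1, 1, 2), size = 20,
                                          replace = TRUE)))

m  <- as.matrix(x)        # one numeric matrix instead of a list of columns
eq <- m == x[[1]]         # the first column is recycled down every column
n_equal <- rowSums(eq)    # NA whenever a row contains a missing value

# Keep rows where every column matched the first and nothing was NA
same <- !is.na(n_equal) & n_equal == ncol(x)
x[same, ]
```

With mixed column types `as.matrix` would coerce everything to character, so the column-by-column loop is probably the safer choice there.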

-- 
David.
> -- 
> David
>
>
>
>>
>> Thanks a lot to David, Bill, Rui, and arun for their answers (to this
>> question, my many previous questions, and, I hope, my future  
>> questions
>> in advance)!
>>
>>> On Sat, Jan 19, 2013 at 9:41 PM, Sam Steingold <sds at gnu.org> wrote:
>>>>> * Rui Barradas <ehvconeenqnf at fncb.cg> [2013-01-18 21:02:20 +0000]:
>>>>>
>>>>> Try the following.
>>>>>
>>>>> complete.cases(f) & apply(f, 1, function(x) all(x == x[1]))
>>>>
>>>> thanks, this works, but is horribly slow (dim(f) is 766,950x2)
>>
> -- 
>
> David Winsemius, MD
> Alameda, CA, USA
>

David Winsemius, MD
Alameda, CA, USA


