[R] uniq -c

Duncan Murdoch murdoch.duncan at gmail.com
Tue Oct 16 18:47:36 CEST 2012


On 16/10/2012 12:29 PM, Sam Steingold wrote:
> > * R. Michael Weylandt <zvpunry.jrlynaqg at tznvy.pbz> [2012-10-16 16:19:27 +0100]:
> >
> > Have you looked at using table() directly? If I understand what you
> > want correctly something like:
> >
> > table(do.call(paste, x))
>
> I wished to avoid paste (I will have to re-split later, so it will be a
> performance nightmare).
>
> > Also, if you take a look at the development version of R, changes are
> > being put in place to allow much larger data sets.
> >>
> >> xtabs(), although dog slow, would have fit the bill nicely:
> >> --8<---------------cut here---------------start------------->8---
> >>> x <- data.frame(a=1:32,b=1:32,c=1:32,d=1:32,e=1:32)
> >>> system.time(subset(as.data.frame(xtabs( ~. , x )), Freq != 0 ))
> >>    user  system elapsed
> >>  12.788   4.288  17.224
> >> --8<---------------cut here---------------end--------------->8---
>
> you should not need "much larger data sets" for this.
> x is sorted.

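The problem is that xtabs(), by() and related functions are designed
for the case where all combinations of all factor levels exist, so they
build a cell for every combination (32^5, about 33.5 million, in your
example) even though only 32 of them occur. If you have a dataset where
only a few combinations exist, you could use sparseby() from the
reshape package.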

Syntax would be

sparseby(data=x, INDICES=x, FUN=nrow)

if you wanted a dataframe giving counts.
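
For comparison, here is a minimal base-R sketch of the same idea (count
only the combinations that actually occur, without paste). It is not
sparseby() itself; it assumes aggregate() with a helper column of ones
is acceptable, and the data frame and the column name "n" are made up
for illustration.

```r
# Hypothetical example data: a few distinct rows, some of them repeated.
x <- data.frame(a = c(1, 1, 2, 2, 2),
                b = c("u", "u", "v", "v", "w"),
                stringsAsFactors = FALSE)

# Append a column of ones (assumed name "n") and sum it within each
# observed combination of the original columns. Combinations that never
# occur in x produce no row, so no full cross-tabulation is built.
counts <- aggregate(n ~ ., data = cbind(x, n = 1L), FUN = sum)
print(counts)
```

The result keeps the original columns separate, so nothing has to be
re-split afterwards; the counts always sum to nrow(x).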

I just tried it, and on your two examples it gives a warning about 
coercing a list to a logical vector; I guess all(list(TRUE, TRUE)) was 
allowed when I wrote it, but isn't any more.  I'll send a patch to the 
maintainer.

Duncan Murdoch



