[R] Sorting and subsetting

Matthew Dowle mdowle at mdowle.plus.com
Tue Sep 21 18:43:49 CEST 2010


See data.table:::duplist which does that (or at least very similar) in C,
for multiple columns too.
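
The idea can be sketched in base R without touching data.table internals. This is only an illustration, assuming the data are already sorted by the key columns; the data frame `df` and its columns are made up, and this is not duplist's implementation or interface:

```r
# Hypothetical data: two key columns (a, b), already sorted by key
df <- data.frame(a = c(1, 1, 1, 1, 2, 2, 2),
                 b = c("x", "x", "x", "y", "y", "y", "y"),
                 v = 1:7)

keys <- df[c("a", "b")]
n <- nrow(keys)

# A row starts a new group when any key column differs from the previous row
changed <- Reduce(`|`, lapply(keys, function(col) col[-1] != col[-n]))
starts <- which(c(TRUE, changed))

# Position of each row within its group, counting from 1
pos <- seq_len(n) - rep(starts, diff(c(starts, n + 1))) + 1

# First two rows of each (a, b) group
first2 <- df[pos <= 2, ]
```

Here `starts` comes out as 1, 4, 5 and `first2` keeps rows 1, 2, 4, 5 and 6.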

Matthew
http://datatable.r-forge.r-project.org/


"peter dalgaard" <pdalgd at gmail.com> wrote in message 
news:660991C3-B52B-4D58-B819-EADC95ECCD88 at gmail.com...
>
> On Sep 21, 2010, at 16:27 , Joshua Wiley wrote:
>
>> On Tue, Sep 21, 2010 at 3:09 AM, Matthew Dowle <mdowle at mdowle.plus.com> 
>> wrote:
>>>
>>>
>>> All the solutions in this thread so far use the lapply(split(...))
>>> paradigm either directly or indirectly. That paradigm doesn't scale.
>>> That's the likely source of quite a few 'out of memory' errors and
>>> performance issues in R.
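
For reference, the idiom in question looks roughly like the sketch below. It is a generic illustration, not code from any earlier post in the thread; the seed and the name `pieces` are arbitrary. Note that split() materialises one data frame per group before the function is applied:

```r
set.seed(1)  # arbitrary seed, only so the example is reproducible
tmp <- data.frame(index = gl(2, 20), foo = rnorm(40))

# One data frame per level of index
pieces <- split(tmp, tmp$index)

# Top five foo values within each piece, then glue the pieces back together
top5 <- lapply(pieces, function(d) d[head(order(-d$foo), 5), ])
res <- do.call(rbind, top5)
```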
>>
>> This is a good point.  It is not nearly as straightforward as the
>> syntax for data.table (which seems to order and select in one
>> step...very nice!), but this should be less memory intensive:
>>
>> tmp <- data.frame(index = gl(2,20), foo = rnorm(40))
>> tmp <- tmp[order(tmp$index, tmp$foo) , ]
>>
>> # find location of first instance of each level and add 0:4 to it
>> x <- sapply(match(levels(tmp$index), tmp$index), `+`, 0:4)
>>
>> tmp[x, ]
>>
>
> That will get you in trouble if any group has size less than 5, though.
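
A small made-up example shows the problem: with a first group of only three rows, the index matrix runs past the end of that group and picks up rows belonging to the next one.

```r
# Hypothetical data: group 1 has 3 rows, group 2 has 6
tmp <- data.frame(index = factor(c(1, 1, 1, 2, 2, 2, 2, 2, 2)),
                  foo   = c(3, 1, 2, 9, 8, 7, 6, 5, 4))
tmp <- tmp[order(tmp$index, tmp$foo), ]

# Same indexing step as in the quoted code
x <- sapply(match(levels(tmp$index), tmp$index), `+`, 0:4)

picked <- tmp[x, ]
table(picked$index)  # group 1 contributes 3 rows, group 2 contributes 7
```

Rows 4 and 5 are selected twice, once for each column of `x`.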
>
> Something involving duplicated() could work; you "just" need to generate
> the sawtooth sequence: 0,1,2,3,4,0,1,2,3,4,5,6,0,1,2,... and select values
> less than or equal to 4. I _think_ this should work (it does on the
> airquality dataframe, anyway):
>
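> ix <- tmp$index                  # grouping variable; tmp must be sorted by it
>
> s <- seq_along(ix)               # row positions 1..n
> j <- diff(s[!duplicated(ix)])    # gaps between successive group starts
> s2 <- rep.int(0, length(s))
> s2[!duplicated(ix)] <- c(1,j)    # increments placed at each group start
> d <- s - cumsum(s2)              # 0-based position within each group
>
> tmp[d < 5,]                      # first five rows of each group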
>
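
Tracing the intermediate values on a short made-up vector may make the sawtooth easier to follow. The groups here have sizes 3, 2 and 4, so none of them is padded out from a neighbour:

```r
# Hypothetical sorted grouping vector
ix <- c("a", "a", "a", "b", "b", "c", "c", "c", "c")

s  <- seq_along(ix)                # 1 2 3 4 5 6 7 8 9
first <- !duplicated(ix)           # TRUE at positions 1, 4, 6
j  <- diff(s[first])               # 3 2
s2 <- rep.int(0, length(s))
s2[first] <- c(1, j)               # 1 0 0 3 0 2 0 0 0
start <- cumsum(s2)                # 1 1 1 4 4 6 6 6 6 (start row of each group)
d  <- s - start                    # 0 1 2 0 1 0 1 2 3

keep <- which(d < 2)               # first two rows of every group
```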
> Or, another version of the same idea, giving "teeth" starting at 1 instead
> of 0:
>
> d <- s - c(0,cumsum(table(ix)))[factor(ix)]
> tmp[d <= 5, ]
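
A quick check of this second form on made-up data, alongside a run-length variant. The sequence() over rle() version is a swapped-in base R idiom, not something from the post. Both assume `ix` is sorted so that each group is one contiguous run, in level order:

```r
# Hypothetical sorted grouping factor
ix <- factor(c("a", "a", "a", "b", "b", "c", "c", "c", "c"))
s  <- seq_along(ix)

# Form from the post: subtract the cumulative size of all earlier groups
d1 <- s - c(0, cumsum(table(ix)))[factor(ix)]

# Run-length variant: 1..k for each run of length k
# (rle() needs a plain vector, hence as.integer on the factor)
d2 <- sequence(rle(as.integer(ix))$lengths)

keep <- which(d1 <= 2)             # first two rows of every group
```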
>
>
>
> (There are times when I contemplate writing a DATAstep() function; this is
> one of those things that are straightforward in the SAS sequential
> processing paradigm. Of course there are things that are much more
> complicated in SAS, too.)
>
>
>>>
>>> data.table doesn't do that internally, and its syntax is pretty easy.
>>>
>>>> tmp <- data.table(index = gl(2,20), foo = rnorm(40))
>>>
>>>> tmp[, .SD[head(order(-foo),5)], by=index]
>>>      index index.1       foo
>>>  [1,]     1       1 1.9677303
>>>  [2,]     1       1 1.2731872
>>>  [3,]     1       1 1.1100931
>>>  [4,]     1       1 0.8194719
>>>  [5,]     1       1 0.6674880
>>>  [6,]     2       2 1.2236383
>>>  [7,]     2       2 0.9606766
>>>  [8,]     2       2 0.8654497
>>>  [9,]     2       2 0.5404112
>>> [10,]     2       2 0.3373457
>>>>
>>>
>>> As you can see it currently repeats the group column which is a
>>> shame (on the to do list to fix).
>>>
>>> Matthew
>>>
>>> http://datatable.r-forge.r-project.org/
>>>
>>>
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide 
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>>
>>
>> -- 
>> Joshua Wiley
>> Ph.D. Student, Health Psychology
>> University of California, Los Angeles
>> http://www.joshuawiley.com/
>>
>
> -- 
> Peter Dalgaard
> Center for Statistics, Copenhagen Business School
> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> Phone: (+45)38153501
> Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
>


