[R] Splitting a DF into rows according to a column

Johannes Graumann johannes_graumann at web.de
Tue Oct 5 14:16:40 CEST 2010


Stupid Joh wants to give you a big hug! Thanks! Why "rank" works but "order" 
not, I have still to figure out, though ...

Joh

On Monday 04 October 2010 17:30:32 peter dalgaard wrote:
> On Oct 4, 2010, at 16:57 , Johannes Graumann wrote:
> > Hi,
> > 
> > I'm turning my wheels on this and keep coming around to the same wrong
> > solution - please have a look and give a hand ...
> > 
> > The premise is: a DF like so
> > 
> >> loremIpsum <- "Lorem ipsum dolor sit amet, consectetur adipiscing elit.
> > Quisque leo ipsum, ultricies scelerisque volutpat non, volutpat et nulla.
> > Curabitur consequat ullamcorper tellus id imperdiet. Duis semper
> > malesuada nulla, blandit lobortis diam fringilla at. Vestibulum nec
> > tellus orci, eu sollicitudin quam. Phasellus sit amet enim diam.
> > Phasellus mattis hendrerit varius. Curabitur ut tristique enim. Lorem
> > ipsum dolor sit amet, consectetur adipiscing elit. Sed convallis, tortor
> > id vehicula facilisis, nunc justo facilisis tellus, sed eleifend nisi
> > lacus id purus. Maecenas tempus sollicitudin libero, molestie laoreet
> > metus dapibus eu. Mauris justo ante, mattis et pulvinar a, varius
> > pretium eros. Curabitur fringilla dui ac dui rutrum pretium. Donec sed
> > magna adipiscing nisi accumsan congue sed ac est. Vivamus lorem urna,
> > tristique quis accumsan quis, ullamcorper aliquet velit."
> > 
> >> tmpDF <- data.frame(Column1=rep(unlist(strsplit(loremIpsum," ")),
> >                     length.out=510),Column2=runif(510,min=0,max=1e8))
> > 
> > is to be split into DFs with 50 entries in an ordered manner according to
> > Column2 (the first DF is to contain the rows with the 50 largest numbers,
> > ...).
> > 
> > Here is what I have been doing:
> >> binSize <- 50
> >> splitMembership <-
> >   pmin(ceiling(order(tmpDF[["Column2"]],decreasing=TRUE)/binSize),
> >        floor(nrow(tmpDF)/binSize))
> > 
> >> splitList <- split(tmpDF,splitMembership)
> > 
> > Distribution seems to work ...
> > 
> >> sapply(splitList,nrow)
> > 
> > But this is NOT what I wanted ...
> > 
> >> sapply(splitList,function(x){max(x[["Column2"]])})
> > 
> > This was supposed to give me bins that are Column2-sorted: bin 1
> > should have a higher max than bin 2, bin 2 a higher max than bin 3, ...
> > 
> > Can anyone point out where (my now 3 reimplementations) fail?
> > 
> > Thanks, Stupid Joh
> 
> Dear Stupid Joh,
> 
> Have you considered something along the lines of
> 
> o <- order(-x$Column2)
> xx <- x[o,]
> split(xx, (seq_len(NROW(x))-1) %/% 50)
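
The sort-then-split suggestion quoted above can be sketched as follows. This is only a sketch under assumptions: `x` in the quoted code is taken to stand for `tmpDF`, the `letters` column and the fixed seed are stand-ins chosen so the example is reproducible, and the names `sorted` and `grp` are made up for illustration.

```r
set.seed(1)  # assumption: fixed seed, only to make the sketch reproducible
tmpDF <- data.frame(Column1 = rep(letters, length.out = 510),
                    Column2 = runif(510, min = 0, max = 1e8))
binSize <- 50

# Sort the rows by Column2, largest value first
sorted <- tmpDF[order(-tmpDF$Column2), ]

# Consecutive runs of binSize rows share one group label: 0, 0, ..., 1, 1, ...
grp <- (seq_len(nrow(sorted)) - 1) %/% binSize
splitList <- split(sorted, grp)

sapply(splitList, nrow)                               # bin sizes
sapply(splitList, function(d) max(d[["Column2"]]))    # per-bin maxima
```

Note that with 510 rows this yields ten bins of 50 plus a final bin holding the 10 leftover rows, whereas the pmin() construction folds the leftovers into the last full bin.
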
> 
> The above is a bit hard to follow, but it seems to work better with rank()
> instead of order():
>
> > splitMembership <-
> +   pmin(ceiling(rank(-tmpDF[["Column2"]])/binSize),floor(nrow(tmpDF)/binSize))
> > splitList <- split(tmpDF,splitMembership)
> > sapply(splitList,nrow)
> 
>  1  2  3  4  5  6  7  8  9 10
> 50 50 50 50 50 50 50 50 50 60
> 
> > sapply(splitList,function(x){max(x[["Column2"]])})
> 
>        1        2        3        4        5        6
> 99877498 90567877 81965382 69112280 59814266 52130373
>        7        8        9       10
> 41557660 32630212 21226996 11880032
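
A possible way to see the rank()/order() difference, as a small sketch: order() answers "which element comes 1st, 2nd, ...", while rank() answers "what place does each element take". split() needs one label per row, which is the second question. The vector `v` and the bin size of 2 below are made up for illustration.

```r
v <- c(10, 30, 20)   # hypothetical values; the largest is element 2

# order(): indices of the elements, listed from largest to smallest
o <- order(-v)       # element 2 first, then element 3, then element 1

# rank(): for each element in place, its position in the sorted sequence
r <- rank(-v)        # element 1 is 3rd, element 2 is 1st, element 3 is 2nd

# Without ties, rank() is the inverse permutation of order()
inv <- order(o)

# Bin membership with a bin size of 2
binFromRank  <- ceiling(r / 2)   # labels follow each element's own rank
binFromOrder <- ceiling(o / 2)   # labels are derived from indices, not ranks

split(v, binFromRank)    # the two largest values land in bin 1
split(v, binFromOrder)   # the largest value ends up in bin 2
```

Dividing order() output by the bin size therefore bins row indices rather than ranks, which would explain bin sizes that look right while the contents are not sorted.
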