[R] Matrix Manipulation R

David Winsemius dwinsemius at comcast.net
Sat Jul 4 18:40:53 CEST 2015


> On Jul 4, 2015, at 3:09 AM, Alex Kim <dumboisverydumb at gmail.com> wrote:
> 
> Hi guys,
> 
> Suppose I have an extremely large data frame with 2 columns and .5 mil
> rows. For example, the last 6 rows may look like this:
> .
> ..
> ...
> 89         100
> 93         120
> 95         125
> 101        NA
> 115        NA
> 123        NA
> 124        NA
> 
> I would like to manipulate this data frame to output a data frame that
> looks like:
> 
> 100        89, 93, 95
> 120        101, 115
> 125        123, 124
> 

> What would be the absolute quickest way to do this, given that there are
> many rows? Currently I have this:
> 
> # m is the large two column data frame
> end <- na.omit(m[,'V2']);
> out <- data.frame(End=end,
> Start=unname(sapply(split(m[,'V1'],findInterval(m[,'V1'],end))[as.character(0:c(length(end)-1))],paste,collapse='.')))
> 

This might be a little faster. It skips some of the steps in your version (the as.character() indexing and the unname() call):

 dput(m)
structure(list(V1 = c(89, 93, 95, 101, 115, 123, 124), V2 = c(100, 
120, 125, NA, NA, NA, NA)), .Names = c("V1", "V2"), row.names = c(NA, 
-7L), class = "data.frame")

end <- na.omit(m[,'V2'])
# this will only work if that vector (end) is sorted in increasing order
data.frame(End = end,
           Start = sapply( split( m$V1, 
                                 findInterval(m$V1, c(-Inf, end))), 
                          paste,collapse="," ) )
  End    Start
1 100 89,93,95
2 120  101,115
3 125  123,124
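
A self-contained sketch of the same idea, in case it helps to test on your own data. The function name group_starts() is made up for illustration, and the sortedness check, the fixed factor levels and the switch from sapply() to vapply() are my additions rather than anything required by the version above. I have not timed it on half a million rows.

```r
# Example data from the thread
m <- data.frame(V1 = c(89, 93, 95, 101, 115, 123, 124),
                V2 = c(100, 120, 125, NA, NA, NA, NA))

# group_starts() is a hypothetical helper name, not something from the thread
group_starts <- function(m) {
  end <- m$V2[!is.na(m$V2)]        # breakpoints, without na.omit()'s extra attributes
  stopifnot(!is.unsorted(end))     # findInterval() needs non-decreasing breakpoints
  # group i collects the V1 values in [previous end, end[i]); -Inf opens the first group
  grp <- findInterval(m$V1, c(-Inf, end))
  # fixed factor levels keep one (possibly empty) group per End value and
  # drop any V1 value at or above the last End instead of failing in data.frame()
  pieces <- split(m$V1, factor(grp, levels = seq_along(end)))
  data.frame(End = end,
             Start = vapply(pieces, paste, character(1), collapse = ",",
                            USE.NAMES = FALSE),
             stringsAsFactors = FALSE)
}

out <- group_starts(m)
print(out)
```

Note that a V1 value exactly equal to an End value lands in the next group, because findInterval() treats each interval as closed on the left.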


> However this is taking a little bit too long.
> 
> Thank you for your help!
> 
> 	[[alternative HTML version deleted]]

This is a plain-text mailing list and posting triplicate questions is poor form.

Do read the posting guide.


-- 
David Winsemius, MD
Alameda, CA, USA


