[R] Lining up x-y datasets based on values of x

Christos Hatzis christos at nuverabio.com
Thu Feb 1 22:48:42 CET 2007


[Sorry I meant to reply to the list]

Thanks, Marc.

That's what I have done.
However, there seems to be a penalty from using merge repeatedly as it
appears to internally re-sort the datasets.  In my case the datasets are
long (~35K rows) and already sorted so this step adds considerable and
unnecessary overhead.  There doesn't seem to be an option for disabling
sorting. Setting 'sort=F' only affects sorting of the final data.frame.

> system.time(merge(nmr.spectra.serum[[1]], nmr.spectra.serum[[2]],
+                   by="V1", all=T, sort=T))
[1] 6.96 0.00 7.24   NA   NA
> system.time(merge(nmr.spectra.serum[[1]], nmr.spectra.serum[[2]],
+                   by="V1", all=T, sort=F))
[1] 6.82 0.00 7.14   NA   NA
> 

I was wondering if perhaps there is a parallel between this problem and
methods for lining up time-series data, since such data are also usually
sorted on the time dimension.
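
One possible way around repeated merge() calls, given as a minimal sketch
and not a tested solution for the real data: build the sorted union of all
x values once, then use match() to place each dataset's y values against
it. The data frames `a` and `b` below are made-up stand-ins for the
spectra, with x in V1 and y in V2 (an assumption based on by="V1" above).
It also assumes x values that should line up are exactly equal, as merge()
itself requires. Whether this is actually faster on ~35K rows would need
to be timed.

```r
# Hypothetical stand-ins for two spectra: V1 is x, V2 is y.
a <- data.frame(V1 = c(1, 2, 3, 5), V2 = c(10, 20, 30, 50))
b <- data.frame(V1 = c(2, 3, 4),    V2 = c(200, 300, 400))
spectra <- list(a, b)

# Sorted union of all x values across the datasets (computed once).
x_all <- sort(unique(unlist(lapply(spectra, function(d) d$V1))))

# For each dataset, find the row holding each x in the union.
# match() returns NA where an x value is absent, so y is NA there.
y_all <- sapply(spectra, function(d) d$V2[match(x_all, d$V1)])

# Assemble the lined-up result: one x column, one y column per dataset.
aligned <- data.frame(V1 = x_all, y_all)
names(aligned)[-1] <- paste0("y", seq_along(spectra))
```

The result has one row per distinct x and NA wherever a dataset has no
value at that x, which is the same layout merge(..., all=T) produces.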

-Christos  

-----Original Message-----
From: Marc Schwartz [mailto:marc_schwartz at comcast.net] 
Sent: Thursday, February 01, 2007 4:21 PM
To: christos at nuverabio.com
Cc: r-help at stat.math.ethz.ch
Subject: Re: [R] Lining up x-y datasets based on values of x

On Thu, 2007-02-01 at 15:45 -0500, Christos Hatzis wrote:
> Thanks Marc and Phil.
> 
> My dataset actually consists of 50+ individual files, so I will have 
> to do this one column at a time in a loop...
> I might look into SQL and outer joins as an alternative to avoid looping.
> 
> Thanks again.
> -Christos

If the files conform to some naming convention and/or are all located in a
common sub-directory, you can use list.files() to get the file names into a
vector.  If not, you could use file.choose() interactively.

Then use either a for() loop or sapply() to loop over the filenames, read
them into data frames using read.table() and merge them together in the
same loop.
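
The approach just described might look roughly like the sketch below. The
directory, the "spectrum_NN.txt" naming convention, the two-column
headerless file layout and the column names are all assumptions made up
for illustration; the example writes three tiny files to a temporary
directory so that it can be run as is.

```r
# Create three small example files (hypothetical stand-ins for real data).
demo_dir <- file.path(tempdir(), "spectra_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (i in 1:3) {
  d <- data.frame(V1 = i:(i + 2), V2 = (i:(i + 2)) * 10^i)
  write.table(d, file.path(demo_dir, sprintf("spectrum_%02d.txt", i)),
              row.names = FALSE, col.names = FALSE)
}

# Collect the file names that follow the (assumed) naming convention.
files <- list.files(demo_dir, pattern = "^spectrum_.*\\.txt$",
                    full.names = TRUE)

# Read the first file, then merge each remaining file into the result.
merged <- read.table(files[1], col.names = c("V1", "y1"))
for (i in seq_along(files)[-1]) {
  nxt <- read.table(files[i], col.names = c("V1", paste0("y", i)))
  merged <- merge(merged, nxt, by = "V1", all = TRUE)
}
```

Giving each file's y column a distinct name (y1, y2, ...) keeps merge()
from having to disambiguate duplicate column names as the loop proceeds.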

When it comes to basic data manipulation like this, loops are not a bad
thing. The overhead of a loop is typically outweighed by the file I/O and
related considerations.

HTH,

Marc


