[R] "Best" way to merge 300+ .5MB dataframes?

Prof Brian Ripley ripley at stats.ox.ac.uk
Tue Aug 12 08:56:14 CEST 2014


On 12/08/2014 07:07, David Winsemius wrote:
>
> On Aug 11, 2014, at 8:01 PM, John McKown wrote:
>
>> On Mon, Aug 11, 2014 at 9:43 PM, Thomas Adams <tea3rd at gmail.com> wrote:
>>> Grant,
>>>
>>> Assuming all your filenames are something like file1.txt,
>>> file2.txt, file3.txt... and using the Mac OS X Terminal app (after you
>>> cd to the directory where your files are located):
>>>
>>> This will strip off the 1st lines, that is, your header lines:
>>>
>>> for file in *.txt; do
>>>   # BSD sed on OS X needs a (possibly empty) backup suffix after -i
>>>   sed -i '' '1d' "${file}"
>>> done
>>>
>>> Then, do this:
>>>
>>> cat *.txt > newfilename.txt
>>>
>>> Doing both should only take a few seconds, depending on your file sizes.
>>>
>>> Cheers!
>>> Tom
>>>
>>
>> Using sed hadn't occurred to me. I guess I'm just "awk-ward" <grin/>.
>> A slightly different way would be:
>>
>> for file in file*.txt;do
>>   sed '1d' "${file}"
>> done >newfilename.txt
>>
>> that way the original files are not modified.  But it strips out the
>> header on the 1st file as well. Not a big deal, but the read.table
>> will need to be changed to accommodate that. Also, it creates an
>> otherwise unnecessary intermediate file "newfilename.txt". To get the
>> 1st file's header, the script could:
>>
>> head -n 1 file1.txt >newfilename.txt
>> for file in file*.txt;do
>>    sed '1d' "${file}"
>> done >>newfilename.txt
>>
>> I really like having multiple answers to a given problem. Especially
>> since I have a poorly implemented version of "awk" on one of my
>> systems. It is the vendor's "awk" and conforms exactly to the POSIX
>> definition with no additions. So I don't have the FNR built-in
>> variable. Your implementation would work well on that system. Well, if
>> there were a version of R for it. It is a branded UNIX system which
>> was designed to be totally __and only__ POSIX compliant, with few
>> (maybe no) extensions at all. IOW, it stinks. No, it can't be
>> replaced. It is the z/OS system from IBM which is EBCDIC based and
>> runs on the "big iron" mainframe, system z.
>>
>> --
>
> On the Mac the awk equivalent is gawk. Within R you would use `system()` possibly using paste0() to construct a string to send.

For historical reasons this is actually part of R's configuration: see 
the AWK entry in R_HOME/etc/Makeconf.  (There is an SED entry too: not 
all sed's in current OSes are POSIX-compliant.)

Using system2() rather than system() is recommended for new code.
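
As a rough illustration of the header-handling loop quoted above, here is a minimal POSIX shell sketch. The function name merge_tables, the sample file contents and the output name merged.out are all assumptions made up for the example, not anything from the thread. The output deliberately uses a name that a *.txt pattern cannot match, so the merged file is never fed back into the loop.

```shell
# Sketch only: merge_tables, the sample data and merged.out are
# hypothetical names chosen for this illustration.

# merge_tables OUT FILE...
# Copy the header line from the first FILE only, then append the data
# rows (line 2 onward) of every FILE to OUT. Inputs are not modified.
merge_tables() {
    out=$1
    shift
    head -n 1 "$1" > "$out"
    for f in "$@"; do
        sed '1d' "$f"
    done >> "$out"
}

# Small demonstration in a scratch directory.
demo_dir=$(mktemp -d)
printf 'id,value\n1,a\n2,b\n' > "$demo_dir/file1.txt"
printf 'id,value\n3,c\n' > "$demo_dir/file2.txt"

merge_tables "$demo_dir/merged.out" "$demo_dir"/file*.txt
cat "$demo_dir/merged.out"
```

The merged file keeps a single header line, so it could then be read in R with read.table(..., header = TRUE).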

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Emeritus Professor of Applied Statistics, University of Oxford
1 South Parks Road, Oxford OX1 3TG, UK


