[R] combine many .csv files into a single file/data frame

(Ted Harding) Ted.Harding at nessie.mcc.ac.uk
Sat Oct 16 11:39:05 CEST 2004


On 15-Oct-04 bogdan romocea wrote:
> I have a few hundred .csv files which I need to put
> together (full outer joins on a common variable) to do a
> factor analysis. Each file may contain anywhere from a few
> hundred to a few thousand rows. What would be the most
> efficient way to do this in R? Please include some sample
> code if applicable.

Sorry, the example I posted last night was written too hastily
and does not illustrate your precise query well. Here is a
better one, with explanation.

$ cat j1
1,a
1,b
2,c
2,d
2,e
3,f
3,g
4,h
4,i
5,j

$ cat j2
1,A
1,B
1,C
2,D
2,E
4,F
4,G
5,H
6,I
6,J

$ join -t , -a 1 -a 2 -o 0,1.2,2.2 j1 j2
1,a,A
1,a,B
1,a,C
1,b,A
1,b,B
1,b,C
2,c,D
2,c,E
2,d,D
2,d,E
2,e,D
2,e,E
3,f,
3,g,
4,h,F
4,h,G
4,i,F
4,i,G
5,j,H
6,,I
6,,J

Explanation of options:

"-t ,"    Input and output field separator is "," (for CSV)
"-a 1"    Output a line for every line of j1 not matched in j2
"-a 2"    Output a line for every line of j2 not matched in j1
"-o 0,1.2,2.2"  Output field format specification:
    0 denotes the match (join) field (needed when using "-a")
    1.2 denotes field 2 from file 1 ("j1")
    2.2 denotes field 2 from file 2 ("j2")
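"j1" and "j2" are of course the two files to be joined. Both must
already be sorted on the join field, in the order that
"sort -t , -k 1,1" would give (so "10" sorts before "2").

Using both "-a 1" and "-a 2" gives you the full outer join which you want.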
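
As a small sketch of the sort-then-join sequence, the following sorts
two unsorted inputs on the key and then does the full outer join. The
file names raw1, raw2, s1, s2 and their contents are made up for
illustration, and "mktemp -d" is assumed to be available (it is on
GNU and BSD systems). LC_ALL=C is set so that sort and join agree on
the collating order.

```shell
# Work in a scratch directory so nothing existing is overwritten.
work=$(mktemp -d)
cd "$work" || exit 1

# Hypothetical unsorted inputs with the same key,value layout as j1/j2.
printf '%s\n' '3,f' '1,a' '2,c' > raw1
printf '%s\n' '4,F' '1,A' '2,D' > raw2

# join expects both inputs sorted on the join field (field 1 here).
LC_ALL=C sort -t , -k 1,1 raw1 > s1
LC_ALL=C sort -t , -k 1,1 raw2 > s2

# Full outer join: keep unpaired lines from both files.
result=$(LC_ALL=C join -t , -a 1 -a 2 -o 0,1.2,2.2 s1 s2)
printf '%s\n' "$result"
```

Keys 3 and 4 each occur in only one input, so they come out with an
empty field on the other side.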

This command only works for two files at a time (and you must give
two). To join several files you would have to loop through them on
the lines of

$ join -t , -a 1 -a 2 -o 0,1.2,2.2 j1 j2 > J

which creates a file "J" which is the full outer join of "j1", "j2".

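Then

$ join -t , -a 1 -a 2 -o 0,1.2,1.3,2.2 J j3 > J.new && mv J.new J

and so on through j4, j5, ... Two points to note: the output must not
be redirected straight back to "J", since the shell truncates "J"
before join reads it (hence "J.new"); and the "-o" list needs one
more "1.n" field each time, since "J" gains a column at every step.

For your "few hundred files" this is best done with a loop like

$ o=0,1.2,1.3 ; n=3
$ for i in * ; do join -t , -a 1 -a 2 -o $o,2.2 J "$i" > J.new &&
    mv J.new J ; n=$((n+1)) ; o=$o,1.$n ; done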

having first done it for j1, j2 as above and put these two files
out of sight in a different directory. J itself also needs to
be out of sight, otherwise it will get joined to itself at some
stage. E.g. where "J" is written above it could in fact be
written as "../joins/J" and similarly for "../joins/j1" and
"../joins/j2".

E.g. first move j1 & j2 to ../joins and then do

$ join -t , -a 1 -a 2 -o 0,1.2,2.2 ../joins/j1 ../joins/j2 > ../joins/J

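and then, writing via a temporary file (so that "J" is not emptied
before it is read) and adding one field to the "-o" list per file
(so that the columns already in "J" are kept),

$ o=0,1.2,1.3 ; n=3
$ for i in * ; do
    join -t , -a 1 -a 2 -o $o,2.2 ../joins/J "$i" > ../joins/J.new &&
      mv ../joins/J.new ../joins/J
    n=$((n+1)) ; o=$o,1.$n
  done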
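
Putting the pieces together, here is a sketch of the whole procedure
as a shell function. It is only an illustration under stated
assumptions: "join_all" is a made-up helper name, every input is
assumed to be a two-column "key,value" file ending in ".csv", and
"mktemp -d" is assumed to be available. It sorts each file first,
always writes the join to a separate file, and grows the "-o" list
by one field per input.

```shell
# join_all DIR OUT: full outer join on field 1 of all two-column
# "key,value" CSV files in DIR, written to OUT.
# (join_all is a hypothetical helper, not part of any tool.)
join_all() {
    dir=$1 ; out=$2
    tmp=$(mktemp -d) || return 1
    spec=0       # -o list for the accumulated file
    n=1          # number of fields in the accumulated file
    first=yes
    for f in "$dir"/*.csv ; do
        [ -e "$f" ] || continue
        # join needs input sorted on the join field, in a fixed collation.
        LC_ALL=C sort -t , -k 1,1 "$f" > "$tmp/next"
        if [ "$first" = yes ] ; then
            mv "$tmp/next" "$tmp/acc"
            n=2 ; spec=0,1.2 ; first=no
            continue
        fi
        # Write to a separate file: redirecting onto an input truncates it.
        LC_ALL=C join -t , -a 1 -a 2 -o "$spec,2.2" \
            "$tmp/acc" "$tmp/next" > "$tmp/new" || return 1
        mv "$tmp/new" "$tmp/acc"
        n=$((n+1)) ; spec=$spec,1.$n
    done
    mv "$tmp/acc" "$out" && rm -rf "$tmp"
}

# Usage with three made-up files:
demo=$(mktemp -d)
printf '%s\n' '1,a' '2,b' > "$demo/x1.csv"
printf '%s\n' '2,B' '3,C' > "$demo/x2.csv"
printf '%s\n' '1,p' '3,q' > "$demo/x3.csv"
join_all "$demo" "$demo/all.out"
cat "$demo/all.out"
```

The result has one column per input file, with an empty field wherever
a key is absent from that file, and can then be read into R in one go.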


Sorry for the previous sloppy response, and hoping the above
helps.

Best wishes,
Ted.



--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861  [NB: New number!]
Date: 16-Oct-04                                       Time: 10:39:05
------------------------------ XFMail ------------------------------



