[R] multiple separators in sep argument for read.table?

(Ted Harding) Ted.Harding at manchester.ac.uk
Sat Apr 19 09:52:59 CEST 2008


On 19-Apr-08 06:19:09, Johan Jackson wrote:
> Hello,
> Is there any way to add multiple separators in the sep= argument
> in read.table? I would like to be able to create different columns
> if I see a white space OR a "/".
> 
> Thanks in advance,
> JJ

As well as Brian Ripley's suggestion for how to do it within R,
if you have access to the 'awk' program (as on all Unix/Linux
systems and, in principle, installable in Windows) then you can
pre-process the file outside of R along the following lines.
First, here is a test file "temp.txt":

R1C1 R1C2;R1C3
R2C1,R2C2 R2C3
R3C1,R3C2;R3C3

where each line has 3 fields, separated by any of " " or "," or ";"
and it is desired to obtain a purely comma-separated version of it.

awk '
  BEGIN{FS="[ ]|[;]|[,]";OFS=","};{$1=$1};{print $0}
' < temp.txt  > temp2.txt

produces a file temp2.txt with contents

R1C1,R1C2,R1C3
R2C1,R2C2,R2C3
R3C1,R3C2,R3C3

The logic is that the initialisation
  BEGIN{FS="[ ]|[;]|[,]";OFS=","}
sets up the Field Separator variable FS as a regular
expression which matches any one of " " ";" "," and the
Output Field Separator OFS to be ",".

$0 denotes the entire input line, and the "$1=$1" assigns
the first field to itself. Any assignment to a field makes
'awk' rebuild the whole input line $0 from its fields, and
it is at that point that the fields are joined using OFS,
so that "," appears between them in $0.
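
To see the effect of the "$1=$1" step in isolation, here is a
small sketch. The sample line and the shell variable names
"unchanged" and "rebuilt" are made up for the illustration.

```shell
# Sample line using both "," and ";" as separators (illustrative only).
line='R3C1,R3C2;R3C3'

# Without any field assignment, $0 is printed exactly as it was read.
unchanged=$(printf '%s\n' "$line" |
  awk 'BEGIN{FS="[ ]|[;]|[,]";OFS=","};{print $0}')

# With "$1=$1", awk rebuilds $0 from its fields, joined by OFS.
rebuilt=$(printf '%s\n' "$line" |
  awk 'BEGIN{FS="[ ]|[;]|[,]";OFS=","};{$1=$1};{print $0}')

echo "$unchanged"
echo "$rebuilt"
```

The first echo shows the line with its original mixture of
separators; the second shows it with "," throughout.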

Hence an 'awk' program to handle the case you describe
could be

awk '
  BEGIN{FS="[ ]|[/]";OFS=" "};{$1=$1};{print $0}
' < myrawfile  > myfinalfile
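
As a quick check of that program before running it on a real
file, one can feed it a single line. The sample line below is
invented, since I have not seen the actual data.

```shell
# Invented input line: fields separated by a space or by a "/".
sample='12 3/45'

# Same program as above, reading the sample from a pipe instead of a file.
converted=$(printf '%s\n' "$sample" |
  awk 'BEGIN{FS="[ ]|[/]";OFS=" "};{$1=$1};{print $0}')

echo "$converted"
```

This prints the three fields separated by single spaces, which
read.table can then read with its default sep.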

It gets slightly more interesting if your "white space"
separating two fields might be any number of consecutive
spaces or a TAB, say.

In that case something like

awk '
  BEGIN{FS="[ ][ ]*|[;]|[,]|[\t]";OFS=","};{$1=$1};{print $0}
' < myrawfile  > myfinalfile

might be needed. Here "[ ][ ]*" means "one space followed
by zero or more spaces", and "\t" is the notation for TAB.

If I change the test file above to

R1C1   R1C2;R1C3
R2C1,R2C2       R2C3
R3C1,R3C2;R3C3

where the long blank in the first line is 3 consecutive " ",
and the long blank in the second line is a single TAB,
then the last 'awk' program above (the one with "[ ][ ]*"
and "[\t]" in FS), run on this file, generates exactly the
same output as before.
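
Since a TAB is hard to see in an email, here is a sketch that
recreates that test file with printf, so that the TAB is written
explicitly as \t, and then runs the program on it. The use of
mktemp for a scratch directory and the variable names "tmpdir"
and "result" are my own choices for the illustration.

```shell
# Scratch directory so that no existing temp.txt is overwritten.
tmpdir=$(mktemp -d)

# Line 1 has three spaces, line 2 has a single TAB (\t) as the long blank.
printf 'R1C1   R1C2;R1C3\nR2C1,R2C2\tR2C3\nR3C1,R3C2;R3C3\n' > "$tmpdir/temp.txt"

# FS matches a run of spaces, or ";", or ",", or a TAB.
result=$(awk '
  BEGIN{FS="[ ][ ]*|[;]|[,]|[\t]";OFS=","};{$1=$1};{print $0}
' < "$tmpdir/temp.txt")

echo "$result"
rm -r "$tmpdir"
```

The three lines printed should be the purely comma-separated
version shown earlier.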

Just a thought! I'm always tempted to suggest that people
use 'awk' in conjunction with R, not only to deal with the
kind of relatively simple substitutions you describe, but
also for exploring and cleaning up the sort of mess that
people can send you after exporting a CSV file from an
Excel spreadsheet, etc. (It would go on for too long, to
give examples of this sort of thing.)

With best wishes,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 19-Apr-08                                       Time: 08:52:56
------------------------------ XFMail ------------------------------


