[R] preprocessing data

Tue Aug 16 20:07:24 CEST 2005

Thank you Gabor,

Jean

On Tue, 16 Aug 2005, Gabor Grothendieck wrote:

> On 8/16/05, Jean Eid <jeaneid at chass.utoronto.ca> wrote:
> > Dear all,
> >
> > My question is concerning the line
> > "This is adequate for small files, but for anything more complicated we
> > recommend using the facilities of a language like perl to pre-process
> > the file."
> >
> > in the import/export manual.
> >
> > I have a large fixed-width file that I would like to preprocess in Perl or
> > awk. The problem is that I do not know where to start. Does anyone have a
> > simple example of how to turn a fixed-width file into a CSV or
> > tab-delimited file with either of these tools? I guess I am looking for
> > something like a "Perl for dummies" or "awk for dummies" that does this.
> > Any pointers to websites will be greatly appreciated.
> >
>
>
>
> Try to do it in R first.  I have found that I rarely need to go to
> an outside language to massage my data.
>
> 	# fixed-width fields of 10 and 5 characters
> 	Lines <- readLines("mydata.dat")
> 	# substring(text, first, last)
> 	data.frame( field1 = as.numeric(substring(Lines, 1, 10)),
> 		field2 = as.numeric(substring(Lines, 11, 15)) )
>
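
The substring() approach above generalizes to any number of fields. A minimal sketch, assuming the field widths are known in advance; the helper name split_fixed and the sample lines are made up for illustration:

```r
# Split fixed-width lines into columns; `widths` holds one width per field.
# split_fixed is a hypothetical helper, not part of base R.
split_fixed <- function(lines, widths) {
  ends   <- cumsum(widths)      # position of the last character of each field
  starts <- ends - widths + 1   # position of the first character of each field
  # substring(text, first, last), applied once per field
  fields <- lapply(seq_along(widths), function(i) {
    trimws(substring(lines, starts[i], ends[i]))
  })
  names(fields) <- paste0("field", seq_along(widths))
  as.data.frame(fields, stringsAsFactors = FALSE)
}

# Made-up sample: a 10-character field followed by a 5-character field
sample_lines <- c("     12.50   42", "      3.25    7")
parsed <- split_fixed(sample_lines, c(10, 5))
parsed$field1 <- as.numeric(parsed$field1)
parsed$field2 <- as.numeric(parsed$field2)
print(parsed)
```

For a real file, replace sample_lines with readLines("mydata.dat") and adjust the widths vector.
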
> If you do find that you have speed or memory problems that
> require that you go outside of R to preprocess your data
> then the gawk version of awk has a FIELDWIDTHS variable that
> makes handling fixed fields very easy.  The gawk program below
> assumes two fields of widths 10 and 5, respectively, which
> are set in the first line.  The second line is then executed
> once for each input line: the dummy assignment $1 = $1 forces
> gawk to rebuild the line from its fields (otherwise print
> would output the line unchanged), and print then writes
> the line with a space between successive fields:
>
> 	BEGIN { FIELDWIDTHS = "10 5" }
> 	{ $1 = $1; print }
>
> In R, do the following assuming the above two lines are in
> split.awk:
>
> 	read.table(pipe("gawk -f split.awk mydata.dat"))
>
> or else run gawk outside of R then read in the output file
> created:
>
> 	gawk -f split.awk mydata.dat > mydata2.dat
>
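
If the goal is simply a CSV or tab-delimited copy of the fixed-width file, the conversion can also be sketched entirely in R with read.fwf() from the utils package. This is only an illustration under assumptions: the widths of 10 and 5, the column names, and the sample data are made up, and temporary files stand in for mydata.dat:

```r
# Made-up fixed-width sample written to a temporary file
infile  <- tempfile(fileext = ".dat")
outfile <- tempfile(fileext = ".csv")
writeLines(c("     12.50   42", "      3.25    7"), infile)

# read.fwf() splits each line at the given widths
dat <- read.fwf(infile, widths = c(10, 5),
                col.names = c("field1", "field2"))

# Write as CSV; write.table(dat, outfile, sep = "\t") would give
# tab-delimited output instead
write.csv(dat, outfile, row.names = FALSE)
csv_lines <- readLines(outfile)
print(csv_lines)
```

read.fwf() can be slow on very large files, which is where the gawk route may still be worth it.
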
> For more information, google for
>
> 	FIELDWIDTHS gawk
>
> for that portion of the manual on FIELDWIDTHS -- it includes
> an example and, of course, the whole manual is there too.  The
> book by Kernighan et al is also good.
>
> I have used both awk and perl and I think it's unlikely you
> would need perl given that you have R at your disposal for
> the hard parts and awk is easier to learn, better designed
> and more focused on this sort of task.
>