[R] good and bad ways to import fixed column data (rpy)

Gabor Grothendieck ggrothendieck at gmail.com
Mon Aug 17 00:36:38 CEST 2009


Check out ?read.fwf
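
Something along these lines, with widths, names and classes made up
(you would take them from your layout file):

    # widths, col.names and colClasses below are placeholders from the layout file
    dat <- read.fwf("mydata.txt",
                    widths = c(8, 2, 5),
                    col.names = c("id", "group", "score"),
                    colClasses = c("character", "factor", "numeric"))

(read.fwf does more work per line than read.csv, which is presumably why
the manual still recommends preprocessing very large files.)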

On Sun, Aug 16, 2009 at 4:49 PM, Ross Boylan <ross at biostat.ucsf.edu> wrote:
> Recorded here so others may avoid my mistakes.
>
> I have a bunch of files containing fixed-width data.  The R Data
> Import/Export guide suggests pre-processing them with a script if they
> are large.  Mine were 50 MB and up, and I needed to process another
> file that gave the layout of the lines anyway.
>
> I tried rpy to not only preprocess but also create the R data object in
> one go.  It seemed like a good idea; it wasn't.  The core operation was
> to build up a string for each line that looked like
> "data.frame(var1=val1, var2=val2, [etc])" and then rbind this to the
> data.frame accumulated so far, evaluating each string with a call to the
> r object.  Almost all the values were numeric.
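>
> Roughly, the pattern was equivalent to this R (variable names
> illustrative):
>
>     result <- data.frame()
>     for (line in lines) {
>         # build a one-row data frame from the parsed line ...
>         row <- data.frame(var1 = 1.0, var2 = 2.0)   # values would come from the line
>         # ... then copy the whole accumulated frame to append it
>         result <- rbind(result, row)
>     }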
>
> This was incredibly slow; it was unable to complete after running
> overnight.
>
> So, the lesson is, don't do that!
>
> I switched to a preprocessing step that created a CSV file, and then
> used read.csv from R.  This worked in under a minute.  The result had
> dimension 150913 x 129.
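>
> The R side was then essentially a single call (file name illustrative):
>
>     dat <- read.csv("preprocessed.csv")
>     dim(dat)   # 150913    129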
>
> The good news about rpy was that I found objects persisted across calls
> to the r object.
>
> Exactly why the rpy approach was so slow I don't know.  The two obvious
> suspects are the speed of rbind, which I think is pretty inefficient,
> and the overhead of crossing the Python/R boundary.
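>
> If rbind is the culprit, the usual fix is to collect the rows in a list
> and bind once at the end, e.g. (parse_line is a made-up helper):
>
>     rows <- vector("list", length(lines))
>     for (i in seq_along(lines)) {
>         rows[[i]] <- parse_line(lines[i])   # one-row data frame per line
>     }
>     dat <- do.call(rbind, rows)             # a single rbind instead of one per row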
>
> This was on Debian Lenny:
> python-rpy                    1.0.3-2
> Python 2.5.2
> R 2.7.1
>
> rpy2 is not available in Lenny, though it is in development versions of
> Debian.
>
> Ross Boylan
>



