[R] good and bad ways to import fixed column data (rpy)

Ross Boylan ross at biostat.ucsf.edu
Mon Aug 17 06:16:29 CEST 2009


To quote explicitly the passage I mentioned from the R Data Import/Export manual:


<QUOTE>
   Function `read.fwf' provides a simple way to read such files,
specifying a vector of field widths.  The function reads the file into
memory as whole lines, splits the resulting character strings, writes
out a temporary tab-separated file and then calls `read.table'.  This
is adequate for small files, but for anything more complicated we
recommend using the facilities of a language like `perl' to pre-process
the file.  
</QUOTE>

Note particularly the final sentence.
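
For what it's worth, the kind of pre-processing the manual has in mind can be sketched roughly as below. This is only an illustration, not the script I actually ran: it assumes Python 3, the function name fixed_to_csv and the two sample records are made up, and the 22/10 layout is simply borrowed from the example quoted further down.

```python
import csv
import io


def fixed_to_csv(lines, widths, names, out):
    """Split fixed-width lines at the given widths and write them as CSV."""
    # Work out the (start, end) offset of every field once, up front.
    bounds = []
    start = 0
    for w in widths:
        bounds.append((start, start + w))
        start += w

    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(names)
    for line in lines:
        line = line.rstrip("\r\n")
        # Slicing past the end of a short line just gives an empty field.
        writer.writerow([line[a:b].strip() for a, b in bounds])


# Made-up sample records: a 22-character city field, then a 10-character number.
sample = ["%-22s%10d" % ("City A", 12345),
          "%-22s%10d" % ("City B", 678)]

buf = io.StringIO()
fixed_to_csv(sample, [22, 10], ["city", "population"], buf)
csv_text = buf.getvalue()
```

With a real file one would pass open file objects instead of the list and the StringIO buffer, and then load the result in R with read.csv.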

Ross



On Sun, 2009-08-16 at 19:37 -0400, Wensui Liu wrote:
> Gabor made a good point.
> Here is an example I copied from my blog.
>  
> ##############################################
> # READ FIXED-WIDTH DATA FILE WITH read.fwf() #
> # ------------------------------------------ #
> # EQUIVALENT SAS CODE:                       #
> # filename data 'E:\sas\fixed.txt';          #
> # data test;                                 #
> #   infile data truncover;                   #
> #   input @1 city $ 1 - 22 @23 population;   #
> # run;                                       #
> ##############################################
> 
> # OPEN A CONNECTION TO THE DATA FILE
> data <- file(description = "e:\\sas\\fixed.txt", open = "r")
> 
> # widths = c(...)     ==> SPECIFIES COLUMN WIDTHS
> # col.names = c(...)  ==> GIVES COLUMN NAMES
> # colClasses = c(...) ==> DEFINES COLUMN CLASSES
> test <- read.fwf(data, header = FALSE, widths = c(22, 10),
>                  col.names = c("city", "population"),
>                  colClasses = c("character", "numeric"))
> 
> close(data)
>  
> On Sun, Aug 16, 2009 at 6:36 PM, Gabor Grothendieck<ggrothendieck at gmail.com> wrote:
> > Check out ?read.fwf
> >
> > On Sun, Aug 16, 2009 at 4:49 PM, Ross Boylan<ross at biostat.ucsf.edu> wrote:
> >> Recorded here so others may avoid my mistakes.
> >>
> >> I have a bunch of files containing fixed width data.  The R Data guide
> >> suggests that one pre-process them with a script if they are large.
> >> They were 50MB and up, and I needed to process another file that gave
> >> the layout of the lines anyway.
> >>
> >> I tried rpy to not only preprocess but create the R data object in one
> >> go.  It seemed like a good idea; it wasn't.  The core operation was to
> >> build up a string for each line that looked like
> >> "data.frame(var1=val1, var2=val2, [etc])" and then rbind this to the
> >> data.frame so far.  I did this with r(mycommand string). Almost all the
> >> values were numeric.
> >>
> >> This was incredibly slow: it had not completed after running
> >> overnight.
> >>
> >> So, the lesson is, don't do that!
> >>
> >> I switched to preprocessing that created a csv file, and then read.csv
> >> from R.  This worked in under a minute.  The result had dimension
> >> 150913 x 129.
> >>
> >> The good news in rpy was that I found objects persisted across calls to
> >> the r object.
> >>
> >> Exactly why this was so slow I don't know.  The two obvious suspects are
> >> the speed of rbind, which I think is pretty inefficient, and the
> >> overhead of crossing the python/R boundary.
> >>
> >> This was on Debian Lenny:
> >> python-rpy                    1.0.3-2
> >> Python 2.5.2
> >> R 2.7.1
> >>
> >> rpy2 is not available in Lenny, though it is in development versions of
> >> Debian.
> >>
> >> Ross Boylan
> >>
> >> ______________________________________________
> >> R-help at r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> >>
> >
>  
>  
>  
> -- 
> ==============================
> WenSui Liu
> Blog   : statcompute.spaces.live.com
> Tough Times Never Last. But Tough People Do.  - Robert Schuller
> ==============================
>  
>