[R] sscanf equivalent

Paul Roebuck roebuck at wotan.mdacc.tmc.edu
Sun Oct 9 09:36:38 CEST 2005


On Fri, 7 Oct 2005, Prof Brian Ripley wrote:

> On Fri, 7 Oct 2005, Paul Roebuck wrote:
>
> > I have a data file from which I need to read portions of
> > data but data location/quantity can change from file to file.
> > I wrote some code and have a working solution but it seems
> > wasteful to have to do it this way. Here's the contrived
> > incomplete code.
> >
> >    datalines <- readLines(datafile.pathname)
> >    # marker will appear on line preceding and following
> >    # actual data
> >    offset.data <- grep("marker", datalines)
> >    datalines <- NULL
> >
> >    # grab first column of each assoc dataline
> >    data <- scan(datafile.pathname,
> >                 what = numeric(0),
> >                 skip = offset.data[1],
> >                 nlines = offset.data[2]-offset.data[1]-1,
> >                 flush = TRUE,
> >                 multi.line = FALSE,
> >                 quiet = TRUE)
> >    # output is vector of values
> >
> > I originally wrote code to parse the data from 'datalines'
> > using sub and strsplit, but it was woefully slower
> > and more complex than using scan. What I want
> > is a way to invoke something like scan on data that is
> > already in memory instead of on a filename.
>
> Why not use a text connection?

I tried that, but the result was far slower than the method above.

R> file.info(datafile.pathname)$size
[1] 944850
R> system.time(datalines<-readLines(datafile.pathname), TRUE)[3]
[1] 0.59
R> length(datalines)
[1] 67931
R> system.time(tconn<-textConnection(datalines), TRUE)[3]
[1] 52.97

Once the textConnection object was created, the scan
invocation using it took less than half the time of the
corresponding filename-based invocation. The problem is that
the filename-based scan was only taking about a second in
the first place, so that saving is dwarfed by the 53 seconds
spent creating the connection. And since grep doesn't accept
a textConnection as its argument, I still need the otherwise
unused 'datalines' variable and its associated memory. Even
if grep did support that, doing without the variable made
the timing worse still:

R> system.time(tconn<-textConnection(readLines(datafile.pathname)), TRUE)[3]
[1] 66.61
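
One possibility, sketched here under assumptions rather than
measured on the file above: since the cost appears to be in
building a text connection over all 67931 lines, the connection
could be opened over only the lines between the two markers.
The sample 'datalines' vector below is made up for illustration,
and 'idx' and 'block' are hypothetical names. Whether this is
fast enough on the real file is untested.

```r
# Made-up sample standing in for readLines(datafile.pathname)
datalines <- c("header",
               "marker",
               "1.5 a b",
               "2.5 c d",
               "3.5 e f",
               "marker",
               "trailer")

# Locate the marker lines once, in memory
offset.data <- grep("marker", datalines)

# Keep only the lines strictly between the first two markers
idx <- seq(offset.data[1] + 1,
           length.out = offset.data[2] - offset.data[1] - 1)
block <- datalines[idx]

# Open the text connection over the small block only,
# then grab the first column of each line as before
tconn <- textConnection(block)
data <- scan(tconn,
             what = numeric(0),
             flush = TRUE,
             multi.line = FALSE,
             quiet = TRUE)
close(tconn)
```

This keeps 'datalines' around for grep, as before, but the
connection is built from a subset rather than the whole vector.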


Any other thoughts?


# R version 2.1.1, 2005-06-20, powerpc-apple-darwin7.9.0

----------------------------------------------------------
SIGSIG -- signature too long (core dumped)



