[R] extracting data using strings as delimiters

Marc Schwartz marc_schwartz at comcast.net
Tue Sep 25 23:15:57 CEST 2007


On Tue, 2007-09-25 at 16:39 -0400, lucy b wrote:
> Dear List,
> 
> I have an ascii text file with data I'd like to extract. Example:
> 
> Year Built:  1873 Gross Building Area:  578 sq ft
> Total Rooms:  6 Living Area:  578 sq ft
> 
> There is a lot of data I'd like to ignore in each record, so I'm
> hoping there is a way to use strings as delimiters to get the data I
> want (e.g. tell R to take data between "Built:" and "Gross" -
> incidentally, not always numeric). I think an ugly way would be to
> start at the end of each record and use a substitution expression to
> chip away at it, but I'm afraid it will take forever to run. Is there
> a way to use strings as delimiters in an expression?
> 
> Thanks in advance for ideas.
> 
> LB

I am not aware of any base R import function, such as read.table(), that
accepts a regex as a field delimiter. If your text file consistently uses
the colon ':' as a separator, you might be able to use that. Each of the
above lines would then be broken into 3 fields using:

DF <- read.table("YourFile.txt", sep = ":")

> DF
           V1                         V2          V3
1  Year Built   1873 Gross Building Area   578 sq ft
2 Total Rooms              6 Living Area   578 sq ft


You could then parse them further using appropriate functions if needed,
such as gsub():

> as.data.frame(lapply(DF[, -1], function(x) gsub("[^0-9]", "", x)))
    V2  V3
1 1873 578
2    6 578


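This gives you the digits from each field in two columns. Note that
gsub() returns character strings, so wrap the result in as.numeric() if
you need actual numbers, and that this pattern discards any values that
are not numeric. For further processing you would also need to know which
label each row belongs to, for example because the rows come in some
predictable or alternating order. See ?gsub and ?regex for more
information.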
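
As a minimal sketch of the steps above, the following reads the two sample
lines from a character vector instead of a file (the text= argument stands
in for "YourFile.txt") and converts the extracted digits to numbers. The
names txt and num are illustrative only, and the pattern assumes whole
numbers, since it would also strip a decimal point.

```r
# Sample lines copied from the original post (stand-in for the real file)
txt <- c("Year Built:  1873 Gross Building Area:  578 sq ft",
         "Total Rooms:  6 Living Area:  578 sq ft")

# Split each line on ':' into three fields, keeping them as character
DF <- read.table(text = txt, sep = ":", stringsAsFactors = FALSE)

# Drop the label column, strip every non-digit, then convert to numeric
num <- as.data.frame(lapply(DF[, -1],
                            function(x) as.numeric(gsub("[^0-9]", "", x))))

str(num)
```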

Hope that provides some help. You might also want to look at ?readLines
and ?strsplit as other ways to read in the data and then post-process it
once it is in an R object.
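
One possible way to combine those functions is sketched below, assuming
each record sits on a single line. The helper between() is hypothetical
(it is not a base R function); it uses sub() with a capture group to pull
out whatever lies between two literal strings, so non-numeric values are
kept as well. The strsplit() call shows the alternative of treating the
labels themselves as the delimiter.

```r
# Hypothetical helper: return the text between two literal delimiter
# strings, or NA for lines that do not contain both of them.
between <- function(x, start, end) {
  # \Q...\E makes the delimiters match literally (requires perl = TRUE)
  pattern <- paste0("^.*?\\Q", start, "\\E\\s*(.*?)\\s*\\Q", end, "\\E.*$")
  hit <- grepl(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[hit] <- sub(pattern, "\\1", x[hit], perl = TRUE)
  out
}

# In practice: lines <- readLines("YourFile.txt")
lines <- c("Year Built:  1873 Gross Building Area:  578 sq ft",
           "Total Rooms:  6 Living Area:  578 sq ft")

built <- between(lines, "Built:", "Gross")
rooms <- between(lines, "Rooms:", "Living")

# Alternative: split the first line on its labels, then drop empty pieces
parts  <- strsplit(lines[1], "\\s*(Year Built|Gross Building Area):\\s*",
                   perl = TRUE)[[1]]
fields <- parts[nzchar(parts)]
```

The helper returns character vectors, so apply as.numeric() afterwards to
any field you know to be numeric.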

Marc Schwartz


