[R] reading formatted txt file into a data frame

Gabor Grothendieck ggrothendieck at gmail.com
Thu May 6 20:38:43 CEST 2010


This is very similar to the solution in Jim's post,
except that using strapply lets the regular
expressions be slightly simpler, and a few of them
differ in other ways as well.  It's not always
clear from an example what the general case is,
so the regular expressions may need to be tweaked
once the full data is available, but this does work
on the sample shown.

Here:

\\d+ means one or more digits
[^]]+[^] ] means one or more non-] characters followed by a
   final character which is neither ] nor space
\\S+ means one or more non-space characters
\\S+ . (.*) means one or more non-space characters followed by
   space followed by any character followed by space followed by any
   sequence of characters

In each case the portion of the regular expression
in parentheses is captured and returned by
strapply.
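
As a quick check of the four patterns above, here is a
minimal sketch that runs them against the first sample
line.  It uses base R's regexec and regmatches rather than
strapply, so it is an illustration of what each pattern
captures, not the gsubfn code itself.  The names line,
patterns and captured are made up for this example.

```r
# One sample line from the original question.
line <- "[ID: 001 ] [Writer: Steven Moffat ] [Rating: 8.9 ] Doctor Who"

# The four patterns discussed above, named by target column.
patterns <- c(
  ID     = "ID: (\\d+)",
  Writer = "Writer: ([^]]+[^] ])",
  Rating = "Rating: (\\S+)",
  Text   = "Rating: \\S+ . (.*)"
)

# regexec() reports the whole match followed by each parenthesised
# group, so element 2 of the result is the first captured group.
captured <- vapply(
  patterns,
  function(p) regmatches(line, regexec(p, line))[[1]][2],
  character(1)
)
print(captured)
```

Note that the Writer pattern drops the trailing space before
the closing ] because its last character class excludes
both ] and space.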


library(gsubfn)

# input is the character vector of lines read in Jim's post,
# i.e. input <- readLines('tmp.txt')
data.frame(ID = strapply(input, "ID: (\\d+)", c, simplify = TRUE),
   Writer = strapply(input, "Writer: ([^]]+[^] ])", c, simplify = TRUE),
   Rating = strapply(input, "Rating: (\\S+)", c, simplify = TRUE),
   Text = strapply(input, "Rating: \\S+ . (.*)", c, simplify = TRUE),
   stringsAsFactors = FALSE)
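
For anyone without gsubfn installed, the same data frame can
be sketched in base R.  This is only an approximation of the
strapply call above under stated assumptions: first_group is
a hypothetical helper (not part of gsubfn or base R), the
input lines are typed inline instead of read from tmp.txt,
and the looser Text pattern at the end is one possible tweak,
not something from the original posts.

```r
# Sample lines as in the original question, written inline.
input <- c(
  "[ID: 001 ] [Writer: Steven Moffat ] [Rating: 8.9 ] Doctor Who",
  "[ID: 002 ] [Writer: Joss Whedon ] [Rating: 8.8 ] Buffy",
  "[ID: 003 ] [Writer: J. Michael Straczynski ] [Rating: 7.4 ] Babylon [5]"
)

# Hypothetical helper: first captured group for each element of x,
# NA where the pattern does not match.
first_group <- function(pattern, x) {
  m <- regmatches(x, regexec(pattern, x))
  vapply(m, function(g) if (length(g) >= 2) g[2] else NA_character_,
         character(1))
}

DF <- data.frame(
  ID     = first_group("ID: (\\d+)", input),
  Writer = first_group("Writer: ([^]]+[^] ])", input),
  Rating = first_group("Rating: (\\S+)", input),
  Text   = first_group("Rating: \\S+ . (.*)", input),
  stringsAsFactors = FALSE
)
print(DF)

# The Text pattern expects a space after the closing "]".  The third
# line as it appears in Jim's post has none, so the pattern fails there.
no_space <- "[ID: 003 ] [Writer: J. Michael Straczynski ] [Rating: 7.4 ]Babylon"
strict <- first_group("Rating: \\S+ . (.*)", no_space)

# One possible tweak (an assumption about the full data): allow any
# amount of whitespace after "]" and start the capture at a non-space.
loose_pattern <- "Rating: \\S+ \\]\\s*(\\S.*)"
loose     <- first_group(loose_pattern, no_space)
loose_all <- first_group(loose_pattern, input)
```

Returning NA for a non-match keeps every column the same
length, so data.frame does not fail when one line deviates
from the expected layout.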



On Thu, May 6, 2010 at 12:24 PM, jim holtman <jholtman at gmail.com> wrote:
> Try this:
>
>> cat(c("[ID: 001 ] [Writer: Steven Moffat ] [Rating: 8.9 ] Doctor Who",
> +       "[ID: 002 ] [Writer: Joss Whedon ] [Rating: 8.8 ] Buffy",
> +       "[ID: 003 ] [Writer: J. Michael Straczynski ] [Rating: 7.4 ]Babylon"),
> +       sep = "\n", file = "tmp.txt")
>>
>> # read in the data and parse it assuming it has the same structure
>> input <- readLines('tmp.txt')
>> # parse it item by item
>> x.id <- sub(".*\\[ID: ([[:digit:]]+).*", "\\1", input)
>> x.writer <- sub(".*\\[Writer:([^]]+).*", '\\1', input)
>> x.rating <- sub(".*\\[Rating: ([0-9.]+).*", '\\1', input)
>> x.prog <- sub(".*\\](.*)", '\\1', input)
>> #create dataframe
>> data.frame(id=x.id, writer=x.writer, rating=x.rating, prog=x.prog)
>   id                   writer rating        prog
> 1 001           Steven Moffat     8.9  Doctor Who
> 2 002             Joss Whedon     8.8       Buffy
> 3 003  J. Michael Straczynski     7.4     Babylon
>>
>
>
> On Thu, May 6, 2010 at 9:58 AM, Tony B <tony.breyal at googlemail.com> wrote:
>
>> Dear all
>>
>> Let's say I have a plain text file as follows:
>>
>> > cat(c("[ID: 001 ] [Writer: Steven Moffat ] [Rating: 8.9 ] Doctor Who",
>> +       "[ID: 002 ] [Writer: Joss Whedon ] [Rating: 8.8 ] Buffy",
>> +       "[ID: 003 ] [Writer: J. Michael Straczynski ] [Rating: 7.4 ] Babylon [5]"),
>> +       sep = "\n", file = "tmp.txt")
>>
>> I would somehow like to read this file into R and convert it into a
>> data frame like this:
>>
>> > DF <- data.frame(ID = c("001", "002", "003"),
>> +                 Writer = c("Steven Moffat", "Joss Whedon", "J. Michael Straczynski"),
>> +                 Rating = c("8.9", "8.8", "7.4"),
>> +                 Text = c("Doctor Who", "Buffy", "Babylon [5]"),
>> +                 stringsAsFactors = FALSE)
>>
>>
>> My initial thoughts were to use readLines on the text file and maybe
>> do some regular expressions and also use strsplit(..); but having
>> confused myself after several attempts I was wondering if there is a
>> way, perhaps using maybe read.table instead?  My end goal is to
>> hopefully convert DF into an XML structure.
>>
>> Thank you kindly in advance for your time,
>> Tony Breyal
>>
>> # Windows Vista
>> > sessionInfo()
>> R version 2.11.0 (2010-04-22)
>> i386-pc-mingw32
>>
>> locale:
>> [1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United
>> Kingdom.1252    LC_MONETARY=English_United Kingdom.1252
>> LC_NUMERIC=C                            LC_TIME=English_United Kingdom.
>> 1252
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods
>> base
>>
>> other attached packages:
>> [1] XML_2.8-1
>>
>> loaded via a namespace (and not attached):
>> [1] tools_2.11.0
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>
>
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem that you are trying to solve?


