[R] Programcode and data in the same textfile

Ernst Hansen erhansen at math.ku.dk
Fri Jun 13 15:12:02 CEST 2003


My request for a way of having both data and R-code in the same
textfile resulted in a considerable number of very good suggestions,
which I will now summarize.

The boundary conditions for the problem were as follows: the data
should be written in the textfile in a format that is readable to the
human eye. This ruled out the 'transposed' way of writing the
data that is used in most help-files, e.g. in ?model.matrix.

As the purpose of the exercise is to make the textfile easy to read,
there is a limit to how complicated the extra code should be -
otherwise it would make matters worse.   I don't know if any of the
solutions below qualify in this sense - but I surely learned a lot
from them.






The most popular idea was to use textConnection() in combination with
read.table().  For instance, Thomas Hotz wrote it like this:


# Solution by Thomas Hotz

   MyFrame <- read.table(textConnection(c(

    'Sex    Respons',
    'Male   1',
    'Male   2',
    'Female 3',
    'Female 4'

   )), header = TRUE)


Gabor Grothendieck had a similar solution. James Holtman provided a
nifty trick to get rid of the strategically placed commas and
quotation marks, using escaped newlines:

# Solution by James Holtman

   MyFrame <- read.table(textConnection('\
    Sex    Respons \
    Male   1 \
    Male   2 \
    Female 3 \
    Female 4 \
   '), header = TRUE, skip = 1)
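
A variant that was not part of the thread, offered only as a sketch: in
more recent versions of R, read.table() accepts a text argument, and a
string literal may span several lines without escapes.  Assuming such a
version, the data can be written as one plain block.  The blank first
line of the string is dropped because blank.lines.skip defaults to TRUE.

```r
# Sketch, assuming a version of R where read.table() has a 'text' argument
MyFrame <- read.table(text = "
    Sex    Respons
    Male   1
    Male   2
    Female 3
    Female 4
", header = TRUE)

# Show the column types that read.table() inferred
str(MyFrame)
```

No commas, quotation marks or backslashes are needed between the rows,
so the block looks the same as it would in a separate data file.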
 

Duncan Temple Lang suggested that the entire textfile should be
wrapped up as XML and parsed via the XML package.  In the context of
me and my students, I think that this would be overkill, and I also
think it necessarily breaks the one-file boundary condition, but in a
larger context it seems like excellent advice.

# Solution by Duncan Temple Lang

# Content of myFile.q
   <doc>
   <data>
    Sex    Response
    Male   1
    Male   2
    Female 3
    Female 4
   </data>

   <code>
    ......
   </code>
   </doc>

To read the data,

 library(XML)  # provides xmlTreeParse(), xmlRoot() and xmlValue()
 tr <- xmlRoot(xmlTreeParse("myFile.q"))
 read.table(textConnection(xmlValue(tr[["data"]])), header=TRUE)

and to access the code text

 xmlValue(tr[["code"]])




A number of approaches not based on textConnection() emerged, though.

Torsten Hothorn suggested that the data should be surrounded by some
kind of print statement that writes it to a temporary file.  Then
read.table() can be used to retrieve the data:

# Torsten Hothorn's solution:

  tmpfilename <- tempfile()
  tmpfile <- file(tmpfilename, 'w')
  cat(
      'Sex    Respons',
      'Male   1',
      'Male   2',
      'Female 3',
      'Female 4',
      file = tmpfile, sep='\n')
  close(tmpfile)
  read.table(tmpfilename, header = TRUE)
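
The same idea can be written a little more compactly.  The following is
only a sketch, not Torsten's code: it uses writeLines(), which opens and
closes the file itself, and it removes the temporary file afterwards
with unlink().

```r
# Sketch of the temporary-file approach, with cleanup at the end
tmpfilename <- tempfile()

# writeLines() writes one element per line and closes the file itself
writeLines(c(
    'Sex    Respons',
    'Male   1',
    'Male   2',
    'Female 3',
    'Female 4'
), tmpfilename)

MyFrame <- read.table(tmpfilename, header = TRUE)

# Delete the temporary file once the data have been read
unlink(tmpfilename)
```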



Barry Rowlingson suggested that the data should be written as a vector
of characters, and then shaped by hand:

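# Barry Rowlingson's solution

   data <- c(

             'Sex', 'Respons',
             'Male',   1,
             'Female', 2,
             'Male',   3,
             'Male',   2

             )

   ncol <- 2

   # The first ncol entries are the column names, the rest are the cases
   heads <- data[1:ncol]; data <- data[-(1:ncol)]
   nrow <- length(data)/ncol
   asDF <- data.frame(matrix(data, ncol = ncol, byrow = TRUE))

   # as.character() guards against the column having been made a factor
   asDF[,2] <- as.numeric(as.character(asDF[,2]))
   names(asDF) <- heads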
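
If this approach is used more than once, the reshaping can be hidden in
a small helper.  The sketch below is my own illustration, not Barry's
code: the function name vector_to_frame is hypothetical, and it relies
on type.convert() to guess the type of each column instead of converting
the columns one at a time.

```r
# Hypothetical helper: turn a row-wise character vector into a data frame.
# The first 'ncol' entries are taken as column names.
vector_to_frame <- function(x, ncol) {
    heads <- x[seq_len(ncol)]
    body  <- matrix(x[-seq_len(ncol)], ncol = ncol, byrow = TRUE)
    df    <- data.frame(body, stringsAsFactors = FALSE)
    names(df) <- heads
    # Let R guess the type of every column (numbers become numeric)
    df[] <- lapply(df, type.convert, as.is = TRUE)
    df
}

asDF <- vector_to_frame(c(
    'Sex', 'Respons',
    'Male',   1,
    'Female', 2,
    'Male',   3,
    'Male',   2
), ncol = 2)
```

The data stay in the readable row-wise layout, and only the call to the
helper needs to be explained to the reader.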


Finally, Thomas Blackwell and Greg Louis implemented a nice idea,
where the data are commented out in the textfile, but a call to
read.table() from within the file reads exactly those lines,
using a different convention for comments:

# Greg Louis' solution

   MyFrame <- read.table('myFile.q', header = TRUE,
                      skip = 28, nrows = 4, comment.char="")[-1]
   # Sex    Respons 
   # Male   1 
   # Male   2 
   # Female 3 
   # Female 4 

Exactly how many lines need to be skipped depends on the
circumstances. nrows is the number of cases in the dataframe.
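
Counting the lines to skip by hand is fragile, since the count changes
whenever the file is edited.  One possible way around this, which is my
own sketch and not something proposed in the thread, is to tag the data
lines with a special comment marker and pick them out by pattern.  The
marker '#D ' is a made-up convention, and the temporary file below only
stands in for the distributed script.

```r
# Stand-in for the distributed script; in practice this is the file itself
script <- tempfile(fileext = ".R")
writeLines(c(
    "# Some code and ordinary comments may come first",
    "x <- 1",
    "#D Sex    Respons",
    "#D Male   1",
    "#D Male   2",
    "#D Female 3",
    "#D Female 4",
    "y <- 2"
), script)

# Keep only the lines that carry the (hypothetical) '#D ' marker,
# and strip the marker from the front of each of them
lines      <- readLines(script)
data_lines <- sub("^#D ", "", grep("^#D ", lines, value = TRUE))

# Assumes a version of R where read.table() has a 'text' argument
MyFrame <- read.table(text = paste(data_lines, collapse = "\n"),
                      header = TRUE)
```

Since the marked lines are still comments to R, the file can be
source()'ed as usual, and neither skip nor nrows has to be maintained.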
 


The original request follows below.

Thank you all for participating.


Ernst Hansen
Department of Statistics
University of Copenhagen




Ernst Hansen writes:
 > I have the following problem.  It is not of earthshaking importance,
 > but still I have spent a considerable amount of time thinking about
 > it. 
 > 
 > PROBLEM: Is there any way I can have a single textfile that contains
 > both
 > 
 >   a) data
 > 
 >   b) programcode
 > 
 > The program should act on the data, if the textfile is source()'ed
 > into R.
 > 
 > 
 > BOUNDARY CONDITION: I want the data written in the textfile in exactly
 > the same format as I would use, if I had data in a separate textfile,
 > to be read by read.table().  That is, with 'horizontal inhomogeneity'
 > and 'vertical homogeneity' in the type of entries.  I want to write
 > something like 
 > 
 >       Sex    Respons
 >       Male   1
 >       Male   2
 >       Female 3
 >       Female 4
 > 
 > In effect, I am asking if there is some way I can convince
 > read.table(), that the data is contained in the following n lines of
 > text. 
 > 
 > 
 > ILLEGAL SOLUTIONS:
 > I know I can simulate the behaviour by reading the columns of the
 > dataframe one by one, and using data.frame() to glue them together.
 > Like in 
 > 
 >     data.frame(Sex = c('Male', 'Male', 'Female', 'Female'),
 >                Respons = c(1, 2, 3, 4))
 > 
 > I do not like this solution, because it represents the data in a
 > "transposed" way in the textfile, and this transposition makes the
 > structure of the dataframe less transparent - at least to me. It
 > becomes even less comprehensible if the Sex-factor above is written
 > with the help of rep() or gl() or the like.
 > 
 > I know I can make read.table() read from stdin, so I could type the
 > dataframe at the prompt.  That is against the spirit of the problem,
 > as I describe below.
 > 
 > 
 > I know I can make read.table() do the job, if I split the data and the
 > programcode into different files.  But as the purpose of the exercise
 > is to distribute the data and the code to other people, splitting
 > into several files is a complication.
 > 
 > 
 > MOTIVATION: I frequently find myself distributing small chunks of code
 > to my students, along with data on which the code can work.
 > 
 > As an example, I might want to demonstrate how model.matrix() treats
 > interactions, in a certain setting.  For that I need a dataframe that
 > is complex enough to exhibit the behaviour I want, but still so small
 > that the model.matrix is easily understood.  So I make such a
 > dataframe.
 > 
 > I am trying to distribute this dataframe along with my code, in a way
 > that is as simple as possible to USE for the students (hence the
 > one-file boundary condition) and to READ (hence the non-transposition
 > boundary condition).
 > 
 > 
 > 
 > Does anybody have any ideas?
 > 
 > 
 > Ernst Hansen
 > Department of Statistics
 > University of Copenhagen
 > 



