[R] reading a file incrementally

Scot W. McNary smcnary at charm.net
Tue Jun 3 04:51:42 CEST 2008


Hi,

I'm trying to read a file containing HTML markup (discussion board 
posts) and write the parts of each post (date, author, title, body) to 
fields of a record in an output file.  This is a one-off job and I'm 
trying to use R to do it. 

The file looks something like this:

<br><ul>Created: --- - Dr. Johnsons's article -
concerns<br><p>After writing some text..........</p><br>-- by
Anon.<br>
<br><ul>Created: --- - RE:Dr. Johnson's article - concerns<br><p>With
some advance notice about some text........<br>-- by
Anon.<br>

So the <br> tags indicate where the field entries begin and end, and 
"Created" and "by Anon" mark the beginning and end of each post.
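
To make that layout concrete, here is a minimal sketch that cuts a string
with this structure into posts using the two markers. The sample text and
the names txt and posts are made up for illustration, and it assumes every
"Created" is followed by exactly one "by Anon" before the next post starts.

```r
## Hypothetical sample with the same layout as the file described above
txt <- paste0(
  "<br><ul>Created: --- - First title - concerns<br>",
  "<p>Body one</p><br>-- by Anon.<br>\n",
  "<br><ul>Created: --- - RE:First title - concerns<br>",
  "<p>Body two<br>-- by Anon.<br>\n"
)

## Character positions of the start and end markers
starts <- gregexpr("Created", txt, fixed = TRUE)[[1]]
ends   <- gregexpr("by Anon", txt, fixed = TRUE)[[1]]

## substring() is vectorised over its positions: one post per start/end
## pair, extended so that the end marker itself is included
posts <- substring(txt, starts, ends + attr(ends, "match.length") - 1)
length(posts)
```

Each element of posts then runs from "Created" through "by Anon" and can be
split further on the <br> tags.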

The file is named "Module_1.txt". Here is what I have so far:

## adapted from http://finzi.psych.upenn.edu/R/Rhelp02a/archive/64261.html
## gives post beginning and ending points
starts <- gregexpr("Created",
                   readChar("Module_1.txt", file.info("Module_1.txt")$size))[[1]]

ends <- gregexpr("by Anon",
                 readChar("Module_1.txt", file.info("Module_1.txt")$size))[[1]]

## open connection
chk <- file("Module_1.txt", "r")
#seek(chk, origin = "start", ends[1]) # moves through file

## initialize an array
hold <- array(rep(NA, length(starts)))

## write a function to read the connection
catchtext <- function(start, end, source) {
    for(i in 1:length(starts)) {
        hold[i] <- readChar(chk, nchar = ends[i] - starts[i])
    }
}

#close(chk)

and my output is
 > hold
  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 
NA NA NA
 [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 
NA NA NA
 [51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 
NA NA NA
 [76] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 
NA NA NA
[101] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 
NA NA NA
[126] NA NA

At this point I'm just trying to get the entire file cut up into 
post-sized chunks.  Later I'll go through and output the separate bits 
into fields.  As can be seen, I'm having trouble moving through the 
connection to the places I want to read from it.  Suggestions welcome.
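
Three things about the code above seem relevant to the all-NA result:
catchtext() is defined but never called, an assignment to hold[i] inside a
function only changes a local copy of hold, and readChar() on an open
connection continues from the current position rather than from starts[i].
The following is a rough sketch of one way around these, not a tested
solution for the real file. The function name read_posts, its arguments and
the temporary sample file are all hypothetical, and it assumes the markers
come in non-overlapping pairs. It skips forward by reading and discarding
the text between posts, so seek() is not needed.

```r
## Hypothetical stand-in for Module_1.txt so the sketch runs on its own
path <- tempfile(fileext = ".txt")
writeLines(c(
  "<br><ul>Created: --- - First title - concerns<br><p>Body one</p><br>-- by Anon.<br>",
  "<br><ul>Created: --- - RE:First title - concerns<br><p>Body two<br>-- by Anon.<br>"
), path)

read_posts <- function(path, start_marker = "Created", end_marker = "by Anon") {
  ## Read the whole file once, working in bytes, to locate the markers
  txt    <- readChar(path, file.info(path)$size, useBytes = TRUE)
  starts <- gregexpr(start_marker, txt, fixed = TRUE, useBytes = TRUE)[[1]]
  ends   <- gregexpr(end_marker,   txt, fixed = TRUE, useBytes = TRUE)[[1]]
  if (starts[1] == -1) return(character(0))       # no posts found
  stopifnot(length(starts) == length(ends))       # assumed: markers are paired

  con <- file(path, "rb")                         # binary mode for readChar()
  on.exit(close(con))

  hold <- character(length(starts))
  pos  <- 1                                       # next unread byte (1-based)
  for (i in seq_along(starts)) {
    skip <- starts[i] - pos
    if (skip > 0) readChar(con, skip, useBytes = TRUE)  # discard the gap
    n <- ends[i] - starts[i] + nchar(end_marker, type = "bytes")
    hold[i] <- readChar(con, n, useBytes = TRUE)        # one whole post
    pos <- starts[i] + n
  }
  hold                                            # return instead of assigning globally
}

posts <- read_posts(path)
```

Because the result is returned, it would be used as
hold <- read_posts("Module_1.txt") rather than by filling a global array
from inside the function.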

Thanks,

Scot

 > version
               _                          
platform       i386-pc-mingw32            
arch           i386                       
os             mingw32                    
system         i386, mingw32              
status                                    
major          2                          
minor          6.2                        
year           2008                       
month          02                         
day            08                         
svn rev        44383                      
language       R                          
version.string R version 2.6.2 (2008-02-08)

-- 
Scot McNary
smcnary at charm dot net


