[R] read.delim()

Ben Bolker bbolker at gmail.com
Thu Jul 29 18:48:12 CEST 2010


Doran, Harold <HDoran <at> air.org> writes:

> 
> Thank you, Phil. Unfortunately, there are quotes used properly elsewhere. 
> ----- Original Message -----
> From: Phil Spector <spector <at> stat.berkeley.edu>
> To: Doran, Harold
> Cc: r-help <at> r-project.org <r-help <at> r-project.org>
> Sent: Wed Jul 28 18:29:32 2010
> Subject: Re: [R] read.delim()
> 
> Harold -
>     If there aren't any true quoted fields in the file, you 
> could  pass the quote="" option to read.delim().
> 
>  					- Phil Spector
>  					 Statistical Computing Facility
>  					 Department of Statistics
>  					 UC Berkeley
>  					 spector <at> stat.berkeley.edu
> 
> On Wed, 28 Jul 2010, Doran, Harold wrote:
> 
> > I am reading in a very large file with names in it and R is truncating the
> > number of rows it reads in. The separator in this file is a pipe '|' and
> > so I use
> >
> > dat <- read.delim('pathToMyFile', header= TRUE, sep='|')
> >
> > It turns out that it is reading up to row 61145 and stopping and I think I
> > see why, but am not sure of the best solution to this problem. I see the
> > name of the person in the next row has a quote in it, such as:
> >
> > Joe Sm"ith
> >
> > I *think* this is causing a problem in the read in. In fact, whenever I use
> >
> >
> >   tail(dat)
> >
> >   or dat[61145,]
> >
> > R crashes.
> >
> > But, it doesn't crash when I use head(dat) or index any other row.
> > I could change my raw data and manually delete this ". However, is
> > there another solution within the args of read.delim that would be
> > useful, such that I would not have to manually change my raw data?
> >
> > Harold


  Does R actually 'crash' (i.e., stop/segmentation fault/etc.)?
Or does it just give you an error message?

  Assuming that the bad cases are always represented by a *single*
quotation mark on a line, you could find them by reading in the
whole file with r <- readLines(...) [assuming the file is small enough
to suck into memory whole] and do something like

 sapply(strsplit(r,""),function(x) sum(x=="\""))

to find the bad lines (those where the count is 1).  There are
certainly many more pathological cases: what if there are (good)
paired quotes and (bad) unpaired quotes on the same line?  What if
there are two (bad) unpaired quotes on the same line?
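
  A minimal sketch of that check, assuming the file fits in memory.
The vector r below is made-up sample data standing in for
readLines('pathToMyFile'); the names nquote and bad are only
illustrative. It flags any line with an odd number of double quotes,
which covers the single-stray-quote case but none of the pathological
ones.

```r
# Hypothetical sample lines; in practice: r <- readLines('pathToMyFile')
r <- c("id|name",
       "1|Joe Smith",
       "2|Joe Sm\"ith",
       "3|\"Jane Doe\"")

# Count the double-quote characters on each line
nquote <- sapply(strsplit(r, ""), function(x) sum(x == "\""))

# An odd count means at least one unpaired quote on that line
bad <- which(nquote %% 2 == 1)

bad      # line numbers to inspect
r[bad]   # the offending lines themselves
```

With this sample, bad is 3, i.e. the "Joe Sm"ith" line. Keep in mind
that the header counts as line 1 here, so the data row number is one
less than the line number.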

  Sounds like it's time to do some manual editing.
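
  If the bad lines turn out to be few and simple, the edit itself
could be scripted rather than done by hand. The following is only a
sketch under a strong assumption: every line with an odd number of
quotes contains nothing but stray quotes, so all quotes on such a line
can be dropped. The sample data and the names r, nquote and bad are
made up, and the text= argument of read.delim() needs a reasonably
recent version of R.

```r
# Hypothetical sample lines; in practice: r <- readLines('pathToMyFile')
r <- c("id|name",
       "1|Joe Smith",
       "2|Joe Sm\"ith",
       "3|\"Jane Doe\"")

# Number of double quotes per line
nquote <- lengths(regmatches(r, gregexpr("\"", r, fixed = TRUE)))

# Assumption: an odd count means the quotes on that line are stray
bad <- nquote %% 2 == 1

# Drop the quotes on the flagged lines only; properly paired
# quotes on other lines are left alone
r[bad] <- gsub("\"", "", r[bad], fixed = TRUE)

# Parse the cleaned lines instead of the original file
dat <- read.delim(text = r, header = TRUE, sep = "|")
dat
```

It is worth looking at the flagged lines before overwriting anything,
since a line with both good and bad quotes would lose its good ones
too.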


