[Rd] Bug in read.table?
bbolker at gmail.com
Mon Nov 8 00:38:18 CET 2010
Ben Bolker <bbolker <at> gmail.com> writes:
> <jgarcia <at> ija.csic.es> writes:
> > Thanks. Yes, quote="" solves the problem.
> > I would never have guessed from the documentation, however, that this was
> > causing the duplicate records. Rather, I would have expected some kind of
> > warning/error message.
> > And, yes, I knew that R handles this specific problem gracefully through
> > duplicate(). I just thought this could be of interest to R-devel.
> A bit of a meta- point here: there may indeed be a bug here
> (it's the kind of obscure "corner case" that someone may not have
> tested), but it's unlikely to get noted as such and fixed unless you
> can come up with a clear analysis of what is happening and how the
> misinterpretation of quote characters is leading to duplication of
> records. (You, or someone else -- recognizing that this may be beyond
> your skill level. It might be that 'just' very careful thought
> and analysis of the behavior described in the documentation would
> explain this, or one might have to dig through source code in R or C.)
> Problems with unescaped/unrecognized quote characters are very
> Otherwise, this will likely be dismissed as a ("doctor, it hurts
> when I do this"; "well then, don't do that!") sort of situation.
> Ben Bolker
Following up on my own point:
The bottom line is that the internal readTableHead() command
handles newlines within quoted strings differently from scan().
A simpler file that replicates the problem starts with these two lines:
a b'c"d"e
f g'h"i"j
(I didn't want to try reading this from a textConnection;
escaping all the quotes properly would have driven me nuts.)
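For anyone who wants to try this, here is a minimal sketch in R. It assumes the test file consists of only the two lines implied by the readTableHead() output quoted later in this message (the real file may well have been longer), and it writes to a temporary file rather than "tmp3.txt". It only demonstrates the quote="" workaround mentioned at the top of the thread; what the default call returns is left for inspection, since that may depend on the R version.

```r
# Assumed contents: the two lines implied by the readTableHead() output
# quoted later in this message; the original file may have had more lines.
path <- tempfile(fileext = ".txt")
writeLines(c("a b'c\"d\"e", "f g'h\"i\"j"), path)

# Default quote = "\"'": the single quotes in mid-field may be treated as
# opening a quoted string. Wrapped in try() because the outcome (result,
# warning, or error) is not assumed here.
res_default <- try(read.table(path, header = FALSE), silent = TRUE)

# Workaround from earlier in the thread: disable quote handling entirely.
res_noquote <- read.table(path, header = FALSE, quote = "",
                          stringsAsFactors = FALSE)
print(res_noquote)
```

Comparing res_default with res_noquote (for example with str()) should show whether the installed version of R still misreads the file.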
One of the first things that happens in read.table is that
the first few lines are read with readTableHead:
  lines <- .Internal(readTableHead(file, nlines, comment.char,
                                   blank.lines.skip, quote, sep))
In this case, this reads the first two lines as one line:
the single quote at position 4 of the first line is closed by
the one at position 4 of the second line, which prevents the
first newline from ending the line.
R then pushes back two copies of the lines that have
been read (this is normal behavior; I don't quite follow the
The rest of the file is read with scan(), 1 line at a time.
However, there is a discrepancy between the way that
readTableHead interprets newlines in the middle of quoted
strings (it ignores them) and the way that scan() interprets
them (it takes them as the end of the quoted string).
In particular, if the file "tmp3.txt" is as shown above, then
readTableHead returns
[1] "a b'c\"d\"e\nf g'h\"i\"j"
(i.e. it grabs the first two lines, including the \n),
whereas scan(), reading one line, gives
Read 2 items
[1] "a" "b'c\"d\"e"
(it terminates the line in the middle of the string opened
by the single quote).
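A rough way to see the scan() side of this from the R prompt, again assuming the two-line file implied by the output above. The exact scan() call here is a guess at what produced the "Read 2 items" output, not a quote from the original session; readLines() is added only as a quote-unaware baseline.

```r
# Assumed two-line test file (reconstructed from the output above).
path <- tempfile(fileext = ".txt")
writeLines(c("a b'c\"d\"e", "f g'h\"i\"j"), path)

# Read one line of whitespace-separated character fields using scan()'s
# default quote set.
first_line <- scan(path, what = "", nlines = 1)
print(first_line)

# For comparison, readLines() does no quote processing at all and
# always splits on newlines.
raw_lines <- readLines(path)
print(length(raw_lines))
```

If scan() stops at the first newline here while read.table's header pass does not, that is the mismatch described above.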
I don't know what the consequences would be of changing
readTableHead to match scan()'s behavior, or how much
trouble it would be to do so.