[Rd] Bug in read.table?
pdalgd at gmail.com
Tue Nov 16 14:04:16 CET 2010
On Nov 16, 2010, at 02:59 , Ben Bolker wrote:
> Ben Bolker <bbolker <at> gmail.com> writes:
>> Ben Bolker <bbolker <at> gmail.com> writes:
>> Can simplify this still further:
>> a b'c
>> d e'f
>> g h'i
> This example file leads to duplicate lines when read with read.table().
> Arguably read.table() should behave analogously to scan():
> 1: a b'c
> 3: d e'f
> 5: g h'i
> 7: Read 6 items
>  "a" "b'c" "d" "e'f" "g" "h'i"
>>> One of the first things that happens in read.table is that
>>> the first few lines are read with readTableHead:
>>> lines <- .Internal(readTableHead(file, nlines, comment.char,
>>> blank.lines.skip, quote, sep))
>> In this case, readTableHead reads the first two lines as one line:
>> the single quote at position 4 of the first line is closed by the
>> one at position 4 of the second line, so the first newline does
>> not end the line.
>> R then pushes back two copies of the lines that have
>> been read (this is normal behavior; I don't quite follow the
>> The rest of the file is read with scan(), 1 line at a time.
>> However, there is a discrepancy between the way
>> that readTableHead interprets new lines in the middle of
>> quoted strings (it ignores them) and the way that scan()
>> interprets them (it takes them as the end of the quoted string).
> I think this counts as a small, but real, bug. Should I go ahead
> and report it as such, or would someone explain why it's not a bug?
I think filing it as a bug can be defended, but it is tricky to pinpoint exactly what the issue is. E.g., notice that adding a few spaces changes the behaviour of scan() considerably:
1: a b 'c
1: d e' f
5: g h' i
Read 7 items
 "a" "b" "c\nd e" "f" "g" "h'" "i"
(I'm confused... What is it that we really want here?)
Also, as you noted originally, beware the "Well don't do that then" aspect...
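For anyone who wants to compare the two readers side by side, here is a minimal sketch. The temporary file and the variable names are my own, not anything from R's sources, and the results of the default-quote calls may well differ between R versions, so I make no claim about what they print:

```r
# Write the three-line example to a temporary file (illustrative only).
tf <- tempfile(fileext = ".txt")
writeLines(c("a b'c", "d e'f", "g h'i"), tf)

# scan() with its default quote characters.
s_default <- scan(tf, what = "", quiet = TRUE)
print(s_default)

# read.table() with its default quote characters; wrapped in try()
# because the outcome depends on how the embedded quotes are handled.
d_default <- try(read.table(tf), silent = TRUE)
print(d_default)

# With quote processing disabled, the apostrophe is ordinary data.
s_noquote <- scan(tf, what = "", quote = "", quiet = TRUE)
d_noquote <- read.table(tf, quote = "")
print(s_noquote)
print(d_noquote)

unlink(tf)
```

With quote = "" both functions treat the apostrophe as plain text, so the file parses as three rows of two fields. That is the "don't do that then" workaround when the data are known to contain stray apostrophes.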
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com