[Rd] read.table() errors with tab as separator (PR#9061)

Prof Brian Ripley ripley at stats.ox.ac.uk
Wed Jul 5 13:26:41 CEST 2006


On Wed, 5 Jul 2006, Peter Dalgaard wrote:

> Prof Brian Ripley <ripley at stats.ox.ac.uk> writes:
>
>> On Wed, 5 Jul 2006, Peter Dalgaard wrote:
>>
>>> John.Maindonald at anu.edu.au writes:
>>>
>>>> (1) read.table(), with sep="\t", identifies 13 out of 1400 records,
>>>> in a file with 1400 records of 3 fields each, as having only 2 fields.
>>>> This happens under version 2.3.1 for Windows as well as with
>>>> R 2.3.1 for Mac OS X, and with R-devel under Mac OS X.
>>>> [R version 2.4.0 Under development (unstable) (2006-07-03 r38478)]
>>>>
>>>> (2) Using read.table() with sep="\t", only the first 1569 records
>>>> of an 1821-record file are input.  The file has exactly two fields
>>>> in each record, and the minimum length of the second field is
>>>> 1 character.  If however I extract lines 1561 to 1650 from the
>>>> file (the file "short.txt" below), all 90 lines are input.
>>>
>>> Notice that the single quote is a quote character in read.table (as
>>> opposed to read.delim, which uses only the double quote, to cater for
>>> TAB-separated files from Excel & friends).
>>>
>>>> [1] "865\tlinear model (lm)! Cook's distance\t152"
>>>                                   ^
>>>                                 !!!!
>>>
>>> (This reminds me that we probably should shift the default for
>>> comment.char too since it leads to similar issues, but it seems not to
>>> be the problem in this case.)
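
A minimal sketch of that quoting behaviour (the file name and the
second line are made up for illustration):

  writeLines(c("865\tlinear model (lm)! Cook's distance\t152",
               "866\tanalysis of variance (aov)! Tukey's HSD\t153"),
             "entries.txt")
  ## With read.table()'s default quote = "\"'", the two apostrophes pair
  ## up, so the tab and newline between them are swallowed and the field
  ## counts come out wrong:
  count.fields("entries.txt", sep = "\t")
  ## Disabling quoting, or using read.delim() (which quotes only on "),
  ## gives the expected 3 fields per record:
  count.fields("entries.txt", sep = "\t", quote = "")
  read.delim("entries.txt", header = FALSE)
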
>>
>> This seems to imply that we should change the default for 'quote':
>> to do so could break a lot of scripts.  (Given how long the default
>> has been comment.char="#", I doubt if we should change that either.)
>
> Sorry, that was unclear.  We already change quote= for read.delim and
> read.csv,
> and I was suggesting also to modify the default for comment.char for
> those functions, but definitely not for read.table.
>
> Arguably, those functions are there to handle file formats generated
> by other programs, and it is unlikely that such programs will generate
> comment lines starting with #, whereas we have learned that Excel will
> occasionally generate fields like #NULL#, which mess up the parsing.

Ah, that does seem a sensible defensive move.
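
A small illustration of the #NULL# problem (the file name and contents
are made up):

  writeLines(c("id\tscore\tgroup", "1\t#NULL#\tA", "2\t42\tB"),
             "excel.txt")
  ## With the default comment.char = "#", everything from the # onwards
  ## is dropped, so the first data line appears to have too few fields:
  try(read.table("excel.txt", header = TRUE, sep = "\t"))
  ## Turning comments off (as read.delim/read.csv would with the change
  ## suggested above) keeps all three columns:
  read.table("excel.txt", header = TRUE, sep = "\t", comment.char = "")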

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


