[R] A weird observation from using read.table

Charles C. Berry cberry at tajo.ucsd.edu
Thu Sep 27 20:53:34 CEST 2007


On Thu, 27 Sep 2007, Jun Ding wrote:

> Hi Everyone,
>
> Recently I got puzzled by the function read.table,
> even though I have used it for a long time.
>
> I have a file like this (tmp.txt, 2 rows and 3 columns,
> with the columns separated by a single space):
>
> 1 2'-PDE 4
> 2 3'-PDE 5
>
> if I do:
> a = read.table("tmp.txt", header = F, quote = "")
> a
>   V1     V2 V3
> 1  1 2'-PDE  4
> 2  2 3'-PDE  5
>
> Everything is fine.
>
> However, if I do:
> a = read.table("tmp.txt", header = F)
> a
>   V1     V2 V3
> 1  2 3'-PDE  5
> 2  1 2'-PDE  4
> 3  2 3'-PDE  5
>
> I know it is related to the "quote" as the default
> includes '. But how can it get one more row in the
> file? Thank you very much for your help in advance!


read.table does a lot of work trying to figure out what kind of data it
will see, and it runs preliminary checks before swallowing the whole
file. It reads the first 5 lines of data through a file() connection (or
fewer, if the file is shorter) and then uses pushBack() to put two copies
of those lines back on the connection. Then it rereads half of these and
skips the extra header row if there is one. At that point, it should be
positioned to read all of the data that was in the original file.
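
The push-back step can be tried on its own. This is only a rough sketch
of the idea, not the actual read.table source: it assumes a
textConnection in place of the file() connection, uses readLines() in
place of the internal head reader, and the variable names (con,
head_lines, and so on) are made up for illustration.

```r
# A two-line "file" held in memory; the contents are illustrative only.
con <- textConnection(c("1 a 4", "2 b 5"))

# Read up to 5 lines from the front of the connection (here only 2 exist).
head_lines <- readLines(con, 5)

# Push two copies of those lines back onto the connection.
pushBack(c(head_lines, head_lines), con)
n_pushed <- pushBackLength(con)   # number of lines currently pushed back

# Reread half of the pushed-back lines (the "inspection" pass) ...
first_pass <- readLines(con, length(head_lines))

# ... which leaves the second copy in place for the real read.
remaining <- readLines(con)

close(con)
print(n_pushed)
print(remaining)
```

If the line count taken from the head read is wrong, the "reread half"
step consumes the wrong number of lines, and the real read starts in the
wrong place.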

Declaring a quote character that is not really acting as a quote messes
this up badly. I think this happens because the internal function
readTableHead ignores newlines that fall between quotes. In your example,
all of the data is read by readTableHead as one line because of the quote
on the first line, and this has downstream consequences: the connection
is not repositioned at the right place. That leads to reading two copies
of the second line in your example.
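
To try this without touching a real file, something like the following
should do. It is a sketch under a few assumptions: the file is recreated
with tempfile() rather than being called tmp.txt, and the names tmp,
a_noquote and a_default are made up. The result of the default-quote call
may differ between R versions, so it is only printed here, not relied on.

```r
# Recreate the two-line file from the question in a temporary location.
tmp <- tempfile(fileext = ".txt")
writeLines(c("1 2'-PDE 4", "2 3'-PDE 5"), tmp)

# Quoting disabled: the apostrophes are treated as ordinary characters.
a_noquote <- read.table(tmp, header = FALSE, quote = "")
print(a_noquote)

# Default quote set, which includes ': this is the call that misbehaved
# in the question. Wrapped in try() because the behaviour is not
# guaranteed to be the same in every R version.
a_default <- try(suppressWarnings(read.table(tmp, header = FALSE)),
                 silent = TRUE)
if (!inherits(a_default, "try-error")) print(nrow(a_default))

unlink(tmp)
```

Whenever a field can contain a stray apostrophe, passing quote = "" (or
quote = "\"" if double quotes are still wanted) avoids the problem.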

If you want more details, use debug(read.table) and then run your
examples. Print 'lines', 'nlines', and 'pushBackLength(file)' at various
points in the execution of read.table and you can see what is happening.
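
A debugging session along those lines might be set up as below. The
browser commands are typed interactively, so they appear only as
comments, and the exact prompts and values will depend on the R version.

```r
# Flag read.table so that the next call steps through it in the browser.
debug(read.table)
flagged <- isdebugged(read.table)

# Run interactively:
#   a <- read.table("tmp.txt", header = FALSE)
# and at the browser prompt step with 'n', inspecting from time to time:
#   lines
#   nlines
#   pushBackLength(file)

# Turn debugging off again when finished.
undebug(read.table)
```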

HTH,

Chuck


>
> Jun
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

Charles C. Berry                            (858) 534-2098
                                             Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901


