[R] read.delim problem with trailing spaces

Wed Oct 6 15:41:42 CEST 2004

On Wed, 6 Oct 2004, Michael Friendly wrote:

> I'm trying to read a comma delimited dataset that uses '.' for NA.  I 
> found that if the last field on a line was a missing '.'
> it was not read as NA, but just a '.', and the life variable was made a 
> factor.  The data looks like this,
> 
> income,imr,region,oilexprt,imr80,gnp80,life
> Afghanistan,75,400.0,4,0,185.0,.,37.5
> Algeria,400,86.3,2,1,20.5,1920,50.7
> Argentina,1191,59.6,1,0,40.8,2390,67.1
> Australia,3426,26.7,4,0,12.5,9820,71.0
> Austria,3350,23.7,3,0,14.8,10230,70.4
> Bangladesh,100,124.3,4,0,139.0,120,.
> Belgium,3346,17.0,3,0,11.2,12180,70.6
> Benin,81,109.6,2,0,109.6,300,.
> Bolivia,200,60.4,1,0,77.3,570,49.7
> Brazil,425,170.0,1,0,84.0,2020,60.7
> Britain,2503,17.5,3,0,12.6,7920,72.0
> Burma,73,200.0,4,0,195.0,180,42.3
>   ...
> 
> and I used
> > nations <- read.delim("~/sasuser/data/nations2.dat", na.strings=".",
> +                       row.names=1, sep=",", header=TRUE)
> 
> > nations[1:10,]
>             income   imr region oilexprt imr80 gnp80 life
> Afghanistan     75 400.0      4        0 185.0    NA 37.5
> Algeria        400  86.3      2        1  20.5  1920 50.7
> Argentina     1191  59.6      1        0  40.8  2390 67.1
> Australia     3426  26.7      4        0  12.5  9820 71.0
> Austria       3350  23.7      3        0  14.8 10230 70.4
> Bangladesh     100 124.3      4        0 139.0   120   .
> Belgium       3346  17.0      3        0  11.2 12180 70.6
> Benin           81 109.6      2        0 109.6   300   .
> Bolivia        200  60.4      1        0  77.3   570 49.7
> Brazil         425 170.0      1        0  84.0  2020 60.7
> > summary(nations$life)
>   .  27.0 31.6 32.0 32.6 34.5 35.0 36.0 36.7 36.9 37.1 37.2 37.5 38.5 38.8 40.5
>    2    1    1    1    1    1    2    1    1    1    1    1    1    3    1    1
> 40.6 41.0 41.2 42.3 43.5 43.7 44.9 45.1 46.8 47.5 47.6 49.0 49.7 49.9 50.0 50.5
>    1    6    1    4    1    1    1    1    1    3    1    3    1    1    2    1
> 
> 
> After much hair-pulling, I discovered that the data lines for Bangladesh
> and Benin contained a trailing space after the '.'.  Removing those made
> the problem go away, but that shouldn't happen and I wonder if this is
> still a potential problem for others.  I'm using R 1.8.1.

It should happen.  The entry there is ". " and that is not an NA string.
If you use a non-whitespace delimiter, all whitespace is significant.
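
A minimal sketch of the point above, assuming a current version of R in
which read.table() accepts the 'text' and 'strip.white' arguments.  The
four-line sample, the added 'country' header field and the object names
csv, raw and fixed are made up for illustration and are not from the
original file.  With sep="," the field ". " keeps its trailing space and
so does not match na.strings=".", whereas strip.white=TRUE removes
surrounding whitespace from unquoted fields before the NA test is made.

```r
# Hypothetical sample in the same layout as the posted data; the Bangladesh
# row ends in ". " (a dot followed by a trailing space).
csv <- c("country,income,imr,life",
         "Algeria,400,86.3,50.7",
         "Bangladesh,100,124.3,. ",
         "Belgium,3346,17.0,70.6")

# As in the original call: ". " is not equal to the NA string ".",
# so the life column cannot be converted to numeric.
raw <- read.table(text = csv, sep = ",", header = TRUE,
                  row.names = 1, na.strings = ".")

# strip.white = TRUE trims unquoted fields first, so ". " becomes "."
# and is then recognised as NA.
fixed <- read.table(text = csv, sep = ",", header = TRUE,
                    row.names = 1, na.strings = ".", strip.white = TRUE)

str(raw$life)    # non-numeric column (factor or character, depending on R version)
str(fixed$life)  # numeric column with one NA
```

Whether stripping whitespace is appropriate depends on the data: it also
trims unquoted character fields in which leading or trailing spaces may
be intentional.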

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595