[R] Reading .csv file under linux

Wed Jan 23 22:41:24 CET 2008

Just as an update on encoding (which may or may not be of interest): I 
changed the read.csv command for the three .csv files I was reading to 
specify the encoding as

encoding="CP1252"

and all three files were read without problems on Linux. Last night I 
moved the analysis back onto my Windows machine, and one of the reads 
stopped partway through with a message about illegal characters. I 
checked around where the read stopped but couldn't see what the problem 
was. Dropping the encoding argument to "file" worked around the problem.

I now have an if/else that tests which system I am on. Painful, but at 
least it is system independent.
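
For anyone wanting to do the same, the test can be sketched roughly as 
follows. This is only a minimal sketch under assumptions: 
read_csv_portable is a hypothetical helper name (not part of R), and it 
assumes, as in my case, that the file is in CP1252 and that Windows reads 
it correctly when no encoding is given.

```r
# Hypothetical helper (not part of R): choose how to open the file
# depending on the operating system R is running on.
read_csv_portable <- function(path, encoding = "CP1252", ...) {
  if (.Platform$OS.type == "windows") {
    # Assumption: on Windows the native encoding already matches the
    # file, so no encoding is specified.
    read.csv(path, ...)
  } else {
    # Elsewhere (e.g. Linux with LANG=en_US.UTF-8), re-encode while
    # reading. read.csv opens and closes the unopened connection itself.
    read.csv(file(path, encoding = encoding), ...)
  }
}
```

The original call would then become something like 
read_csv_portable("../Patients.csv", header = FALSE, col.names = patientsNames), 
with the remaining arguments passed through to read.csv unchanged.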

Thanks again

David

On Tue, 22 Jan 2008, Prof Brian Ripley wrote:

> On Wed, 23 Jan 2008, David Scott wrote:
>
>> On Tue, 22 Jan 2008, Prof Brian Ripley wrote:
>> 
>>> On Wed, 23 Jan 2008, David Scott wrote:
>>> 
>>>> 
>>>> I have encountered a problem with reading a .csv file on a Linux box. I
>>>> can read the file on my Windows machine (under XP), but on the Linux box
>>>> it gives:
>>>> 
>>>>> patients <- read.csv("../Patients.csv", header = FALSE,
>>>> +                      col.names = patientsNames)
>>>> Error in type.convert(data[[i]], as.is = as.is[i], dec = dec,
>>>> na.strings = character(0)) :
>>>>   invalid multibyte string
>>>> Calls: read.csv -> read.table -> type.convert
>>>> Execution halted
>>>> 
>>>> I am running R 2.6.1 on both machines. I tried on another Linux box
>>>> running 2.5.1 and got the same problem.
>>>> 
>>>> I am guessing it is something to do with the character encoding. On the
>>>> Linux box I have
>>>> 
>>>> LANG=en_US.UTF-8
>>> 
>>> So what encoding is the .csv file in?  Consider the example at the end of 
>>> ?file
>>>
>>>     ## examples of use of encodings
>>>     cat(x, file = file("foo", "w", encoding="UTF-8"))
>>>     # read a 'Windows Unicode' file including names
>>>     A <- read.table(file("students", encoding="UCS-2LE"))
>>> 
>>> and adapt accordingly (encoding = "CP1252" is the most likely value if 
>>> this works in English-language Windows).
>>> 
>> 
>> 
>> Thanks Brian for the super-quick, super-helpful reply. The encoding you 
>> suggested worked.
>> 
>> I found a workaround myself too---I guessed that some plus/minus signs 
>> might be the problem and replaced them and could read in the file.
>> That is just a kludge so I am using the encoding specification.
>> 
>> I am a total dunce when it comes to encodings though. How do you find the 
>> encoding of a file?
>
> You ask the person who gave it to you.  You can't in general tell, and e.g. 
> ISO-8859-1 and ISO-8859-2 are only distinguishable by someone who can read 
> the contents (if it is a human language).  If you have just the odd symbol 
> (e.g. degree sign or plus/minus) you can be completely stuck.
>
> 'file' on Linux can usually guess if a file is UTF-8 or ISO-8859-?, but not 
> of course what ? is.  But guesses are based on statistical patterns and are 
> good for text but not so good for data.
>
> -- 
> Brian D. Ripley,                  ripley at stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
>

_________________________________________________________________
David Scott	Department of Statistics, Tamaki Campus
 		The University of Auckland, PB 92019
 		Auckland 1142,    NEW ZEALAND
Phone: +64 9 373 7599 ext 86830		Fax: +64 9 373 7000
Email:	d.scott at auckland.ac.nz

Graduate Officer, Department of Statistics
Director of Consulting, Department of Statistics