[R] retaining characters in a csv file

Daniel Nordlund djnordlund at frontier.com
Wed Sep 23 21:05:33 CEST 2015


On 9/23/2015 5:57 AM, Therneau, Terry M., Ph.D. wrote:
> Thanks to all for the comments; I hadn't intended to start a war.
>
> My summary:
>   1. Most important: I wasn't missing something obvious.  This is 
> always my first suspicion when I submit something to R-help, and it's 
> true more often than not.
>
>   2. Obviously (at least it is now), the CSV standard does not specify 
> that quotes should force a character result.  R is not "wrong".  Wrt 
> using what Excel does as a litmus test, I consider that to be totally 
> uninformative about standards: neither pro (like Duncan) nor anti (like 
> Rolf), but simply irrelevant.  (Like many MS choices.)
>
>   3. I'll have to code in my own solution, either pre-scan the first 
> few lines to create a colClasses, or use read_csv from the readr 
> library (if there are leading zeros it keeps the string as character, 
> which may suffice for my needs), or something else.
>
>   4. The source of the data is a "text/csv" field coming from an http 
> POST request.  This is an internal service on an internal Mayo server 
> and coded by our own IT department; this will not be the first case 
> where I have found that their definition of "csv" is not quite standard.
>
> Terry T.
>
>
>
>> On 23/09/15 10:00, Therneau, Terry M., Ph.D. wrote:
>>> I have a csv file from an automatic process (so this will happen
>>> thousands of times), for which the first row is a vector of variable
>>> names and the second row often starts something like this:
>>>
>>> 5724550,"000202075214",2005.02.17,2005.02.17,"F", .....
>>>
>>> Notice the second variable which is
>>>        a character string (note the quotation marks)
>>>        a sequence of numeric digits
>>>        leading zeros are significant
>>>
>>> The read.csv function insists on turning this into a numeric. Is there
>>> any simple set of options that
>>> will turn this behavior off?  I'm looking for a way to tell it to "obey
>>> the bloody quotes" -- I still want the first, third, etc columns to
>>> become numeric.  There can be more than one variable like this, and not
>>> always in the second position.
>>>
>>> This happens deep inside the httr library; there is an easy way for me
>>> to add more options to the read.csv call but it is not so easy to
>>> replace it with something else.
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

A fairly simple workaround is to add two lines of code to the process, 
and then add the colClasses parameter as you suggested in item 3 above.

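# Pass 1: with quoting disabled, quoted fields keep their quote marks and are typed as character
first <- read.csv('yourfile', quote='', stringsAsFactors=FALSE, nrows=1)
classes <- unname(sapply(first, class))  # unnamed, so classes are matched by position
# Pass 2: normal quoting, with the column classes fixed from pass 1
want <- read.csv('yourfile', stringsAsFactors=FALSE, colClasses=classes)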

I don't know whether you want strings converted to factors in the final 
data frame, so modify as needed.  In addition, if your files aren't as 
regular as I inferred, you can increase nrows in the first read.csv call 
to be more sure of getting the classes right.
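
For what it's worth, here is a self-contained sketch of the same two-pass 
idea that can be pasted into a session.  The column names and the second 
data row are invented for illustration; only the first data row follows 
the example quoted above, and the text= argument stands in for the real 
file.

```r
# Sample data: header names and the second row are made up for this sketch.
csv_text <- paste(
  'id,code,start,end,sex',
  '5724550,"000202075214",2005.02.17,2005.02.17,"F"',
  '5724551,"000000001234",2005.03.01,2005.03.02,"M"',
  sep = "\n"
)

# Pass 1: quote = "" leaves the quote marks in the data, so a quoted field
# cannot be parsed as a number and comes back as character.
probe <- read.csv(text = csv_text, quote = "", stringsAsFactors = FALSE,
                  nrows = 1)
classes <- unname(vapply(probe, function(col) class(col)[1], character(1)))

# Pass 2: normal quoting, with the classes from pass 1 applied by position.
want <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                 colClasses = classes)

str(want)
want$code  # the leading zeros are kept because the column is character
```

One caveat: if a quoted field can contain a comma, the first pass will 
split it into two fields and the class vector will no longer line up, so 
this only suits files where quoted fields never contain the separator.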


Hope this is helpful,

Dan

-- 
Daniel Nordlund
Bothell, WA  USA


