[R] CSV value not being read as it appears

David Scott d.scott at auckland.ac.nz
Sat Jan 15 01:11:59 CET 2011


I am a bit confused by this. You are doing a transfer from Excel (.xls 
or .xlsx) to .csv, then a subset in R and ending up with a couple of 
entries which are " Open" rather than "Open". So where are they coming 
from? You say they are not in the original Excel, so that suggests the 
transfer to .csv is the problem. I would be very surprised if the subset 
was a problem, but as others have indicated transfer to .csv can be 
downright ugly.

You can check the .csv file by opening it in an editor (I use Emacs). 
Just go to the line and have a look if the extra space is there nestling 
between two commas.

The other advice is, don't go through .csv. Go directly from Excel to R. 
My favourite tools are RODBC and xlsReadWrite for that step. Both work 
extremely well.

As others have indicated, the big bugbear in the .csv route is dates, or 
what Excel decides are dates. My experience was the conversion of New 
Zealand health ID numbers to dates. They are three letters then 4 
digits, so AUG2699 became a date.

David Scott

On 14/01/2011 10:58 p.m., bgreen at dyson.brisnet.org.au wrote:
> Hello David,
>
> Thanks for your e-mail. The data was a report derived from a statewide
> database, saved in EXCEL format, so the usual issue of the vagaries of
> human data entry variation wasn't the issue as the data was an automated
> report, which is run every three months. I would not have even noticed
> this problem if I hadn't been double checking the numbers of people by
> district. Visual inspection didn't reveal this problem - no white space
> was obvious and the spelling was identical. Tabulation via R wouldn't have
> detected this - I was obtaining the EXCEL totals via filter which I then
> compared with R output. I'm hoping I can skip this step, in future, with
> Jim's suggestion.
>
> regards
>
> Bob
>
>
>> As a further note, this is a reminder that whenever you get data via a
>> spreadsheet the first thing to do is examine it and clean up any
>> problems. A basic requirement is to tabulate any categorical variable.
>> Spreadsheets allow any sort of data to be entered, with no controls. My
>> experience is that those who enter data into spreadsheets enter all
>> sorts of variations of what a human would wish to treat as the same
>> ("Open", "Open  ", "open", etc.), even when told not to.
>>
>> David Scott
>>
>> On 14/01/2011 4:03 p.m., Jim Holtman wrote:
>>> try strip.white=TRUE to strip out white space
>>>
>>> Sent from my iPad
>>>
>>> On Jan 13, 2011, at 21:44, bgreen at dyson.brisnet.org.au wrote:
>>>
>>>>
>>>> I have a frustrating issue which I am hoping someone may have a
>>>> suggestion
>>>> about.
>>>>
>>>> I am running XP and R 2.12.0 and saved an EXCEL file that I was sent as
>>>> a
>>>> csv file.
>>>>
>>>> The initial code I ran follows.
>>>>
>>>> dec<- read.csv("g://FMH/FO30122010.csv",header=T)
>>>> dec.open<- subset (dec, Status == "Open")
>>>> table(dec.open$AMHS)
>>>>
>>>> I was checking the output and noticed a difference between my manual
>>>> count
>>>> and R output. Two subject's rows were not being detected by the subset
>>>> command:
>>>>
>>>> For the AMHS where there was a discrepancy I then ran:
>>>> wm<- subset (dec, AMHS == "WM")
>>>>
>>>> The problem appears to be that there is a space before the 'Open" value
>>>> for two indivduals, as per the example below.
>>>>
>>>> 10/02/2010  Open
>>>> 22/08/2007   Open
>>>>
>>>> Checking in EXCEL there does not appear to be a space and the format is
>>>> the same (e.g 'general').  I resolved the problem by copying over the
>>>> values for the two individuals where I identified  a problem.
>>>>
>>>> Given this problem was not detected by visual scanning I would
>>>> appreciate
>>>> advice on how this problem can be detected in future without my having
>>>> to
>>>> manually check raw data against R output.
>>>>
>>>> Any assistance is appreciated,
>>>>
>>>> Bob
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>> --
>> _________________________________________________________________
>> David Scott	Department of Statistics
>> 		The University of Auckland, PB 92019
>> 		Auckland 1142,    NEW ZEALAND
>> Phone: +64 9 923 5055, or +64 9 373 7599 ext 85055
>> Email:	d.scott at auckland.ac.nz,  Fax: +64 9 373 7018
>>
>> Director of Consulting, Department of Statistics
>>
>>
>


-- 
_________________________________________________________________
David Scott	Department of Statistics
		The University of Auckland, PB 92019
		Auckland 1142,    NEW ZEALAND
Phone: +64 9 923 5055, or +64 9 373 7599 ext 85055
Email:	d.scott at auckland.ac.nz,  Fax: +64 9 373 7018

Director of Consulting, Department of Statistics



More information about the R-help mailing list