[R] CSV value not being read as it appears

bgreen at dyson.brisnet.org.au bgreen at dyson.brisnet.org.au
Sat Jan 15 10:47:37 CET 2011


David,

Thanks. When I am back at work I will try to find out some specifics
regarding the original data base and how the reports are generated. The
differencs are not apparent via manual inspection.

I will look at the csv file in an editor as well and look into xlsReadWrite.

I agree EXCEL formatting of dates can be a problem.

regards

Bob

> I am a bit confused by this. You are doing a transfer from Excel (.xls
> or .xlsx) to .csv, then a subset in R and ending up with a couple of
> entries which are " Open" rather than "Open". So where are they coming
> from? You say they are not in the original Excel, so that suggests the
> transfer to .csv is the problem. I would be very surprised if the subset
> was a problem, but as others have indicated transfer to .csv can be
> downright ugly.
>
> You can check the .csv file by opening it in an editor (I use Emacs).
> Just go to the line and have a look if the extra space is there nestling
> between two commas.
>
> The other advice is, don't go through .csv. Go directly from Excel to R.
> My favourite tools are RODBC and xlsReadWrite for that step. Both work
> extremely well.
>
> As others have indicated, the big bugbear in the .csv route is dates, or
> what Excel decides are dates. My experience was the conversion of New
> Zealand health ID numbers to dates. They are three letters then 4
> digits, so AUG2699 became a date.
>
> David Scott
>
> On 14/01/2011 10:58 p.m., bgreen at dyson.brisnet.org.au wrote:
>> Hello David,
>>
>> Thanks for your e-mail. The data was a report derived from a statewide
>> database, saved in EXCEL format, so the usual issue of the vagaries of
>> human data entry variation wasn't the issue as the data was an automated
>> report, which is run every three months. I would not have even noticed
>> this problem if I hadn't been double checking the numbers of people by
>> district. Visual inspection didn't reveal this problem - no white space
>> was obvious and the spelling was identical. Tabulation via R wouldn't
>> have
>> detected this - I was obtaining the EXCEL totals via filter which I then
>> compared with R output. I'm hoping I can skip this step, in future, with
>> Jim's suggestion.
>>
>> regards
>>
>> Bob
>>
>>
>>> As a further note, this is a reminder that whenever you get data via a
>>> spreadsheet the first thing to do is examine it and clean up any
>>> problems. A basic requirement is to tabulate any categorical variable.
>>> Spreadsheets allow any sort of data to be entered, with no controls. My
>>> experience is that those who enter data into spreadsheets enter all
>>> sorts of variations of what a human would wish to treat as the same
>>> ("Open", "Open  ", "open", etc.), even when told not to.
>>>
>>> David Scott
>>>
>>> On 14/01/2011 4:03 p.m., Jim Holtman wrote:
>>>> try strip.white=TRUE to strip out white space
>>>>
>>>> Sent from my iPad
>>>>
>>>> On Jan 13, 2011, at 21:44, bgreen at dyson.brisnet.org.au wrote:
>>>>
>>>>>
>>>>> I have a frustrating issue which I am hoping someone may have a
>>>>> suggestion
>>>>> about.
>>>>>
>>>>> I am running XP and R 2.12.0 and saved an EXCEL file that I was sent
>>>>> as
>>>>> a
>>>>> csv file.
>>>>>
>>>>> The initial code I ran follows.
>>>>>
>>>>> dec<- read.csv("g://FMH/FO30122010.csv",header=T)
>>>>> dec.open<- subset (dec, Status == "Open")
>>>>> table(dec.open$AMHS)
>>>>>
>>>>> I was checking the output and noticed a difference between my manual
>>>>> count
>>>>> and R output. Two subject's rows were not being detected by the
>>>>> subset
>>>>> command:
>>>>>
>>>>> For the AMHS where there was a discrepancy I then ran:
>>>>> wm<- subset (dec, AMHS == "WM")
>>>>>
>>>>> The problem appears to be that there is a space before the 'Open"
>>>>> value
>>>>> for two indivduals, as per the example below.
>>>>>
>>>>> 10/02/2010  Open
>>>>> 22/08/2007   Open
>>>>>
>>>>> Checking in EXCEL there does not appear to be a space and the format
>>>>> is
>>>>> the same (e.g 'general').  I resolved the problem by copying over the
>>>>> values for the two individuals where I identified  a problem.
>>>>>
>>>>> Given this problem was not detected by visual scanning I would
>>>>> appreciate
>>>>> advice on how this problem can be detected in future without my
>>>>> having
>>>>> to
>>>>> manually check raw data against R output.
>>>>>
>>>>> Any assistance is appreciated,
>>>>>
>>>>> Bob
>>>>>
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>>
>>> --
>>> _________________________________________________________________
>>> David Scott	Department of Statistics
>>> 		The University of Auckland, PB 92019
>>> 		Auckland 1142,    NEW ZEALAND
>>> Phone: +64 9 923 5055, or +64 9 373 7599 ext 85055
>>> Email:	d.scott at auckland.ac.nz,  Fax: +64 9 373 7018
>>>
>>> Director of Consulting, Department of Statistics
>>>
>>>
>>
>
>
> --
> _________________________________________________________________
> David Scott	Department of Statistics
> 		The University of Auckland, PB 92019
> 		Auckland 1142,    NEW ZEALAND
> Phone: +64 9 923 5055, or +64 9 373 7599 ext 85055
> Email:	d.scott at auckland.ac.nz,  Fax: +64 9 373 7018
>
> Director of Consulting, Department of Statistics
>
>



More information about the R-help mailing list