[Rd] Please make Pre-3.1 read.csv (type.convert) behavior available

Duncan Murdoch murdoch.duncan at gmail.com
Sat Apr 26 18:50:15 CEST 2014


On 26/04/2014, 12:28 PM, Tom Kraljevic wrote:
>
> Hi Duncan,
>
>
> Please allow me to add a bit more context, which I probably should have
> added to my original message.
>
> We actually did see this in an R 3.1 beta, which was pulled in by an
> apt-get; we assumed it had been released accidentally.  From my user
> perspective, parsing a string like “1.2345678901234567890” into a
> factor was so surprising that I assumed it was just a really bad bug
> that would be fixed before the “real” release.  I didn’t bother
> reporting it since I assumed beta users would be heavily impacted and
> there was no way it wouldn’t be fixed.  Apologies for that mistake on
> my part.

The beta stage is quite late.  There's a non-zero risk that a bug 
detected during the beta stage will make it through to release, 
especially if the report doesn't arrive until after we've switched to 
release candidates.

This change was made very early in the development cycle of 3.1.0, back 
in March 2013.  If you are making serious use of R, I'd really recommend 
that you try out some of the R-devel versions early, when design 
decisions are being made.  I suspect this feature would have been 
changed if we'd heard your complaints then.  It'll likely still be 
changed, but it is harder now, because some users already depend on the 
new behaviour.

>
> After discovering that this new behavior had really been released GA,
> I went searching to see what was going on.  I found this bug, which
> states: “If you wish to express your opinion about the new behavior,
> please do so on the R-devel mailing list.”
>
> https://bugs.r-project.org/bugzilla/show_bug.cgi?id=15751

Actually it isn't the bug that said that, it was Simon :-).  If you look 
up some of his other posts on this topic here in the R-devel list, 
you'll see a couple of proposals for changes.

Duncan Murdoch

>
> So I’m sharing my opinion, as suggested.  Thanks to all for the time
> spent reading it.
>
>
> Let me also say, we are huge fans of R; many of our customers use R,
> and we greatly appreciate the efforts of the R core team.  We are in
> the process of contributing an H2O package back to the R community;
> thanks to the CRAN moderators, as well, for their assistance in this
> process.  CRAN is a fantastic resource.
>
>
> I would like to share a little more insight into how this behavior
> affects us in particular.  These points have probably already been
> debated, but let me state them here again to provide the appropriate
> context.
>
> 1.  When dealing with larger and larger data, things become cumbersome.
> Your comment that specifying column types would work is true.  But when
> there are thousands of columns, specifying them one by one becomes more
> and more of a burden, and it becomes easier to make a mistake.  And
> when you do make a mistake, you can imagine a tool writer choosing to
> just “do what it’s told” and swallowing the mistake.  (Trying not to be
> smarter than the user.)
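>
> For illustration, a minimal sketch of what specifying types up front
> looks like (the file name "wide.csv" and the column count here are
> hypothetical):
>
> # a single class is recycled across all columns
> df <- read.csv("wide.csv", colClasses = "numeric")
>
> # mixing types by position means building and maintaining a long vector
> cc <- rep("numeric", 5000)
> cc[1] <- "character"   # say, an ID column
> df <- read.csv("wide.csv", colClasses = cc)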
>
> 2.  When working with datasets that have more and more rows, sometimes
> there is a bad row.  Big data is messy.  Having one bad value in one
> bad row contaminate the entire dataset can be undesirable for some.
> When you have millions of rows or more, each row becomes less precious.
> Many people would rather just ignore the effects of the bad row than
> try to fix it, especially in this case, where “bad” means a bit of
> extra precision that likely won’t have a negative impact on the result.
> (In our case, this extra precision was the output of Java’s
> Double.toString().)
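>
> The difference at the heart of this thread can be seen with a one-line
> sketch (type.convert is what read.csv uses under the hood):
>
> type.convert("1.2345678901234567890")
> # pre-3.1:  numeric; digits beyond double precision are silently dropped
> # R 3.1.0:  a factor, because the string does not convert exactly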
>
> Our users want to use R as a driver language and a reference tool.
> Being able to interchange data easily (even just snippets) between
> tools is very valuable.
>
>
> Thanks,
> Tom
>
>
> Below is an example that creates a million-row dataset which works
> fine (parses as a numeric), and then shows how adding just one bad row
> (which still *looks* numeric!) flips the entire column to a factor.
> Finding that one row out of a million-plus can be quite a challenge
> (a sketch of one way to hunt for it follows the output below).
>
>
> # Script to generate dataset.
> $ cat genDataset.py
> #!/usr/bin/env python
>
> for x in range(0, 1000000):
>      print (str(x) + ".1")
>
> # Generate the dataset.
> $ ./genDataset.py > million.csv
>
> # R 3.1 thinks it’s a numeric.
> $ R
>  > df = read.csv("million.csv")
>  > str(df)
> 'data.frame':999999 obs. of  1 variable:
>   $ X0.1: num  1.1 2.1 3.1 4.1 5.1 6.1 7.1 8.1 9.1 10.1 ...
>
> # Add one more over-precision row.
> $ echo "1.2345678901234567890" >> million.csv
>
> # Now R 3.1 thinks it’s a factor.
> $ R
>  > df2 = read.csv("million.csv")
>  > str(df2)
> 'data.frame':1000000 obs. of  1 variable:
>   $ X0.1: Factor w/ 1000000 levels "1.1","1.2345678901234567890",..: 1
> 111113 222224 333335 444446 555557 666668 777779 888890 3 ...
>
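> As promised above, a minimal sketch of one way to hunt for the
> offending row.  It uses a heuristic (more significant digits than a
> double can hold) rather than the exact round-trip test that
> type.convert applies:
>
> x <- read.csv("million.csv", colClasses = "character")[[1]]
> suspects <- which(nchar(gsub("[^0-9]", "", x)) > 15)
> x[suspects]   # "1.2345678901234567890"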
>
>
>
>
> On Apr 26, 2014, at 4:28 AM, Duncan Murdoch
> <murdoch.duncan at gmail.com> wrote:
>
>> On 26/04/2014, 12:23 AM, Tom Kraljevic wrote:
>>>
>>> Hi,
>>>
>>> We at 0xdata use Java and R together, and the new behavior for
>>> read.csv has made R unable to read the output of Java’s
>>> Double.toString().
>>
>> It may be less convenient, but it's certainly not "unable".  Use
>> colClasses.
>>
>>
>>>
>>> This, needless to say, is disruptive for us.  (Actually, it was
>>> downright shocking.)
>>
>> It wouldn't have been a shock if you had tested pre-release versions.
>> Commercial users of R should be contributing to its development, and
>> that's a really easy way to do so.
>>
>> Duncan Murdoch
>>
>>>
>>> +1 for restoring old behavior.
>>
>>
>>
>


