[R] Data cleaning & Data preparation, what do R users want?

Jim Lemon drjimlemon at gmail.com
Thu Nov 30 01:00:46 CET 2017


Hi again,
Typo in the last email. Should read "about 40 standard deviations".
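
For anyone who wants to retrace the checks described in the quoted message
below (look at a summary, range check, distance from the mean in standard
deviations, then the imperial-for-metric guess), here is a minimal sketch.
The data vector, the upper bound of 2.8 m, the reference mean and SD, and
the helper feet_inches_to_m() are all assumptions made up for illustration,
so the exact number of standard deviations depends on them.

```r
# Hypothetical data: heights in metres, with one suspicious entry.
heights <- c(1.58, 1.63, 1.71, 1.55, 5.2, 1.66)

# Step 1: look at the data before deleting anything.
print(summary(heights))

# Step 2: range check against an assumed upper bound (2.8 m, chosen to
# sit above any recorded human height).
max_plausible <- 2.8
suspect <- heights > max_plausible

# Step 3: how far out is the suspect value? The reference mean and SD are
# assumed values for adult women, not estimates from this sample.
ref_mean <- 1.62
ref_sd <- 0.09
z <- (heights - ref_mean) / ref_sd
print(round(z[suspect], 1))

# Step 4: the Sherlock Holmes step. Read 5.2 as 5 feet 2 inches and
# convert to metres (hypothetical helper; it only handles inches 0 to 9
# written as a single decimal digit).
feet_inches_to_m <- function(x) {
  feet <- floor(x)
  inches <- round((x - feet) * 10)
  (feet * 12 + inches) * 0.0254
}
candidate <- feet_inches_to_m(heights[suspect])
print(candidate)

# Keep the original value alongside the proposed correction so that
# nothing is silently lost.
checked <- data.frame(recorded = heights, suspect = suspect)
checked$proposed <- ifelse(suspect, feet_inches_to_m(heights), heights)
print(checked)
```

The point of the last step is that the suspect value is flagged and a
correction is proposed, but the recorded value stays in the data frame for
a human to confirm.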

Jim

On Thu, Nov 30, 2017 at 10:54 AM, Jim Lemon <drjimlemon at gmail.com> wrote:
> Hi Robert,
> People want different levels of automation in the software they use.
> What concerns many of us is the desire for the function
> "figure-out-what-this-data-is-import-it-and-get-rid-of-bad-values".
> Such users typically want something that justifies its use by being
> written by someone who seems to know what they're doing and by being
> used by lots of other people. One advantage of many R functions is their
> modular construction. This encourages users to at least consider the
> steps that are taken rather than just accept what comes out of that
> long tube.
>
> Take the contentious problem of outlier identification. If I just let
> the black box peel off some values, I don't know what I have lost. On
> the other hand, if I import data and examine it with a summary
> function, I may find that one woman has a height of 5.2 meters. I can
> range check by looking up the Guinness Book of Records. It's an
> outlier. I can estimate the probability of such a height.  Hmm, about
> 4 standard deviations above the mean. It's an outlier. I can attempt a
> Sherlock Holmes. "Watson, I conclude that an imperial measure (5'2")
> has been recorded as a metric value". It's not an outlier.
>
> The more R gravitates toward "black box" functions, the more some
> users are encouraged to let them do the work. You pays your money and
> you takes your chances.
>
> Jim
>
>
> On Thu, Nov 30, 2017 at 3:37 AM, Robert Wilkins <iwritecode2 at gmail.com> wrote:
>> R has a very wide audience, clinical research, astronomy, psychology, and
>> so on and so on.
>> I would consider data analysis work to be three stages: data preparation,
>> statistical analysis, and producing the report.
>> This question concerns the first stage, getting the data ready for analysis
>> and reporting, sometimes called "data cleaning" or "data munging" or "data
>> wrangling".
>>
>> So as regards tools for data preparation, speaking to the highly diverse
>> audience mentioned, here is my question:
>>
>> What do you want?
>> Or are you already quite happy with the range of tools that is currently
>> before you?
>>
>> [BTW, I posed the same question last week to the r-devel list, and was
>> advised that r-help might be a more suitable audience by one of the
>> moderators.]
>>
>> Robert Wilkins
>>


