[R] Data cleaning & Data preparation, what do R users want?

Jim Lemon drjimlemon at gmail.com
Thu Nov 30 00:54:53 CET 2017


Hi Robert,
People want different levels of automation in the software they use.
What concerns many of us is the desire for the function
"figure-out-what-this-data-is-import-it-and-get-rid-of-bad-values".
Such users typically want something that justifies its use by being
written by someone who seems to know what they're doing and lots of
other people use it. One advantage of many R functions is their
modular construction. This encourages users to at least consider the
steps that are taken rather than just accept what comes out of that
long tube.

Take the contentious problem of outlier identification. If I just let
the black box peel off some values, I don't know what I have lost. On
the other hand, if I import data and examine it with a summary
function, I may find that one woman has a height of 5.2 meters. I can
range check by looking up the Guinness Book of Records. It's an
outlier. I can estimate the probability of such a height.  Hmm, about
4 standard deviations above the mean. It's an outlier. I can attempt a
Sherlock Holmes. "Watson, I conclude that an imperial measure (5'2")
has been recorded as a metric value". It's not an outlier.

The more R gravitates toward "black box" functions, the more some
users are encouraged to let them do the work.You pays your money and
you takes your chances.

Jim


On Thu, Nov 30, 2017 at 3:37 AM, Robert Wilkins <iwritecode2 at gmail.com> wrote:
> R has a very wide audience, clinical research, astronomy, psychology, and
> so on and so on.
> I would consider data analysis work to be three stages: data preparation,
> statistical analysis, and producing the report.
> This regards the process of getting the data ready for analysis and
> reporting, sometimes called "data cleaning" or "data munging" or "data
> wrangling".
>
> So as regards tools for data preparation, speaking to the highly diverse
> audience mentioned, here is my question:
>
> What do you want?
> Or are you already quite happy with the range of tools that is currently
> before you?
>
> [BTW,  I posed the same question last week to the r-devel list, and was
> advised that r-help might be a more suitable audience by one of the
> moderators.]
>
> Robert Wilkins
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



More information about the R-help mailing list