[R] Cleaning up messy Excel data

Greg Snow 538280 at gmail.com
Sat Mar 3 21:06:13 CET 2012


Sometimes we adapt to our environment, sometimes we adapt our
environment to us. I like fortune(108).

I actually was suggesting that you add a tool to your toolbox, not limit it.

In my experience (and I don't expect everyone else's to match) data
manipulation that seems easier in Excel than R is only easier until
the client comes back and wants me to redo the whole analysis with one
typo fixed.  Then rerunning the script in R (or Perl or other tool) is
a lot easier than trying to remember where all I clicked, dragged,
selected, etc.

I do use Excel for somethings (though I would be happy to find other
tools for that if it were possible to expunge Excel from the earth)
and Word (I actually like using R2wd to send tables and graphs to word
that I can then give to clients who just want to be able to copy and
paste them to something else), I just think that many of the tasks
that many people use excel for would be better served with a better
tool.

If someone reading this decides to put some more thought into a
project up front and actually design a database up front rather than
letting it evolve into some monstrosity in Excel, and that decision
saves them some later grief, then the world will be a little bit
better place.

On Fri, Mar 2, 2012 at 6:04 PM, jim holtman <jholtman at gmail.com> wrote:
> Unfortunately they only know how to use Excel and Word.  They are not
> folks who use a computer every day.  Many of them run factories or
> warehouses and asking them to use something like Access would not
> happen in my lifetime (I have retired twice already).
>
> I don't have any problems with them "messing" up the data that I send
> them; they are pretty good about making changes within the context of
> the spreadsheet.  The other issue is that I working with people in
> twenty different locations spread across the US, so I might be able to
> one of them to use Access (there is one I know that uses it), but that
> leaves 19 other people I would not be able to communicate with.
>
> The other thing is, is that I use Excel myself to slice/dice data
> since there are things that are easier in Excel than R (believe it or
> not).  There are a number of tools I keep in my toolkit, and R is
> probably the most important, but I have not thrown the rest of them
> away since they still serve a purpose.
>
> So if you can come up with a way to 20 diverse groups, who are not
> computer literate, to change over in a couple of days from Excel to
> Access let me know.  BTW, I tried to use Access once and gave it up
> because it was not as intuitive as some other tools and did not give
> me any more capability than the ones I was using.  So I know I would
> have a problem in convincing other to make the change just so they
> could communicate with me, while they still had to use Excel to most
> of their other interfaces.
>
> This is the real world where you have to learn how to adapt to your
> environment and make the best of it.  So you just have to learn that
> Excel can be your friend (or at least not your enemy) and can serve a
> very useful purpose in getting your ideas across to other people.
>
> On Fri, Mar 2, 2012 at 6:41 PM, Greg Snow <538280 at gmail.com> wrote:
>> Try sending your clients a data set (data frame, table, etc) as an MS
>> Access data table instead.  They can still view the data as a table,
>> but will have to go to much more effort to mess up the data, more
>> likely they will do proper edits without messing anything up (mixing
>> characters in with numbers, have more sexes than your biology teacher
>> told you about, add extra lines at top or bottom that makes reading
>> back into R more difficult, etc.)
>>
>> I have had a few clients that I talked into using MS Access from the
>> start to enter their data, there was often a bit of resistance at
>> first, but once they tried it and went through the process of
>> designing the database up front they ended up thanking me and believed
>> that the entire data entry process was easier and quicker than had the
>> used excel as they originally planned.
>>
>> Access is still part of MS office, so they don't need to learn R or in
>> any way break their chains from being prisoners of bill, but they will
>> be more productive in more ways than just interfacing with you.
>>
>> Access (databases in general) force you to plan things out and do the
>> correct thing from the start.  It is possible to do the right thing in
>> Excel, but Excel does not encourage (let alone force) you to do the
>> right thing, but makes it easy to do the wrong thing.
>>
>> On Thu, Mar 1, 2012 at 6:15 AM, jim holtman <jholtman at gmail.com> wrote:
>>> But there are some important reasons to use Excel.  In my work there
>>> are a lot of people that I have to send the equivalent of a data.frame
>>> to who want to look at the data and possibly slice/dice the data
>>> differently and then send back to me updates.  These folks do not know
>>> how to use R, but do have Microsoft Office installed on their
>>> computers and know how to use the different products.
>>>
>>> I have been very successful in conveying what I am doing for them by
>>> communicating via Excel spreadsheets.  It is also an important medium
>>> in dealing with some international companies who provide data via
>>> Excel and expect responses back via Excel.
>>>
>>> When dealing with data in a tabular form, Excel does provide a way for
>>> a majority of the people I work with to understand the data.  Yes,
>>> there are problems with some of the ways that people use Excel, and
>>> yes I have had to invest time in scrubbing some of the data that I get
>>> from them, but if I did not, then I would probably not have a job
>>> working for them.  I use R exclusively for the analysis that I do, but
>>> find it convenient to use Excel to provide a communication mechanism
>>> to the majority of the non-R users that I have to deal with.  It is a
>>> convenient "work-around" because I would never get them to invest the
>>> time to learn R.
>>>
>>> So in the real world these is a need to Excel and we are not going to
>>> cause it to go away; we have to learn how to live with it, and from my
>>> standpoint, it has definitely benefited me in being able to
>>> communicate with my users and continuing to provide them with results
>>> that they are happy with.  They refer to letting me work my "magic" on
>>> the data; all they know is they see the result via Excel and in the
>>> background R is doing the heavy lifting that they do not have to know
>>> about.
>>>
>>> On Wed, Feb 29, 2012 at 4:41 PM, Rolf Turner <rolf.turner at xtra.co.nz> wrote:
>>>> On 01/03/12 04:43, John Kane wrote:
>>>>>
>>>>> (mydata<- as.factor(c("1","2","3", ">2", "5", ">2")))
>>>>> str(mydata)
>>>>>
>>>>> newdata<- as.character(mydata)
>>>>>
>>>>> newdata[newdata==">2"]<- 0
>>>>> newdata<- as.numeric(newdata)
>>>>> str(newdata)
>>>>>
>>>>> We really need to keep Excel (and other spreadsheets) out of peoples
>>>>> hands.
>>>>
>>>>
>>>> Amen, bro'!!!
>>>>
>>>>    cheers,
>>>>
>>>>        Rolf Turner
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>>
>>>
>>> --
>>> Jim Holtman
>>> Data Munger Guru
>>>
>>> What is the problem that you are trying to solve?
>>> Tell me what you want to do, not how you want to do it.
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>>
>> --
>> Gregory (Greg) L. Snow Ph.D.
>> 538280 at gmail.com
>
>
>
> --
> Jim Holtman
> Data Munger Guru
>
> What is the problem that you are trying to solve?
> Tell me what you want to do, not how you want to do it.



--
Gregory (Greg) L. Snow Ph.D.
538280 at gmail.com



More information about the R-help mailing list