[R] Staging area for data before read into R

Ted Byers r.ted.byers at gmail.com
Tue Oct 21 21:18:26 CEST 2008


There are tradeoffs no matter what route you take.  

I worked on a project a few years ago, repairing an MS Access DB that had
been constructed, data entry forms and all, by one of the consulting
engineers.  They supported that development because they found that even
with all the power and utility of Excel, in support of data entry, errors
were still much too common, requiring significant time and therefore money
to address.  The more data you have, the more costly it is to find and
repair the data when something goes awry.  Their problem was that it was put
together in haste by a man who knew nothing about RDBMS.  He was learning as
he went.  While what he produced was adequate for a single site with no
turnover in staff, and would have been fine if it was intended for his own
use in his consulting practice, it would inevitably have broken the moment
the service was extended to more than one site or the moment there was any
turnover in staff at all.  The client he delivered it to was a large mining
company that wanted to deploy it on all their mines and ore processing
facilities.  Yes the error rate in data entry went way down, but the mistake
was in trying to deliver a software product to a client without the input of
an experienced software engineer.

You can do validation in Access as you can in Excel, but Excel is not
designed to manage data where Access is, and both are crippled by their
dependance on VB (a seriouusly broken language: fine for scripting MS
Office, but not what you want to develop a real application - not that the
OP wants that anyway).  I don't want to beat on Excel, as it is a useful
tool when used for what it is designed for; and others have pointed out some
hazards when using it.  Dr. Snow is right in recommending going the route of
using an RDBMS and in saying that it isn't that hard to get started.  I'd be
recommending PostgreSQL, though, since it is relatively easy to use, and it
has pl/r (which lets you run R code within stored procedures in the DB)
which carries obvious advantages.

The bottom line is that the best option depends on your objectives and what
you need to do.  You can use Excel quite effectively if you are careful and
know what you're doing.  If you are going to manage data that will require
significant effort to enter, you may want something a bit more robust and
better designed to manage your data.  If you are going to deliver services
based on your software, you need a software engineer to ensure it doesn't
break on your client (that could be quite costly).  Since the OP is
apparently using it for his own purposes, and unlikely to be selling the
data or services based on it, the services of an engineer aren't needed,
though they can be useful if there are concerns about administering the DB
so as to guarantee the security of the data.  I could tell you tales of how
data that cost millions of dollars to collect were almost lost because a
consultant was careless in this regard and made mistakes in handling the
data.  Fortunately, recovery was quick in these cases because my colleagues
were diligent in maintaining backups.  But you get the point.  

Murphy's law says that whatever can go wrong, will.

There are plenty of options, and the OP will need to do what he's most
confortable doing.

If I were in his place, I'd say my data is sacred, and can not be replaced
(just as you can't step into the same stream twice); and therefore I'd use a
RDBMS to manage it, and the very moment it is all entered, I'd make a backup
of both the data (e.g. in MySQL I'd use mysqldump) AND the software, and
copy both backups to two CDs or DVDs.  And, if the data were originally
recorded on paper, I'd be scanning the pages and copying those images onto a
couple CDs or DVDs also: with two copies on optical media, one copy can be
stored in a fireproof vault while the other is in the office ready to be
used should a HDD fail, or some other disaster interrupt my work.  OK, so
I'm paranoid about my data, but I'd rather go the extra mile than risk
losing it.

Cheers,

Ted


Gabor Grothendieck wrote:
> 
> Excel has a data validation facility and also has data input forms to
> facilitate data entry.
> 
> On Tue, Oct 21, 2008 at 1:45 PM, Greg Snow <Greg.Snow at imail.org> wrote:
>> Stephen,
>>
>> One of the big problems with spreadsheets (other than the column limit in
>> some) is that the standard entry mode allows too much flexibility which
>> does nothing to help you avoid data entry errors.  The Webpage:
>> http://www.burns-stat.com/pages/Tutor/spreadsheet_addiction.html has some
>> examples of this going wrong, including one that happened to my group
>> where the column for dates was not preformatted, the dates were entered
>> using European format, and Excel did 2 different wrong things with them
>> making it very difficult to do anything with the data without major extra
>> work.  If you are going to stick with a spreadsheet, then at a minimum
>> you should start by naming all your columns, then formatting each column
>> based on the type of data you expect to be entered there.
>>
>> Going the database route is not that much to learn to get started.  You
>> can use MSAccess, or the OpenOffice database, create a new table and
>> enter the names of each column along with the data type (this is a big
>> advantage in that it will not allow you to enter character data where
>> numbers are expected, forces dates to look like dates, etc.).  It is not
>> that much extra work to enter valid levels for what will become factors
>> (e.g. "Male" and "Female" for sex, so that those are the only values
>> allowed, my current record for datasets entered by others using
>> spreadsheets is 9 sexes).  Then you can pick up more as you go along, but
>> setting up the first database to enter data should only take you an hour
>> or so to learn the basics.
>>
>> Another option is to just use R, the following code gives one approach
>> that could get you started entering data:
>>
>> tmp <- rep( list(character(0), numeric(0)), c(2,5) )
>> names(tmp) <- c( 'ID','Sex', paste('Stream', 1:5, sep='') )
>> tmp <- as.data.frame(tmp)
>> levels(tmp$Sex) <- c("Female","Male")
>> tmp$ID <- as.character(tmp$ID)
>>
>> mydata <- edit(tmp)
>>
>> Hope this helps,
>>
>> --
>> Gregory (Greg) L. Snow Ph.D.
>> Statistical Data Center
>> Intermountain Healthcare
>> greg.snow at imail.org
>> 801.408.8111
>>
>>
>>> -----Original Message-----
>>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
>>> project.org] On Behalf Of stephen sefick
>>> Sent: Monday, October 20, 2008 7:02 PM
>>> To: R Help
>>> Subject: Re: [R] Staging area for data before read into R
>>>
>>> Well, I am going to type in ever value because the data sheets are of
>>> counts of insects that I identified, so I should be okay with
>>> accuracy...  I really just need something that allows for more than
>>> 256 columns as I have encounter over 256 species of insects in even
>>> small streams.  I think calc with it's 1000ish columns will do the
>>> trick... thanks everbody for your help.
>>>
>>> On Mon, Oct 20, 2008 at 8:25 PM, Gabor Grothendieck
>>> <ggrothendieck at gmail.com> wrote:
>>> > There is a list of free spreadsheets with their row and column limits
>>> > at this link:
>>> > http://en.wikipedia.org/wiki/OpenOffice.org_Calc
>>> >
>>> > On Mon, Oct 20, 2008 at 3:13 PM, stephen sefick <ssefick at gmail.com>
>>> wrote:
>>> >> sorry excel 2003 with no immediate update in the future.
>>> >>
>>> >> On Mon, Oct 20, 2008 at 3:12 PM, Gabor Grothendieck
>>> >> <ggrothendieck at gmail.com> wrote:
>>> >>> You didn't say which version of Excel you are using but Excel 2007
>>> >>> allows 16,384 columns.
>>> >>>
>>> >>> On Mon, Oct 20, 2008 at 2:27 PM, stephen sefick <ssefick at gmail.com>
>>> wrote:
>>> >>>> I am wondering if there is a better alternative than Excel for
>>> data
>>> >>>> storage that does not require database knowledge (I will
>>> eventually
>>> >>>> have to learn this, but it is not on my immediate todo list).  I
>>> need
>>> >>>> something that is not limited to 256 columns... I don't need any
>>> of
>>> >>>> the built in functions in excel just a spreadsheet like program
>>> with
>>> >>>> cells that hold data in a data.frame format for a staging area
>>> before
>>> >>>> I get it into R.  Any help would be greatly appreciated.  This is
>>> not
>>> >>>> a direct r question, but all of you folks have more experience
>>> than I
>>> >>>> do and I am having a time finding what I need with google.
>>> >>>> thanks in advance
>>> >>>>
>>> >>>> --
>>> >>>> Stephen Sefick
>>> >>>> Research Scientist
>>> >>>> Southeastern Natural Sciences Academy
>>> >>>>
>>> >>>> Let's not spend our time and resources thinking about things that
>>> are
>>> >>>> so little or so large that all they really do for us is puff us up
>>> and
>>> >>>> make us feel like gods.  We are mammals, and have not exhausted
>>> the
>>> >>>> annoying little problems of being mammals.
>>> >>>>
>>> >>>>                                                                -K.
>>> Mullis
>>> >>>>
>>> >>>> ______________________________________________
>>> >>>> R-help at r-project.org mailing list
>>> >>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> >>>> PLEASE do read the posting guide http://www.R-project.org/posting-
>>> guide.html
>>> >>>> and provide commented, minimal, self-contained, reproducible code.
>>> >>>>
>>> >>>
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Stephen Sefick
>>> >> Research Scientist
>>> >> Southeastern Natural Sciences Academy
>>> >>
>>> >> Let's not spend our time and resources thinking about things that
>>> are
>>> >> so little or so large that all they really do for us is puff us up
>>> and
>>> >> make us feel like gods.  We are mammals, and have not exhausted the
>>> >> annoying little problems of being mammals.
>>> >>
>>> >>                                                                -K.
>>> Mullis
>>> >>
>>> >
>>>
>>>
>>>
>>> --
>>> Stephen Sefick
>>> Research Scientist
>>> Southeastern Natural Sciences Academy
>>>
>>> Let's not spend our time and resources thinking about things that are
>>> so little or so large that all they really do for us is puff us up and
>>> make us feel like gods.  We are mammals, and have not exhausted the
>>> annoying little problems of being mammals.
>>>
>>>                                                                 -K.
>>> Mullis
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-
>>> guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 
> 

-- 
View this message in context: http://www.nabble.com/Staging-area-for-data-before-read-into-R-tp20075962p20097146.html
Sent from the R help mailing list archive at Nabble.com.



More information about the R-help mailing list