[R] Staging area for data before read into R

Greg Snow Greg.Snow at imail.org
Tue Oct 21 19:45:45 CEST 2008


Stephen,

One of the big problems with spreadsheets (other than the column limit in some) is that the standard entry mode allows too much flexibility, which does nothing to help you avoid data entry errors.  The web page http://www.burns-stat.com/pages/Tutor/spreadsheet_addiction.html has some examples of this going wrong, including one that happened to my group: the column for dates was not preformatted, the dates were entered in European format, and Excel did 2 different wrong things with them, which made it very difficult to do anything with the data without major extra work.  If you are going to stick with a spreadsheet, then at a minimum you should start by naming all your columns and then format each column for the type of data you expect to be entered there.
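
The ambiguity behind this kind of date problem can be reproduced in R itself. The following is only a sketch with made-up date strings (they are not from the dataset mentioned above): the same text read under two explicit formats gives two different dates, and a day larger than 12 cannot be read as a month at all.

```r
# The same string read under two explicit formats gives two different dates.
entered <- "03/04/2008"                                # made-up example value
as_european <- as.Date(entered, format = "%d/%m/%Y")   # day/month/year
as_us       <- as.Date(entered, format = "%m/%d/%Y")   # month/day/year

# A "month" of 21 does not exist, so this conversion yields NA
# instead of silently swapping day and month.
impossible <- as.Date("21/10/2008", format = "%m/%d/%Y")

print(as_european)        # 3 April 2008
print(as_us)              # 4 March 2008
print(is.na(impossible))  # TRUE
```

Stating the format explicitly when the data are read, rather than letting software guess, is what avoids mixing the two interpretations within one column.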

Going the database route does not take much learning to get started.  You can use MS Access or the OpenOffice database: create a new table and enter the name of each column along with its data type.  This is a big advantage, because the database will not allow you to enter character data where numbers are expected, forces dates to look like dates, etc.  It is not much extra work to enter the valid levels for what will become factors (e.g. "Male" and "Female" for sex), so that those are the only values allowed; my current record for datasets entered by others using spreadsheets is 9 sexes.  You can pick up more as you go along, but learning enough of the basics to set up a first database for data entry should only take you an hour or so.
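
If the data have already been typed into a spreadsheet without such constraints, a similar check on allowed levels can be applied after the fact in R. This is a rough sketch only: the helper name find_invalid_levels and the sample entries are made up for illustration and are not part of any package.

```r
# Hypothetical helper: list the entries in `x` that are not in `allowed`.
find_invalid_levels <- function(x, allowed) {
  x <- trimws(as.character(x))            # ignore stray leading/trailing spaces
  bad <- !is.na(x) & !(x %in% allowed)    # missing values are not flagged here
  data.frame(row = which(bad), value = x[bad], stringsAsFactors = FALSE)
}

# Made-up entries of the kind a free-form spreadsheet column collects.
sex_entered <- c("Male", "female", "F", "Female", "male ", "M", NA)

problems <- find_invalid_levels(sex_entered, allowed = c("Female", "Male"))
print(problems)

# Number of distinct "sexes" actually present in the raw column.
n_distinct <- length(unique(na.omit(sex_entered)))
print(n_distinct)
```

A database refuses such values at entry time, which is the stronger guarantee; a check like this only finds them afterwards.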

Another option is to just use R.  The following code gives one approach that could get you started entering data:

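# Empty template: two character columns (ID, Sex) and five numeric columns
tmp <- rep( list(character(0), numeric(0)), c(2,5) )
names(tmp) <- c( 'ID','Sex', paste('Stream', 1:5, sep='') )
# Keep strings as character (R >= 4.0.0 no longer converts them to factors)
tmp <- as.data.frame(tmp, stringsAsFactors = FALSE)
# Make Sex a factor whose only allowed values are "Female" and "Male"
tmp$Sex <- factor(tmp$Sex, levels = c("Female","Male"))

# Opens the spreadsheet-like data editor (interactive sessions only)
mydata <- edit(tmp)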
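
The same template idea should scale to the wider layout described in the quoted message below (more than 256 species columns). What follows is a sketch under assumptions: the column names SampleID and Species1 ... Species300, the count of 300, and the sample values are all made up, and the row is built in code rather than typed into the editor so that the example runs non-interactively.

```r
n_species <- 300   # illustrative; any number above the 256-column limit
species_cols <- paste("Species", seq_len(n_species), sep = "")

# Empty template: one character ID column plus one integer column per species.
template <- rep(list(character(0), integer(0)), c(1, n_species))
names(template) <- c("SampleID", species_cols)
template <- as.data.frame(template, stringsAsFactors = FALSE)

# One made-up sample, built without the interactive editor.
counts <- as.list(setNames(rep(0L, n_species), species_cols))
counts$Species17 <- 4L
sample1 <- data.frame(SampleID = "S001", counts, stringsAsFactors = FALSE)
entered <- rbind(template, sample1)

# Round-trip through a CSV file to confirm that no columns are lost.
csv_path <- tempfile(fileext = ".csv")
write.csv(entered, csv_path, row.names = FALSE)
reread <- read.csv(csv_path, stringsAsFactors = FALSE)
print(dim(reread))
```

Keeping the entered data in a plain CSV file sidesteps the spreadsheet column limit, since neither write.csv nor read.csv imposes one.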

Hope this helps,

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111


> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of stephen sefick
> Sent: Monday, October 20, 2008 7:02 PM
> To: R Help
> Subject: Re: [R] Staging area for data before read into R
>
> Well, I am going to type in every value because the data sheets are of
> counts of insects that I identified, so I should be okay with
> accuracy...  I really just need something that allows for more than
> 256 columns as I have encountered over 256 species of insects in even
> small streams.  I think Calc with its 1000ish columns will do the
> trick... thanks everybody for your help.
>
> On Mon, Oct 20, 2008 at 8:25 PM, Gabor Grothendieck
> <ggrothendieck at gmail.com> wrote:
> > There is a list of free spreadsheets with their row and column limits
> > at this link:
> > http://en.wikipedia.org/wiki/OpenOffice.org_Calc
> >
> > On Mon, Oct 20, 2008 at 3:13 PM, stephen sefick <ssefick at gmail.com> wrote:
> >> Sorry, Excel 2003, with no upgrade planned in the immediate future.
> >>
> >> On Mon, Oct 20, 2008 at 3:12 PM, Gabor Grothendieck
> >> <ggrothendieck at gmail.com> wrote:
> >>> You didn't say which version of Excel you are using but Excel 2007
> >>> allows 16,384 columns.
> >>>
> >>> On Mon, Oct 20, 2008 at 2:27 PM, stephen sefick <ssefick at gmail.com> wrote:
> >>>> I am wondering if there is a better alternative than Excel for data
> >>>> storage that does not require database knowledge (I will eventually
> >>>> have to learn this, but it is not on my immediate todo list).  I need
> >>>> something that is not limited to 256 columns... I don't need any of
> >>>> the built-in functions in Excel, just a spreadsheet-like program with
> >>>> cells that hold data in a data.frame format for a staging area before
> >>>> I get it into R.  Any help would be greatly appreciated.  This is not
> >>>> a direct R question, but all of you folks have more experience than I
> >>>> do and I am having a time finding what I need with Google.
> >>>> Thanks in advance
> >>>>
> >>>> --
> >>>> Stephen Sefick
> >>>> Research Scientist
> >>>> Southeastern Natural Sciences Academy
> >>>>
> >>>> Let's not spend our time and resources thinking about things that are
> >>>> so little or so large that all they really do for us is puff us up and
> >>>> make us feel like gods.  We are mammals, and have not exhausted the
> >>>> annoying little problems of being mammals.
> >>>>
> >>>>                                                                -K. Mullis
> >>>>
> >>>> ______________________________________________
> >>>> R-help at r-project.org mailing list
> >>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >>>> and provide commented, minimal, self-contained, reproducible code.
> >>>>
> >>>
> >>
> >
>


