[R] Adding column to dataframe

jim holtman jholtman at gmail.com
Thu Aug 19 12:32:47 CEST 2010


I think you are probably paging on your system.  Turn on your
performance metrics and check.  If the object you are processing
is all numeric, it would require about 3.5GB of space (roughly 50% of
available memory).  Provide the 'str' and 'object.size' output for the
object so that we can see what you are working with.  My rule of thumb
is that no single object should take more than 25-30% of memory, since
copies may be made.  So the reason things are taking 20 minutes may be
that you are paging.  It is always good to break the problem into
pieces to see what is happening.  Read in only 25% of the data and
time it; then 50% and so on.  In any performance-related problem you
need to determine where the "knee of the curve" is.  Never undertake
processing the large data file all at once; start with some pieces and
work up so that you know what to expect.
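
Something along these lines is what I have in mind.  It is only a
sketch: I do not have your 'morbidity' data, so it builds a small
all-numeric stand-in ('dat', with made-up dimensions), and the names
'est_gb', 'fractions' and 'timings' are mine.  It first works out the
size estimate (rows * columns * 8 bytes per numeric value), then shows
'str' and 'object.size', then times the column assignment on growing
fractions of the rows.

```r
# Sketch only: 'dat' is a small hypothetical stand-in for 'morbidity'.

# Back-of-envelope size of an all-numeric object: rows * cols * 8 bytes.
est_gb <- 1775683 * 264 * 8 / 1024^3
print(round(est_gb, 2))

# Small all-numeric stand-in (dimensions are arbitrary).
n_rows <- 20000
n_cols <- 20
dat <- as.data.frame(matrix(runif(n_rows * n_cols), nrow = n_rows))

# Inspect the structure and the actual memory footprint.
str(dat, list.len = 5)
print(object.size(dat), units = "Mb")

# Time adding a Date column on 25%, 50%, 75% and 100% of the rows.
fractions <- c(0.25, 0.50, 0.75, 1.00)
timings <- sapply(fractions, function(f) {
  part <- dat[seq_len(floor(f * n_rows)), , drop = FALSE]
  new_col <- as.Date(part[[1]] * 1000, origin = "1960-01-01")
  system.time(part$new_col <- new_col)[["elapsed"]]
})
print(data.frame(fraction = fractions, elapsed_sec = timings))
```

On the real data, replace 'dat' with your object and watch for the
fraction at which the elapsed time stops growing in proportion to the
number of rows; that is likely where paging starts.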

On Wed, Aug 18, 2010 at 9:46 PM, Matt Cooper <mattcstats at gmail.com> wrote:
> Two questions:
> 1) Are there any good R guides/sites with information/techniques for dealing
> with large datasets in R? (Large being ~2 mil rows and ~200 columns)
>
> 2) My specific problem with this dataset.
>
> I am essentially trying to convert a date and add it to a data frame. I
> imagine any 'data manipulation on a column within dataframe into a new
> column' will present the same issue, be it as.Date or anything else.
>
> I have a dataset, size
>
>> dim(morbidity)
> [1] 1775683     264
>
> This was read in from a STATA .dta file. The dates have come in as the
> number of ms from 1960 so I have the following to convert these to usable
> dates.
>
> as.Date(morbidity$adm_date / (100*10*60*60*24), origin="1960-01-01")
>
> when I store this as a vector it is near instant, <5 seconds
> test <- as.Date(etc)
> when I place it over itself it takes ~20 minutes
> morbidity$adm_date <- as.Date(etc)
> when I place the vector over it (so no computation involved), or place it as
> a new column it still takes ~20 minutes
> morbidity$adm_date <- test
> morbidity$new_col <- test
> when I tried a cbind to add it that way it took >20 minutes
> new_morb <- cbind(morbidity,test)
>
> Has anyone done something similar or know of a different command that should
> work faster? I can't get my head around what R is doing, if it can create
> the vector instantly then the computation is quite simple, I don't
> understand why then adding it as a column to a dataframe can take that long.
>
> R64 bit on mac os x, 2.4 GHz dual core, 8gb ram so more than enough
> resources.
>
> Thanks
> Matt
>



-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?


