[R] long to wide on larger data set

jim holtman jholtman at gmail.com
Mon Jul 12 16:51:51 CEST 2010


You might want to run 'object.size' on myData to see how big it is.
If you try to run reshape again, check whether any paging is happening
on your system, which may indicate that you don't have enough memory.
Also, with 53M observations, reshape may take a long time just to work
out how to do the reshape.
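
A minimal sketch of that size check, assuming a small made-up stand-in
for myData (the generated values and the variable names n, sz,
bytes_per_row and est_gb are illustrative, not from the original
post).  Because R shares identical strings between character vectors,
scaling the per-row cost up to the full row count gives only a rough
estimate.  Paging itself has to be watched with an operating system
tool such as top or vmstat while reshape runs.

```r
# Hypothetical stand-in for myData: four character columns, far fewer rows.
n <- 1e5
myData <- data.frame(
  V1 = as.character(sample(200000:299999, n, replace = TRUE)),
  V2 = sprintf("cv%04d", sample(1:50, n, replace = TRUE)),
  V3 = sample(c("A", "B"), n, replace = TRUE),
  V4 = sample(c("A", "B"), n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Size of the object as R stores it
sz <- object.size(myData)
print(sz, units = "Mb")

# Rough per-row cost, scaled to the row count reported by str(myData)
bytes_per_row <- as.numeric(sz) / nrow(myData)
est_gb <- bytes_per_row * 53860857 / 1024^3
cat("estimated size of full data set (GB):", round(est_gb, 2), "\n")
```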

You can also approach the problem in parts.  Take 10K observations and
see how long it takes and how much memory is used; then 100K, then 1M.
This may give you an idea of the growth in both time and memory.
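
The stepwise timing could be sketched as follows.  This assumes
made-up data shaped like myData (one row per marker/sample pair), and
the helper make_long, the 200 markers per sample and the sizes used
are hypothetical; they are scaled down so the sketch runs quickly and
should be replaced with 10K, 100K and 1M rows of the real data.  Note
that object.size on the result shows only the size of the wide table,
not the peak memory used while reshape is running.

```r
# Hypothetical long-format data shaped like myData: one row per
# (marker V1, sample V2) pair, sorted by sample.
make_long <- function(n_samples, n_markers = 200) {
  data.frame(
    V1 = rep(sprintf("rs%05d", seq_len(n_markers)), times = n_samples),
    V2 = rep(sprintf("cv%04d", seq_len(n_samples)), each = n_markers),
    V3 = sample(c("A", "B"), n_samples * n_markers, replace = TRUE),
    V4 = sample(c("A", "B"), n_samples * n_markers, replace = TRUE),
    stringsAsFactors = FALSE
  )
}

sizes <- c(1000, 2000, 4000)   # scaled-down stand-ins for 10K, 100K, 1M
results <- data.frame(rows = sizes, elapsed = NA_real_, wide_mb = NA_real_)

for (i in seq_along(sizes)) {
  long <- make_long(n_samples = sizes[i] / 200)
  # Time the long-to-wide reshape for this subset size
  timing <- system.time(
    wide <- reshape(long, timevar = "V1", idvar = "V2", direction = "wide")
  )
  results$elapsed[i] <- timing[["elapsed"]]
  results$wide_mb[i] <- as.numeric(object.size(wide)) / 1024^2
}

# One row per subset size: elapsed seconds and size of the wide result
print(results)
```

Plotting elapsed against rows from such a table shows whether the cost
grows roughly linearly or faster, which is what determines whether the
full data set is feasible.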

When you have something really big, it is a good idea to start with a
subset and see what resources are used.  This will give you an idea of
how much it will take for the complete set.

When you do the runs, report back on the memory and CPU time required.

On Mon, Jul 12, 2010 at 9:19 AM, Juliet Hannah <juliet.hannah at gmail.com> wrote:
> Hi Jim,
>
> Thanks for responding. Here is the info I should have included before.
> I should be able to access 4 GB.
>
>> str(myData)
> 'data.frame':   53860857 obs. of  4 variables:
>  $ V1: chr  "200003" "200006" "200047" "200050" ...
>  $ V2: chr  "cv0001" "cv0001" "cv0001" "cv0001" ...
>  $ V3: chr  "A" "A" "A" "B" ...
>  $ V4: chr  "B" "B" "A" "B" ...
>> sessionInfo()
> R version 2.11.0 (2010-04-22)
> x86_64-unknown-linux-gnu
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> On Mon, Jul 12, 2010 at 7:54 AM, jim holtman <jholtman at gmail.com> wrote:
>> What is the configuration you are running on (OS, memory, etc.)?  What
>> does your object consist of?  Is it numeric, factors, etc.?  Provide a
>> 'str' of it.  If it is numeric, then the size of the object is
>> probably about 1.8GB.  Doing the long to wide you will probably need
>> at least that much additional memory to hold the copy, if not more.
>> This would be impossible on a 32-bit version of R.
>>
>> On Mon, Jul 12, 2010 at 1:25 AM, Juliet Hannah <juliet.hannah at gmail.com> wrote:
>>> I have a data set that has 4 columns and 53860858 rows. I was able to
>>> read this into R with:
>>>
>>> cc <- rep("character",4)
>>> myData <- read.table("myData.csv",header=FALSE,skip=1,colClasses=cc,nrows=53860858,sep=",")
>>>
>>>
>>> I need to reshape this data from long to wide. On a small data set the
>>> following lines work. But on the real data set, it didn't finish even
>>> when I restricted it to two samples (that is, two rows in the new
>>> data). I didn't receive an error. I just stopped it because it was
>>> taking too long. Any suggestions for improvements? Thanks.
>>>
>>> # start example
>>> # i have commented out the write.table statement below
>>>
>>> testData <- read.table(textConnection("rs9999853,cv0084,A,A
>>> rs999986,cv0084,C,B
>>> rs9999883,cv0084,E,F
>>> rs9999853,cv0085,G,H
>>> rs999986,cv0085,I,J
>>> rs9999883,cv0085,K,L"),header=FALSE,sep=",")
>>> closeAllConnections()
>>>
>>> mysamples <- unique(testData$V2)
>>>
>>> for (one_ind in mysamples) {
>>>   one_sample <- testData[testData$V2==one_ind,]
>>>   mywide <- reshape(one_sample, timevar = "V1", idvar = "V2",
>>>                     direction = "wide")
>>>   # write.table(mywide,file="newdata.txt",append=TRUE,row.names=FALSE,col.names=FALSE,quote=FALSE)
>>> }
>>>
>>>
>



-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?


