[R] long to wide on larger data set

Matthew Dowle mdowle at mdowle.plus.com
Mon Jul 12 16:58:40 CEST 2010


Hi Juliet,

Thanks for the info.

It is very slow because of the == in  testData[testData$V2==one_ind,]

Why? Imagine someone looks for 10 people in the phone directory. Would
they search the entire directory for the first person's phone number,
starting on page 1 and looking at every single name, even continuing to the
end of the book after they had found them?  Then would they start again from
page 1 for the 2nd person, and then the 3rd, searching the entire directory
from start to finish for each and every person?  The code using == does
exactly that.  Some of us call that a 'vector scan', and it is a common
reason R is perceived as slow.
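To make the difference concrete, here is a small self-contained sketch
(made-up data, sizes chosen only for illustration; setkey and keyed
subsetting are from the data.table package) :

```r
library(data.table)

# made-up data: 1e6 rows, 1000 individuals
DT <- data.table(V2 = rep(sprintf("cv%04d", 1:1000), each = 1000), x = 1L)

# vector scan: == tests all 1e6 values on every single lookup
system.time(for (g in sprintf("cv%04d", 1:100)) DT[DT$V2 == g, ])

# keyed lookup: sort once, then each lookup is a binary search
setkey(DT, V2)
system.time(for (g in sprintf("cv%04d", 1:100)) DT[g])
```

The second loop should be dramatically faster, and the gap grows with the
number of rows and the number of lookups.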

To do that more efficiently try this :

testData = as.data.table(testData)
setkey(testData,V2)    # sorts data by V2
for (one_ind in mysamples) {
   one_sample <- testData[one_ind,]   # keyed subset (binary search)
   reshape(one_sample, timevar="V1", idvar="V2", direction="wide")
}

or just this :

testData = as.data.table(testData)
setkey(testData,V2)
testData[,reshape(.SD,...), by=V2]

That should solve the vector scanning problem, and get you on to the memory
problems, which will need to be tackled. Since the 4 columns are character,
each is a vector of 4-byte pointers on 32bit, so the object size should be
roughly :

    53860858 * 4 * 4 /1024^3 = 0.8GB

That is more promising to work with in 32bit, so there is hope. [ That 0.8GB
ignores the (likely small) size of the unique strings in the global string
hash (depending on your data). ]
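The back-of-envelope arithmetic can be checked directly in R (the 4-byte
pointer size assumes 32bit R; on 64bit it would be 8 bytes per pointer) :

```r
# 4 character columns * 4 bytes per pointer * 53860858 rows, in GB
53860858 * 4 * 4 / 1024^3    # roughly 0.80 GB, excluding the string cache
```

object.size() on the real object would include the strings themselves, so
expect it to report somewhat more than this.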

It's likely that the as.data.table() fails with out of memory.  That is not
data.table but unique.  There is a change in unique.c in R 2.12 which makes
unique more efficient, and since factor calls unique, it may be necessary to
use R 2.12.

If that still doesn't work, then there are several more tricks (and we will
need further information), and there may be some tweaks needed to that code
as I didn't test it, but I think it should be possible in 32bit using R 2.12.

Is it an option to just keep it in long format and use a data.table ?

   testData[, somecomplexrfunction(onecolumn, anothercolumn), by=list(V2) ]
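For example, staying in long format, something like this computes a
per-individual result without any reshape at all (the summary computed here
is made up purely for illustration) :

```r
library(data.table)
testData <- as.data.table(testData)
setkey(testData, V2)

# hypothetical per-individual summary on the long data:
# number of rows and the alleles pasted together, one row per V2
testData[, list(n = length(V1),
                alleles = paste(V3, V4, sep = "/", collapse = " ")),
         by = list(V2)]
```

Whatever you plan to do with the wide table can often be expressed this way
instead, which avoids the second large copy entirely.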

Why do you need to reshape from long to wide ?

HTH,
Matthew



"Juliet Hannah" <juliet.hannah at gmail.com> wrote in message 
news:AANLkTinYvgMrVdP0SvC-fYlGOGn2RO0OMNuGQbXx_H2b at mail.gmail.com...
Hi Jim,

Thanks for responding. Here is the info I should have included before.
I should be able to access 4 GB.

> str(myData)
'data.frame':   53860857 obs. of  4 variables:
 $ V1: chr  "200003" "200006" "200047" "200050" ...
 $ V2: chr  "cv0001" "cv0001" "cv0001" "cv0001" ...
 $ V3: chr  "A" "A" "A" "B" ...
 $ V4: chr  "B" "B" "A" "B" ...
> sessionInfo()
R version 2.11.0 (2010-04-22)
x86_64-unknown-linux-gnu

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

On Mon, Jul 12, 2010 at 7:54 AM, jim holtman <jholtman at gmail.com> wrote:
> What is the configuration you are running on (OS, memory, etc.)? What
> does your object consist of? Is it numeric, factors, etc.? Provide a
> 'str' of it. If it is numeric, then the size of the object is
> probably about 1.8GB. Doing the long to wide you will probably need
> at least that much additional memory to hold the copy, if not more.
> This would be impossible on a 32-bit version of R.
>
> On Mon, Jul 12, 2010 at 1:25 AM, Juliet Hannah <juliet.hannah at gmail.com> 
> wrote:
>> I have a data set that has 4 columns and 53860858 rows. I was able to
>> read this into R with:
>>
>> cc <- rep("character",4)
>> myData <- 
>> read.table("myData.csv",header=FALSE,skip=1,colClasses=cc,nrow=53860858,sep=",")
>>
>>
>> I need to reshape this data from long to wide. On a small data set the
>> following lines work. But on the real data set, it didn't finish even
>> when I took a sample of two (rows in new data). I didn't receive an
>> error. I just stopped it because it was taking too long. Any
>> suggestions for improvements? Thanks.
>>
>> # start example
>> # i have commented out the write.table statement below
>>
>> testData <- read.table(textConnection("rs9999853,cv0084,A,A
>> rs999986,cv0084,C,B
>> rs9999883,cv0084,E,F
>> rs9999853,cv0085,G,H
>> rs999986,cv0085,I,J
>> rs9999883,cv0085,K,L"),header=FALSE,sep=",")
>> closeAllConnections()
>>
>> mysamples <- unique(testData$V2)
>>
>> for (one_ind in mysamples) {
>> one_sample <- testData[testData$V2==one_ind,]
>> mywide <- reshape(one_sample, timevar = "V1", idvar =
>> "V2",direction = "wide")
>> # write.table(mywide,file
>> ="newdata.txt",append=TRUE,row.names=FALSE,col.names=FALSE,quote=FALSE)
>> }
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide 
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>
>
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem that you are trying to solve?
>


