[R] How to transpose it in a fast way?

Ista Zahn istazahn at gmail.com
Thu Mar 7 02:13:08 CET 2013


On Wed, Mar 6, 2013 at 7:56 PM, Peter Langfelder
<peter.langfelder at gmail.com> wrote:
> On Wed, Mar 6, 2013 at 4:18 PM, Yao He <yao.h.1988 at gmail.com> wrote:
>> Dear all:
>>
>> I have a big data file of 60000 columns and 60000 rows, like this:
>>
>> AA AC AA AA .......AT
>> CC CC CT CT.......TC
>> ..........................
>> .........................
>>
>> I want to transpose it, so that the output is a new file like this:
>> AA CC ............
>> AC CC............
>> AA CT.............
>> AA CT.........
>> ....................
>> ....................
>> AT TC.............
>>
>> The key point is that I can't read it into R with read.table() because
>> the data is too large, so I tried this:
>> con <- file("silygenotype.txt", "r")
>> geno_t <- list()
>> repeat {
>>   line <- readLines(con, n = 1)
>>   if (length(line) == 0) break  # end of file
>>   line <- unlist(strsplit(line, "\t"))
>>   geno_t <- cbind(geno_t, line)  # grows the result by one column per input row
>> }
>> close(con)
>> write.table(geno_t, "xxx.txt")
>>
>> It works, but it is too slow. How can I optimize it?
>
> I hate to be negative, but this will also not work on a 60000 x 60000
> matrix. At some point R will complain either about the lack of memory
> or about you trying to allocate a vector that is too long.

Maybe this depends on the R version. I have not tried it, but the dev
version of R can handle much larger vectors. See
http://stat.ethz.ch/R-manual/R-devel/library/base/html/LongVectors.html
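
For concreteness, the full matrix would indeed overflow the old vector
length limit; a quick check in R:

  60000 * 60000                           # 3.6e9 cells
  .Machine$integer.max                    # 2147483647, i.e. 2^31 - 1
  60000 * 60000 > .Machine$integer.max    # TRUE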

Yao He, if you are feeling adventurous you could give the development
version of R a try.

Best,
Ista

>
> I think your best bet is to look at file-backed data packages (for
> example, package bigmemory). Look at this URL:
> http://cran.r-project.org/web/views/HighPerformanceComputing.html and
> scroll down to "Large memory and out-of-memory data". Some of the
> packages may have the functionality you are looking for and may do it
> faster than your code.
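
A rough, untested sketch of the bigmemory idea: a big.matrix stores
numeric types rather than character strings, so this assumes the
genotypes are first recoded as small integers (the coding below is made
up for illustration):

  library(bigmemory)

  n <- 60000
  ## file-backed matrix: lives on disk, so it is not limited by RAM
  geno <- filebacked.big.matrix(nrow = n, ncol = n, type = "short",
                                backingfile = "geno_t.bin",
                                descriptorfile = "geno_t.desc")
  ## hypothetical recoding of two-letter genotypes to integers
  codes <- c(AA = 1L, AC = 2L, AT = 3L, CC = 4L, CT = 5L, TC = 6L)
  con <- file("silygenotype.txt", "r")
  for (i in seq_len(n)) {
    fields <- unlist(strsplit(readLines(con, n = 1), "\t"))
    geno[, i] <- codes[fields]   # input row i becomes output column i
  }
  close(con)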
>
> If this doesn't help, you _may_ be able to make your code work, albeit
> slowly, if you replace the cbind() with a data.frame. cbind() will in this
> case produce a matrix, and matrices are limited to 2^31 elements,
> which is less than 60000 times 60000. A data.frame is a special type
> of list and so _may_ be able to handle that many elements, given
> enough system RAM. There are experts on this list who will correct me
> if I'm wrong.
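
An untested sketch of that list/data.frame variant (it still needs the
whole result in memory, so treat it as illustration only):

  con <- file("silygenotype.txt", "r")
  geno_t <- list()
  i <- 0
  repeat {
    line <- readLines(con, n = 1)
    if (length(line) == 0) break   # end of file
    i <- i + 1
    geno_t[[i]] <- unlist(strsplit(line, "\t"))
  }
  close(con)
  names(geno_t) <- paste0("V", seq_along(geno_t))
  geno_t <- as.data.frame(geno_t, stringsAsFactors = FALSE)
  write.table(geno_t, "xxx.txt")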
>
> If you are on a Linux system, you can use split (type man split at the
> shell prompt to see help) to split the file into smaller chunks of, say,
> 5000 lines each. Process each chunk separately, write it into a
> separate output file, then use the Linux utility paste to "paste" the
> files side by side into the final output.
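
A sketch of that pipeline, driven from R via system(); the chunk file
names are whatever split produces by default (chunk_aa, chunk_ab, ...):

  ## split the input into 5000-line chunks: chunk_aa, chunk_ab, ...
  system("split -l 5000 silygenotype.txt chunk_")
  ## transpose each chunk_* into a matching out_* file (see the
  ## pre-allocation sketch below), then glue them together column-wise
  system("paste out_* > transposed.txt")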
>
> Further, if you want to make it faster, do not grow geno_t by
> cbind'ing a new column to it in each iteration. Pre-allocate a matrix
> or data frame of an appropriate number of rows and columns and fill it
> out as you go. But it will still be slow, which I think is due to the
> inherent slowness of readLines and possibly strsplit.
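
An untested sketch of the pre-allocated version for a single 5000-line
chunk (the file names are assumptions, tying in with the split/paste
idea above):

  n_lines <- 5000     # lines in this chunk
  n_cols  <- 60000    # fields per line
  con <- file("chunk_aa", "r")
  geno_t <- matrix("", nrow = n_cols, ncol = n_lines)
  for (i in seq_len(n_lines)) {
    geno_t[, i] <- unlist(strsplit(readLines(con, n = 1), "\t"))
  }
  close(con)
  write.table(geno_t, "out_aa", quote = FALSE, sep = "\t",
              row.names = FALSE, col.names = FALSE)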
>
> HTH,
>
> Peter
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.


