[R] How to transpose it in a fast way?

Peter Langfelder peter.langfelder at gmail.com
Thu Mar 7 01:56:25 CET 2013


On Wed, Mar 6, 2013 at 4:18 PM, Yao He <yao.h.1988 at gmail.com> wrote:
> Dear all:
>
> I have a big data file of 60000 columns and 60000 rows like this:
>
> AA AC AA AA .......AT
> CC CC CT CT.......TC
> ..........................
> .........................
>
> I want to transpose it, and the output is a new file like this:
> AA CC ............
> AC CC............
> AA CT.............
> AA CT.........
> ....................
> ....................
> AT TC.............
>
> The key point is that I can't read it into R with read.table()
> because the data is too large, so I tried this:
> con <- file("silygenotype.txt", "r")
> geno_t <- list()
> repeat {
>   line <- readLines(con, n = 1)
>   if (length(line) == 0) break  # end of file
>   line <- unlist(strsplit(line, "\t"))
>   geno_t <- cbind(geno_t, line)
> }
> close(con)
> write.table(geno_t, "xxx.txt")
>
> It works, but it is too slow. How can I optimize it?

I hate to be negative, but this will also not work on a 60000 x 60000
matrix. At some point R will complain either about lack of memory or
about you trying to allocate a vector that is too long.

I think your best bet is to look at file-backed data packages (for
example, the bigmemory package). Look at this URL:
http://cran.r-project.org/web/views/HighPerformanceComputing.html and
scroll down to "Large memory and out-of-memory data". Some of the
packages there may have the functionality you are looking for and may
do it faster than your code.
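As a minimal, untested sketch: bigmemory's file-backed matrices can
only store numeric types, so assuming you first recode the two-letter
genotypes as small integers (the genotype codes and file names below
are made up for illustration), you could fill the transpose column by
column as you read:

  library(bigmemory)

  # assumed genotype alphabet; extend as needed
  codes <- c("AA", "AC", "AG", "AT", "CC", "CG", "CT", "GG", "GT", "TT")

  # 1-byte integers ("char") keep the backing file at 60000 * 60000 = 3.6 GB
  bm <- filebacked.big.matrix(nrow = 60000, ncol = 60000, type = "char",
                              backingfile = "geno_t.bin",
                              descriptorfile = "geno_t.desc")

  con <- file("silygenotype.txt", "r")
  i <- 0
  repeat {
    line <- readLines(con, n = 1)
    if (length(line) == 0) break
    i <- i + 1
    bm[, i] <- match(unlist(strsplit(line, "\t")), codes)  # row i -> column i
  }
  close(con)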

If this doesn't help, you _may_ be able to make your code work, albeit
slowly, if you replace the cbind() by a data.frame. cbind() will in
this case produce a matrix, and matrices are limited to 2^31 elements,
which is less than 60000 times 60000 = 3.6 * 10^9. A data.frame is a
special type of list in which each of the 60000 columns is a separate
vector, each well under the limit, and so _may_ be able to hold that
many elements, given enough system RAM. There are experts on this list
who will correct me if I'm wrong.
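A minimal, untested sketch of that list-based version (building the
list is cheap; converting 60000 columns to a data.frame at the end is
itself slow, so you may prefer to write the list out directly):

  con <- file("silygenotype.txt", "r")
  geno_t <- list()
  i <- 0
  repeat {
    line <- readLines(con, n = 1)
    if (length(line) == 0) break  # end of file
    i <- i + 1
    geno_t[[i]] <- unlist(strsplit(line, "\t"))  # row i -> output column i
  }
  close(con)
  names(geno_t) <- paste0("V", seq_along(geno_t))
  geno_df <- as.data.frame(geno_t, stringsAsFactors = FALSE)
  write.table(geno_df, "xxx.txt")

No single vector here is longer than 60000 elements, so the 2^31 limit
never applies, but the total of 3.6e9 short strings still has to fit
in RAM.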

If you are on a linux system, you can use split (type 'man split' at
the shell prompt to see help) to split the file into smaller chunks
of, say, 5000 lines each. Transpose each chunk separately, write it
into a separate output file, then use the linux utility paste to
"paste" the files side-by-side into the final output.
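For example, after something like 'split -l 5000 silygenotype.txt
chunk_' at the shell, the per-chunk step could look like this untested
sketch (the chunk file names are whatever split produced):

  lines <- readLines("chunk_aa")
  # one matrix row per input line; fixed = TRUE speeds up strsplit
  block <- do.call(rbind, strsplit(lines, "\t", fixed = TRUE))
  write.table(t(block), "chunk_aa_t.txt", sep = "\t", quote = FALSE,
              row.names = FALSE, col.names = FALSE)

Then 'paste chunk_*_t.txt > transposed.txt' glues the transposed
chunks together side-by-side. Each 5000-line chunk still holds
5000 * 60000 = 3e8 strings in memory, so shrink the chunk size if RAM
is tight.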

Further, if you want to make your own code faster, do not grow geno_t
by cbind'ing a new column to it in each iteration. Pre-allocate a
matrix or data frame with the appropriate number of rows and columns
and fill it in as you go. But it will still be slow, which I think is
due to the inherent slowness of calling readLines one line at a time,
and possibly of strsplit.
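Here is a rough, untested sketch of that pre-allocation pattern, shown
on a hypothetical 5000-line chunk so that the matrix stays below the
2^31 element limit:

  n_lines <- 5000                      # assumed chunk size
  out <- matrix(NA_character_, nrow = 60000, ncol = n_lines)
  con <- file("chunk_aa", "r")         # hypothetical chunk from 'split'
  for (j in seq_len(n_lines)) {
    line <- readLines(con, n = 1)
    if (length(line) == 0) break       # shorter last chunk: trim unused columns
    out[, j] <- strsplit(line, "\t", fixed = TRUE)[[1]]  # fill in place, no cbind
  }
  close(con)

Reading a whole block of lines with a single readLines(con, n = 5000)
call, as in the previous sketch, avoids most of the per-line overhead.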

HTH,

Peter


