[R] How to transpose it in a fast way?

Fri Mar 8 15:01:58 CET 2013

You could use the fact that scan() reads the data row-wise, while
arrays are stored column-wise:

# generate a small example dataset
exampl <- array(letters[1:25], dim=c(5,5))
write.table(exampl, file="example.dat", row.names=FALSE, col.names=FALSE,
     sep="\t", quote=FALSE)

# and read...
d <- scan("example.dat", what=character())
d <- array(d, dim=c(5,5))

t(exampl) == d

Although this is probably faster, it doesn't help with the large size.
You could use the n argument of scan to read the file in chunks/blocks
and feed those to, for example, an ff array (which you ideally have
preallocated).
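
A rough sketch of the chunked approach, using the same small example.
The block size, the variable names and the temporary file are my own
choices for illustration, and I use a plain preallocated matrix where
you would substitute an ff array for the real data. It assumes the
dimensions of the file are known in advance:

```r
# Same small example as above, written to a temporary file
exampl <- array(letters[1:25], dim=c(5,5))
fn <- tempfile(fileext=".dat")
write.table(exampl, file=fn, row.names=FALSE, col.names=FALSE,
     sep="\t", quote=FALSE)

nr <- 5      # number of rows in the file (assumed known)
nc <- 5      # number of columns in the file (assumed known)
block <- 2   # rows to read per chunk (arbitrary value for the sketch)

# Preallocate the result in its transposed shape: nc rows, nr columns
res <- matrix(NA_character_, nrow=nc, ncol=nr)

con <- file(fn, "r")
row <- 0
repeat {
  # An open connection is read from its current position, so each
  # call to scan picks up where the previous one stopped
  d <- scan(con, what=character(), n=block*nc, quiet=TRUE)
  if (length(d) == 0) break        # end of file
  k <- length(d) / nc              # number of file rows in this chunk
  res[, row + seq_len(k)] <- d     # each file row becomes a column
  row <- row + k
}
close(con)

all(res == t(exampl))
```

Since each chunk is assigned column-wise, the rows of the file end up
as columns of res without ever calling t() on the full data.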

HTH,

Jan

peter dalgaard <pdalgd at gmail.com> wrote:

> On Mar 7, 2013, at 01:18 , Yao He wrote:
>
>> Dear all:
>>
>> I have a big data file of 60000 columns and 60000 rows like that:
>>
>> AA AC AA AA .......AT
>> CC CC CT CT.......TC
>> ..........................
>> .........................
>>
>> I want to transpose it and the output is a new like that
>> AA CC ............
>> AC CC............
>> AA CT.............
>> AA CT.........
>> ....................
>> ....................
>> AT TC.............
>>
>> The keypoint is  I can't read it into R by read.table() because the
>> data is too large,so I try that:
>> c<-file("silygenotype.txt","r")
>> geno_t<-list()
>> repeat{
>>  line<-readLines(c,n=1)
>>  if (length(line)==0)break  #end of file
>>  line<-unlist(strsplit(line,"\t"))
>> geno_t<-cbind(geno_t,line)
>> }
>> write.table(geno_t,"xxx.txt")
>>
>> It works but it is too slow ,how to optimize it???
>
>
> As others have pointed out, that's a lot of data!
>
> You seem to have the right idea: If you read the columns line by  
> line there is nothing to transpose. A couple of points, though:
>
> - The cbind() is a potential performance hit since it copies the  
> list every time around. geno_t <- vector("list", 60000) and then
> geno_t[[i]] <- <etc>
>
> - You might use scan() instead of readLines, strsplit
>
> - Perhaps consider the data type as you seem to be reading strings  
> with 16 possible values (I suspect that R already optimizes string  
> storage to make this point moot, though.)
>
> --
> Peter Dalgaard, Professor
> Center for Statistics, Copenhagen Business School
> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> Phone: (+45)38153501
> Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.