[R] How to transpose it in a fast way?

David Winsemius dwinsemius at comcast.net
Fri Mar 8 20:11:11 CET 2013


On Mar 8, 2013, at 10:59 AM, David Winsemius wrote:

> 
> On Mar 8, 2013, at 9:31 AM, David Winsemius wrote:
> 
>> 
>> On Mar 8, 2013, at 6:01 AM, Jan van der Laan wrote:
>> 
>>> 
>>> You could use the fact that scan reads the data rowwise, and the fact that arrays are stored columnwise:
>>> 
>>> # generate a small example dataset
>>> exampl <- array(letters[1:25], dim=c(5,5))
>>> write.table(exampl, file="example.dat", row.names=FALSE, col.names=FALSE,
>>>  sep="\t", quote=FALSE)
>>> 
>> 
>> This might avoid creation of some of the intermediate copies:
>> 
>> MASS::write.matrix( matrix( scan("example.dat", what=character()), 5,5), file="fil.out")
>> 
>> I tested it up to a 5000 x 5000 file:
>> 
>>> exampl <- array(letters[1:25], dim=c(5000,5000))
>>> MASS::write.matrix( matrix( scan("example.dat", what=character()), 5000,5000), file="fil.out")
>> Read 25000000 items
>>> 
>> 
>> Not sure of the exact timing, probably 5-10 minutes. The exampl object takes 200,001,400 bytes, and the run did not noticeably stress my machine; most of my RAM remained untouched. I'm going out on errands and will run timing on a 10K x 10K test case within a system.time() enclosure. Scan did report successfully reading 100000000 items fairly promptly.
>> 
> 
>> system.time( {MASS::write.matrix( matrix( scan("example.dat", what=character()), 10000,10000), file="fil.out") } )
> Read 100000000 items
>    user   system  elapsed 
> 487.100  912.613 1415.228 
> 
>> system.time( {MASS::write.matrix( matrix( scan("example.dat", what=character()), 500,500), file="fil.out") } )
> Read 250000 items
>   user  system elapsed 
>  1.184   2.507   3.834 
> 
> And so it seems to scale linearly:
> 
>> 3.834 * 100000000/250000
> [1] 1533.6

However, another posting today reminds us that this would best be attempted in a version of R that can handle matrices with more than 2^31-1 elements:

> 10000^2 <= 2^31-1
[1] TRUE
> 60000^2 <= 2^31-1
[1] FALSE

R 3.0 is scheduled for release soon and you can compile it from sources if your machine is properly equipped. It adds support for long vectors (more than 2^31-1 elements) on 64-bit builds, and I _think_ it may support such larger matrices.
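
Here is a minimal sketch of the block-wise idea Jan suggested in the quoted message: read a few rows at a time with scan() and drop them into a preallocated matrix. It assumes a tab-delimited file whose dimensions are known in advance. The function name transpose_in_blocks and its arguments are made up for illustration. Note that it still holds the full result in memory, so on its own it does not get around the 2^31-1 limit; an on-disk array such as ff could take the place of the matrix.

```r
# Hypothetical helper (not from any package): transpose a tab-delimited
# file by reading it in blocks of rows.
transpose_in_blocks <- function(infile, outfile, n_row, n_col, block = 1000L) {
  # Preallocate the transposed result (n_col x n_row) once, up front.
  out <- matrix(NA_character_, nrow = n_col, ncol = n_row)
  con <- file(infile, "r")
  on.exit(close(con))
  done <- 0
  while (done < n_row) {
    k <- min(block, n_row - done)
    # scan() returns the next k rows as one vector in row-major order ...
    x <- scan(con, what = character(), sep = "\t", nlines = k, quiet = TRUE)
    # ... so filling a matrix with n_col rows column by column gives
    # those k input rows already transposed.
    out[, done + seq_len(k)] <- matrix(x, nrow = n_col, ncol = k)
    done <- done + k
  }
  write.table(out, file = outfile, sep = "\t", quote = FALSE,
              row.names = FALSE, col.names = FALSE)
  invisible(out)
}

# Small check on the 5 x 5 example from this thread, using temporary files.
exampl  <- array(letters[1:25], dim = c(5, 5))
infile  <- tempfile()
outfile <- tempfile()
write.table(exampl, file = infile, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
res <- transpose_in_blocks(infile, outfile, n_row = 5, n_col = 5, block = 2)
all(res == t(exampl))
```

The block size only controls how many rows scan() returns per call; the preallocated matrix is never copied and regrown the way a repeated cbind() would do.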

-- 
David.
> 
>> -- 
>> David.
>> 
>>> # and read...
>>> d <- scan("example.dat", what=character())
>>> d <- array(d, dim=c(5,5))
>>> 
>>> t(exampl) == d
>>> 
>>> 
>>> Although this is probably faster, it doesn't help with the large size. You could use the n option of scan to read chunks/blocks and feed those to, for example, an ff array (which you ideally have preallocated).
>>> 
>>> HTH,
>>> 
>>> Jan
>>> 
>>> 
>>> 
>>> 
>>> peter dalgaard <pdalgd at gmail.com> schreef:
>>> 
>>>> On Mar 7, 2013, at 01:18 , Yao He wrote:
>>>> 
>>>>> Dear all:
>>>>> 
>>>>> I have a big data file of 60000 columns and 60000 rows like that:
>>>>> 
>>>>> AA AC AA AA .......AT
>>>>> CC CC CT CT.......TC
>>>>> ..........................
>>>>> .........................
>>>>> 
>>>>> I want to transpose it so that the output is a new file like this:
>>>>> AA CC ............
>>>>> AC CC............
>>>>> AA CT.............
>>>>> AA CT.........
>>>>> ....................
>>>>> ....................
>>>>> AT TC.............
>>>>> 
>>>>> The key point is that I can't read it into R with read.table() because the
>>>>> data is too large, so I tried this:
>>>>> c<-file("silygenotype.txt","r")
>>>>> geno_t<-list()
>>>>> repeat{
>>>>> line<-readLines(c,n=1)
>>>>> if (length(line)==0)break  #end of file
>>>>> line<-unlist(strsplit(line,"\t"))
>>>>> geno_t<-cbind(geno_t,line)
>>>>> }
>>>>> write.table(geno_t,"xxx.txt")
>>>>> 
>>>>> It works, but it is too slow. How can I optimize it?
>>>> 
>>>> 
>>>> As others have pointed out, that's a lot of data!
>>>> 
>>>> You seem to have the right idea: If you read the columns line by line there is nothing to transpose. A couple of points, though:
>>>> 
>>>> - The cbind() is a potential performance hit since it copies the list every time around. Preallocate with geno_t <- vector("list", 60000) and then assign
>>>> geno_t[[i]] <- <etc>
>>>> 
>>>> - You might use scan() instead of readLines, strsplit
>>>> 
>>>> - Perhaps consider the data type as you seem to be reading strings with 16 possible values (I suspect that R already optimizes string storage to make this point moot, though.)
>>>> 
>>>> --
>>>> Peter Dalgaard, Professor
>>>> Center for Statistics, Copenhagen Business School
>>>> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
>>>> Phone: (+45)38153501
>>>> Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
>>>> 
>>>> ______________________________________________
> snipped

> David Winsemius
> Alameda, CA, USA
> 

David Winsemius
Alameda, CA, USA
