[R] How to transpose it in a fast way?

Yao He yao.h.1988 at gmail.com
Wed Mar 13 15:53:54 CET 2013


Thanks for everybody's help!

I learned a lot from this discussion!



2013/3/10 jim holtman <jholtman at gmail.com>:
> Did you check out the 'colbycol' package?
>
> On Fri, Mar 8, 2013 at 5:46 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>
>> On 03/08/2013 06:01 AM, Jan van der Laan wrote:
>>
>>>
>>> You could use the fact that scan() reads the data row-wise, while
>>> arrays are stored column-wise:
>>>
>>> # generate a small example dataset
>>> exampl <- array(letters[1:25], dim=c(5,5))
>>> write.table(exampl, file="example.dat", row.names=FALSE, col.names=FALSE,
>>>      sep="\t", quote=FALSE)
>>>
>>> # and read...
>>> d <- scan("example.dat", what=character())
>>> d <- array(d, dim=c(5,5))
>>>
>>> t(exampl) == d
>>>
>>>
>>> Although this is probably faster, it doesn't help with the large size.
>>> You could use the n option of scan to read chunks/blocks and feed
>>> those to, for example, an ff array (which you ideally have
>>> preallocated).
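>>>
>>> A minimal sketch of that idea (the chunk size, the file name "big.txt",
>>> and coding the genotypes as small integers via match() are my
>>> assumptions, untested at this scale):
>>>
>>> library(ff)
>>>
>>> nr <- 60000L; nc <- 60000L
>>> rowsPerChunk <- 1000L
>>> # the 16 possible genotype strings
>>> codes <- outer(c("A","C","G","T"), c("A","C","G","T"), paste0)
>>>
>>> # preallocated ff array holding the *transposed* matrix, 1 byte per cell
>>> out <- ff(vmode="byte", dim=c(nc, nr))
>>>
>>> con <- file("big.txt", "r")
>>> for (i in seq_len(nr / rowsPerChunk)) {
>>>     v <- scan(con, character(), n=rowsPerChunk * nc, quiet=TRUE)
>>>     # column-major fill turns the row-wise chunk into its transpose
>>>     m <- matrix(match(v, codes), nrow=nc)
>>>     out[, (i - 1L) * rowsPerChunk + seq_len(rowsPerChunk)] <- m
>>> }
>>> close(con)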
>>>
>>
>> I think it's worth asking what the overall goal is; all we get from this
>> exercise is another large file that we can't easily manipulate in R!
>>
>> But nothing like a little challenge. The idea, I think, would be to
>> transpose in chunks of rows, by scanning in some number of rows and
>> writing them to a temporary file:
>>
>>     tpose1 <- function(fin, nrowPerChunk, ncol) {
>>         ## read the next chunk of rows as one flat vector
>>         v <- scan(fin, character(), nmax=ncol * nrowPerChunk)
>>         m <- matrix(v, ncol=ncol, byrow=TRUE)
>>         fout <- tempfile()
>>         ## write() outputs column-major, nrow(m) values per line, so
>>         ## each output line is a column of m, i.e. a row of the transpose
>>         write(m, fout, nrow(m), append=TRUE)
>>         fout
>>     }
>>
>> Apparently the data is 60k x 60k, so we could easily read chunks of
>> 10k rows x 60k columns at a time from some file fl <- "big.txt"
>>
>>     ncol <- 60000L
>>     nrowPerChunk <- 10000L
>>     nChunks <- ncol / nrowPerChunk
>>
>>     fin <- file(fl); open(fin)
>>     fls <- replicate(nChunks, tpose1(fin, nrowPerChunk, ncol))
>>     close(fin)
>>
>> 'fls' is now a vector of file paths, each containing a transposed slice of
>> the matrix. The next task is to splice these together. We could do this by
>> taking a slice of rows from each file, cbind'ing them together, and writing
>> to an output
>>
>>     splice <- function(fout, cons, nrowPerChunk, ncol) {
>>         ## read the next nrowPerChunk rows from each slice file
>>         slices <- lapply(cons, function(con) {
>>             v <- scan(con, character(), nmax=nrowPerChunk * ncol)
>>             matrix(v, nrow=nrowPerChunk, byrow=TRUE)
>>         })
>>         m <- do.call(cbind, slices)
>>         ## as in tpose1: writing t(m) column-major emits m row by row
>>         write(t(m), fout, ncol(m), append=TRUE)
>>     }
>>
>> We'd need to use open connections as inputs and output
>>
>>     cons <- lapply(fls, file); for (con in cons) open(con)
>>     fout <- file("big_transposed.txt"); open(fout, "w")
>>     xx <- replicate(nChunks, splice(fout, cons, nrowPerChunk, nrowPerChunk))
>>     for (con in cons) close(con)
>>     close(fout)
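>>
>> When all chunks have been spliced, the temporary slice files can be
>> removed:
>>
>>     unlink(fls)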
>>
>> As another approach, it looks like the data are genotypes. If they
>> really consist only of pairs of A, C, G, T, then two pairs, e.g. 'AA'
>> and 'CT', could be encoded as a single byte:
>>
>>     alf <- c("A", "C", "G", "T")
>>     nms <- outer(alf, alf, paste0)
>>     map <- outer(setNames(as.raw(0:15), nms),
>>                  setNames(as.raw(bitwShiftL(0:15, 4)), nms),
>>                  "|")
>>
>> with e.g.,
>>
>> > map[matrix(c("AA", "CT"), ncol=2)]
>> [1] d0
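>>
>> Going the other way, a hypothetical helper (not part of the recipe
>> above) recovers the two genotypes by splitting a packed byte into its
>> low and high nibbles:
>>
>>     unpack <- function(b) {
>>         lo <- as.integer(b) %% 16L    # low nibble: first genotype
>>         hi <- as.integer(b) %/% 16L   # high nibble: second genotype
>>         c(nms[lo + 1L], nms[hi + 1L])
>>     }
>>     unpack(map["AA", "CT"])   # "AA" "CT"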
>>
>> This shrinks the problem from representing the 60k x 60k array as a
>> 3.6-billion-element vector of 60k * 60k * 8 bytes (approx. 30 Gbytes)
>> to a vector of 60k x 30k = 1.8 billion packed bytes (within R-2.15's
>> vector-length limit), approx. 1.8 Gbytes (probably usable on an 8
>> Gbyte laptop).
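>>
>> Checking that arithmetic in R:
>>
>>     60000 * 60000              # 3.6e9 cells
>>     60000 * 60000 * 8 / 2^30   # ~27 Gbytes at 8 bytes per cell
>>     60000 * 30000              # 1.8e9 packed cells, < 2^31 - 1
>>     60000 * 30000 / 2^30       # ~1.7 Gbytes as a raw vector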
>>
>> Personally, I would probably put this data in a netcdf / hdf5 file.
>> Perhaps I'd use snpStats or GWASTools from Bioconductor
>> (http://bioconductor.org).
>>
>> Martin
>>
>>
>>> HTH,
>>>
>>> Jan
>>>
>>>
>>>
>>>
>>> peter dalgaard <pdalgd at gmail.com> wrote:
>>>
>>>> On Mar 7, 2013, at 01:18, Yao He wrote:
>>>>
>>>>> Dear all:
>>>>>
>>>>> I have a big data file of 60000 columns and 60000 rows, like this:
>>>>>
>>>>> AA AC AA AA .......AT
>>>>> CC CC CT CT.......TC
>>>>> ..........................
>>>>> .........................
>>>>>
>>>>> I want to transpose it and the output is a new like that
>>>>> AA CC ............
>>>>> AC CC............
>>>>> AA CT.............
>>>>> AA CT.........
>>>>> ....................
>>>>> ....................
>>>>> AT TC.............
>>>>>
>>>>> The key point is that I can't read it into R with read.table() because
>>>>> the data is too large, so I tried this:
>>>>>
>>>>> con <- file("silygenotype.txt", "r")
>>>>> geno_t <- list()
>>>>> repeat {
>>>>>   line <- readLines(con, n=1)
>>>>>   if (length(line) == 0) break  # end of file
>>>>>   line <- unlist(strsplit(line, "\t"))
>>>>>   geno_t <- cbind(geno_t, line)
>>>>> }
>>>>> write.table(geno_t, "xxx.txt")
>>>>>
>>>>> It works, but it is too slow. How can I optimize it?
>>>>>
>>>>
>>>>
>>>> As others have pointed out, that's a lot of data!
>>>>
>>>> You seem to have the right idea: if you read the file line by line
>>>> and store each line as a column, there is nothing to transpose. A
>>>> couple of points, though:
>>>>
>>>> - The cbind() is a potential performance hit, since it copies the
>>>> list every time around. Preallocate with geno_t <- vector("list", 60000)
>>>> and then assign geno_t[[i]] <- <etc> (see the sketch below).
>>>>
>>>> - You might use scan() instead of readLines() and strsplit().
>>>>
>>>> - Perhaps consider the data type, as you seem to be reading strings
>>>> with only 16 possible values. (I suspect that R already optimizes
>>>> string storage to make this point moot, though.)
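>>>>
>>>> A rough sketch combining the first two points (the 60000-row loop
>>>> bound is taken from your description; untested):
>>>>
>>>> con <- file("silygenotype.txt", "r")
>>>> geno_t <- vector("list", 60000)  # preallocated: no copy per iteration
>>>> for (i in seq_len(60000)) {
>>>>     geno_t[[i]] <- scan(con, character(), nlines=1, quiet=TRUE)
>>>> }
>>>> close(con)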
>>>>
>>>> --
>>>> Peter Dalgaard, Professor
>>>> Center for Statistics, Copenhagen Business School
>>>> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
>>>> Phone: (+45)38153501
>>>> Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
>>>>
>>>>
>>>
>>>
>>
>>
>> --
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M1 B861
>> Phone: (206) 667-2793
>>
>>
>
>
>
> --
> Jim Holtman
> Data Munger Guru
>
> What is the problem that you are trying to solve?
> Tell me what you want to do, not how you want to do it.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



-- 
—————————————————————————
Master's candidate, second year
Department of Animal Genetics & Breeding
Room 436, College of Animal Science & Technology,
China Agricultural University, Beijing, 100193
E-mail: yao.h.1988 at gmail.com
——————————————————————————


