[R] help using scan on large matrix (caveats to what has been discussed before)

Martin Tomko martin.tomko at geo.uzh.ch
Thu Aug 12 15:35:32 CEST 2010


Hi baptiste,
thanks a lot. Could you please comment on that code? I cannot figure out 
what it does. Apart from the file name, what parameters does it need? 
It seems to me that you need to know the size of the table a priori. Is 
that right? Do you have to choose the block size accordingly (so 
that full multiples of the block make up the resulting frame)?
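
Tracing the block arithmetic by hand, it looks as if the block size does not have to divide the row count: the last block simply takes the remainder. Below is a minimal sketch of how I read it, a compact restatement of the quoted blocks() helper rather than the original code. The variable names nb and skip are mine, and the row counts are made-up examples.

```r
# Split n rows into blocks of at most `size` rows.
# The last block holds whatever remainder is left over.
blocks <- function(n, size) {
  full <- rep(size, n %/% size)   # the complete blocks
  rest <- n %% size               # rows left after the complete blocks
  if (rest > 0) c(full, rest) else full
}

nb <- blocks(1003, 500)    # 500 500 3
skip <- c(0, cumsum(nb))   # 0 500 1000 1003

# Block ii would then be read with nrows = nb[ii] and skip = skip[ii],
# so the three blocks cover rows 1-500, 501-1000 and 1001-1003.
```

If that reading is right, the only thing that must be known up front is the total number of rows, not a "convenient" block size.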
Cheers
Martin

On 8/12/2010 2:45 PM, baptiste Auguié wrote:
> Hi,
>
> I don't know if this can be useful to you, but I recently wrote a small function to read a large data file like yours in a number of steps, with the option of saving each intermediate block as an .rda file. It is based on read.table, which is not as efficient as the lower-level scan() but might be good enough:
>
> file<- 'test.txt'
> ## write.table(matrix(rnorm(1e6*14), ncol=14), file=file,row.names = F,
> ##             col.names = F )
>
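> ## count the lines with wc (Unix-like systems only); reading from stdin keeps the file name out of the output
> n<- as.numeric(gsub("[^0-9]","", system(paste("wc -l <", shQuote(file)), intern=TRUE)))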
> n
>
> ## split n rows into blocks of at most `size` rows; the last block holds the remainder
> blocks<- function(n=18, size=5){
> res<- rep(size, n%/%size)
> if(n%%size) res<- c(res, n%%size)
> if(!sum(res) == n) stop("block sizes do not sum to n")
> res
> }
> ## blocks(1003, 500)
>
>
> readBlocks<- function(file, nbk=1e5, out="tmp", save.inter=TRUE,
>                         classes= c("numeric", "numeric", rep("NULL", 6),
>                           "numeric", "numeric", rep("NULL", 4))){
>
>    ## total number of lines, counted with wc (Unix-like systems only)
>    n<- as.numeric(gsub("[^0-9]","", system(paste("wc -l <", shQuote(file)), intern=TRUE)))
>
>    ncols<- length(grep("NULL", classes, invert=TRUE))
>    results<- matrix(0, nrow=n, ncol=ncols)
>    Nb<- blocks(n, nbk)
>    skip<- c(0, cumsum(Nb))
>    for(ii in seq_along(Nb)){
>      d<- read.table(file, colClasses = classes, nrows=Nb[ii], skip=skip[ii], comment.char = "")
>      if(save.inter){
>        save(d, file=paste(out, ".", ii, ".rda", sep=""))
>        }
>      print(ii)
>      results[seq(1+skip[ii], skip[ii]+Nb[ii]), ]<- as.matrix(d)
>      rm(d) ; gc()
>    }
>    save(results, file=paste(out, ".rda", sep=""))
>    invisible(results)
> }
>
> ## test<- readBlocks(file)
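
On the question of knowing the table size a priori: one possible variation, sketched here under assumptions and not part of the quoted function, is to keep a connection open and read a fixed number of lines at a time until nothing is left, so no line count is needed up front. The function name read_in_chunks and its chunk_lines argument are hypothetical, and the sketch assumes a whitespace-separated, all-numeric file without a header.

```r
# Read a numeric table in chunks from an open connection,
# stopping when the file is exhausted (no prior line count needed).
read_in_chunks <- function(path, chunk_lines = 1e5) {
  con <- file(path, open = "r")
  on.exit(close(con))
  pieces <- list()
  repeat {
    lines <- readLines(con, n = chunk_lines)
    if (length(lines) == 0L) break          # end of file reached
    chunk <- read.table(text = lines, colClasses = "numeric",
                        comment.char = "")
    pieces[[length(pieces) + 1L]] <- as.matrix(chunk)
  }
  # Stack the chunks and drop the V1, V2, ... column names.
  unname(do.call(rbind, pieces))
}

# Small self-check on a temporary 5 x 3 file, read 2 lines at a time.
tmp <- tempfile(fileext = ".txt")
expected <- matrix(as.numeric(1:15), nrow = 5, byrow = TRUE)
write.table(expected, file = tmp, row.names = FALSE, col.names = FALSE)
got <- read_in_chunks(tmp, chunk_lines = 2)
```

The trade-off is that the result cannot be preallocated, so the final rbind() briefly holds two copies of the data; with a known row count, filling a preallocated matrix as in the quoted function avoids that.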
>
> HTH,
>
> baptiste
>
>
>
> On Aug 12, 2010, at 1:34 PM, Martin Tomko wrote:
>
>    
>> Hi Peter,
>> thank you for your reply. I still cannot get it to work.
>> I have modified your code as follows:
>> rows<-length(R)
>> cols<- max(unlist(lapply(R,function(x) length(unlist(gregexpr(" ",x,fixed=TRUE,useBytes=TRUE))))))
>> c<-scan(file=f,what=rep(c(list(NULL),rep(list(0L),cols-1),rows-1)), skip=1)
>> m<-matrix(c, nrow = rows-1, ncol=cols+1,byrow=TRUE);
>>
>> the list c seems ok, with all the values I would expect. Still, length(c) gives me a value = cols+1, which I find odd (I would expect = cols).
>> I then repeated it rows-1 times (to account for the header row). The values seem ok.
>> Anyway, I tried to construct the matrix, but when I print it, the values are odd:
>>      
>>> m[1:10,1:10]
>>>        
>>       [,1] [,2]       [,3]       [,4]       [,5]       [,6]       [,7]
>> [1,] NULL Integer,15 Integer,15 Integer,15 Integer,15 Integer,15 Integer,15
>> [2,] NULL Integer,15 Integer,15 Integer,15 Integer,15 Integer,15 Integer,15
>> [3,] NULL Integer,15 Integer,15 Integer,15 Integer,15 Integer,15 Integer,15
>> [4,] NULL Integer,15 Integer,15 Integer,15 Integer,15 Integer,15 Integer,15
>> [5,] NULL Integer,15 Integer,15 Integer,15 Integer,15 Integer,15 Integer,15
>> [6,] NULL Integer,15 Integer,15 Integer,15 Integer,15 Integer,15 Integer,15
>> [7,] NULL Integer,15 Integer,15 Integer,15 Integer,15 Integer,15 Integer,15
>> [8,] NULL Integer,15 Integer,15 Integer,15 Integer,15 Integer,15 Integer,15
>> [9,] NULL Integer,15 Integer,15 Integer,15 Integer,15 Integer,15 Integer,15
>> [10,] NULL Integer,15 Integer,15 Integer,15 Integer,15 Integer,15 Integer,15
>> ....
>>
>> Any idea where the values are gone?
>> Thanks
>> Martin
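
The "Integer,15" cells in that printout are what matrix() produces when it is handed a list: each cell then holds a whole list element (here a column of 15 integers, or NULL) instead of a single value. A small sketch with made-up data, where cols is a hypothetical stand-in for the list that scan() returns:

```r
# Hypothetical stand-in for scan() output with a list `what`:
# one element per field, NULL for the skipped field.
cols <- list(NULL, 1:15, 16:30)

# matrix() on a list recycles the list elements into the cells,
# so every cell holds a whole vector (or NULL), not a single value.
m_list <- matrix(cols, nrow = 2, ncol = 3, byrow = TRUE)
typeof(m_list)           # "list"
length(m_list[[1, 2]])   # 15

# Binding the non-NULL columns side by side gives an ordinary
# integer matrix with one row per record.
m <- do.call(cbind, cols[-1])
dim(m)                   # 15 2
```

So the values are not gone; they are sitting inside the list elements, and cbind() over those elements is one way to get them out as a plain matrix.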
>>
>> Hence, I filled it into the matrix of dimensions
>>
>> On 8/12/2010 12:24 PM, peter dalgaard wrote:
>>      
>>> On Aug 12, 2010, at 11:30 AM, Martin Tomko wrote:
>>>
>>>
>>>        
>>>> c<-scan(file=f,what=list(c("",(rep(integer(0),cols)))), skip=1)
>>>> m<-matrix(c, nrow = rows, ncol=cols,byrow=TRUE);
>>>>
>>>> For some reason I end up with a character matrix, which I don't want. Is this the proper way to skip the first column? (This is not documented anywhere: how does one skip the first column in scan?) Is my way of specifying "integer(0)" correct?
>>>>
>>>>          
>>> No. Well, integer(0) is just superfluous where 0L would do, since scan only looks at the types not the contents, but more importantly, what= wants a list of as many elements as there are columns and you gave it
>>>
>>>
>>>        
>>>> list(c("",(rep(integer(0),5))))
>>>>
>>>>          
>>> [[1]]
>>> [1] ""
>>>
>>> I think what you actually meant was
>>>
>>> c(list(NULL),rep(list(0L),5))
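
A self-contained sketch of that what= specification in use, on a tiny made-up file with a header line and a label in the first field (the file contents and the names tmp, ncols, res and m are illustrative assumptions, not the real data):

```r
# Tiny stand-in file: one header line, then a row label and 3 integers per line.
tmp <- tempfile(fileext = ".txt")
writeLines(c("id a b c",
             "r1 1 2 3",
             "r2 4 5 6"), tmp)

ncols <- 3L

# One list element per field: NULL skips the label, 0L marks an integer field.
res <- scan(tmp, what = c(list(NULL), rep(list(0L), ncols)),
            skip = 1, quiet = TRUE)

length(res)   # 4: one element per field, including the skipped one
res[[1]]      # NULL

# Each remaining element is one column; bind them into an integer matrix.
m <- do.call(cbind, res[-1])
dim(m)        # 2 3
```

Note that the result is a list with one element per field, so its length is the number of data columns plus one for the skipped field.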
>>>
>>>
>>>
>>>
>>>        
>>>> And finally: would a sparse matrix package be more appropriate, and can I use a sparse matrix with the image() function to produce typical heatmaps? I have seen that some sparse matrix packages produce different-looking output, which would not be appropriate.
>>>>
>>>> Thanks
>>>> Martin
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>>          
>>>
>>>        
>>
>
>
>    


-- 
Martin Tomko
Postdoctoral Research Assistant

Geographic Information Systems Division
Department of Geography
University of Zurich - Irchel
Winterthurerstr. 190
CH-8057 Zurich, Switzerland

email: 	martin.tomko at geo.uzh.ch
site:	http://www.geo.uzh.ch/~mtomko
mob: 	+41-788 629 558
tel: 	+41-44-6355256
fax: 	+41-44-6356848


