[R] data input strategy - lots of csv files

Steve Miller steve.miller at jhu.edu
Thu May 11 12:59:46 CEST 2006


Why torture yourself and probably get bad performance in the process? You
should handle the data consolidation in Python or Ruby, which are much
better suited to this type of task, and pipe the results to R.
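
A minimal sketch of that hand-off, assuming a hypothetical consolidate.py
script that writes the merged CSV to standard output (pipe() and read.csv
are base R; the script name is made up):

## read the output of an external consolidation script directly,
## without writing an intermediate file (consolidate.py is hypothetical)
merged <- read.csv(pipe("python consolidate.py"), header = TRUE)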

Steve Miller

-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch
[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Liaw, Andy
Sent: Thursday, May 11, 2006 5:50 AM
To: seanpor at acm.org; r-help
Subject: Re: [R] data input strategy - lots of csv files

This is what I would try:

csvlist <- list.files(pattern = "csv$")
bigblob <- lapply(csvlist, read.csv, header = FALSE)
## Get all dates that appear in any one of the files,
## as character so they can serve as row names.
all.dates <- unique(unlist(lapply(bigblob, function(x) as.character(x[[1]]))))
bigdata <- matrix(NA, length(all.dates), length(bigblob))
dimnames(bigdata) <- list(all.dates, whatevercolnamesyouwant)
## loop through bigblob and populate the corresponding columns
## of bigdata with the matching dates.
for (i in seq(along = bigblob)) {
    bigdata[as.character(bigblob[[i]][, 1]), i] <-
        bigblob[[i]][, columnwithdata]
}
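
If the end result should be a data frame with the date as an ordinary
first column (as Sean describes), the matrix can be converted afterwards;
a small untested addition in the same spirit:

## turn the row names (dates) back into a regular column
bigframe <- data.frame(date = rownames(bigdata), bigdata, row.names = NULL)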

This is obviously untested, so I hope it's of some help.
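
An alternative that stays closer to Sean's merge() approach but avoids the
eval(parse()) gymnastics is to keep the data frames in a list and merge in
an ordinary loop. Another untested sketch, assuming one value column per
file and reusing the file name as the column name:

csvlist <- list.files(pattern = "csv$")
blobs <- lapply(csvlist, read.csv, header = FALSE)
## give each value column a unique name so repeated merges
## don't collide on the default V2 name
for (i in seq(along = blobs)) names(blobs[[i]])[2] <- csvlist[i]
atot <- blobs[[1]]
for (i in 2:length(blobs)) {
    atot <- merge(atot, blobs[[i]], by = 1, all = TRUE)
}

This yields the same wide layout, with NA wherever a file has no entry for
a given date, and no variable names built up as strings.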

Andy

From: Sean O'Riordain
> 
> Good morning,
> I currently have 63 .csv files, most of which have lines that
> look like
>   01/06/05,23445
> though some files have two numbers beside each date.  There
> are missing values, and currently the longest file has 318 rows.
> 
> (merge() is losing the head and doing runaway memory
> allocation - but that's another question - I'm still trying to
> pin that issue down and make a small reproducible example.)
> 
> Currently I'm reading in these files with lines like
>   a1 <- read.csv("daft_file_name_1.csv",header=F)
>   ...
>   a63 <- read.csv("another_silly_filename_63.csv",header=F)
> 
> and then I'm naming the columns in these like...
>   names(a1)[2] <- "silly column name"
>   ...
>   names(a63)[2] <- "daft column name"
> 
> then trying to merge()...
>   atot <- merge(a1, a2, all=T)
> and then using language manipulation to loop
>   atot <- merge(atot, a3, all=T)
>   ...
>   atot <- merge(atot, a63, all=T)
> etc...
> 
> followed by more language manipulation
> for() {
>   rm(a1)
> } etc...
> 
> i.e.
> for (i in 2:63) {
>     atot <- merge(atot, eval(parse(text=paste("a", i, sep=""))), all=T)
>     # eval(parse(text=paste("a",i,"[1] <- NULL",sep="")))
> 
>     cat("i is ", i, gc(), "\n")
> 
>     # now delete these 63 temporary objects...
>     # e.g. should look like rm(a33)
>     eval(parse(text=paste("rm(a",i,")", sep="")))
> }
> 
> eventually getting a data frame whose first column is
> the date and whose subsequent 63 columns are the data,
> with missing values coded as NA.
> 
> So my question is: is there a better strategy for reading
> in lots of small files like these (only a few kB each), which
> are time series with missing data, that avoids the above
> awkwardness (and language manipulation) but still ends up
> with a nice data.frame with NA values correctly coded?
> 
> Many thanks,
> Sean O'Riordain
> 

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html



