[R] Reading a specific column of a csv file in a loop

Tue Nov 8 11:47:26 CET 2011

2011/11/8 Sergio René Araujo Enciso <araujo.enciso at gmail.com>:
> Dear all:
>
> I have two larges files with 2000 columns. For each file I am
> performing a loop to extract the "i"th element of each file and create
> a data frame with both "i"th elements in order to perform further
> analysis. I am not extracting all the "i"th elements but only certain
> which I am indicating on a vector called "d".
>
> See  an example of my  code below
>
> ### generate an example for the CSV files, the original files contain
> more than 2000 columns, here for the sake of simplicity they have only
> 10 columns
> M1<-matrix(rnorm(1000), nrow=100, ncol=10,
> dimnames=list(seq(1:100),letters[1:10]))
> M2<-matrix(rnorm(1000), nrow=100, ncol=10,
> dimnames=list(seq(1:100),letters[1:10]))
> write.table(M1, file="M1.csv", sep=",")
> write.table(M2, file="M2.csv", sep=",")
>
> ### the vector containing the "i" elements to be read
> d<-c(1,4,7,8)
> P1<-read.table("M1.csv", header=TRUE)
> P2<-read.table("M1.csv", header=TRUE)
> for (i in d) {
> M<-data.frame(P1[i],P2[i])
> rm(list=setdiff(ls(),"d"))
> }
>
> As the files are quite large, I want to include "read.table" within
> the loop so as it only read the "i"th element. I know that there is
> the option "colClasses" for which I have to create a vector with zeros
> for all the columns I do not want to load. Nonetheless I have no idea
> how to make this vector to change in the loop, so as the only element
> with no zeros is the "i"th element following the vector "d". Any ideas
> how to do this? Or is there anz other approach to load only an
> specific element?
>

Its a bit messy if there are row names so lets generate M1.csv like this:

write.csv(M1, file = "M1.csv", row.names = FALSE)

Then we can do this:

nc <- ncol(read.csv("M1.csv", nrows = 1))
colClasses <- replace(rep("NULL", nc), d, NA)
M1.subset <- read.csv("M1.csv", colClasses = colClasses)

or using the same M1.csv that we just generated try this which uses
sqldf with the H2 backend:

library(sqldf)
library(RH2)

M0 <- read.csv("M1.csv", nrows = 1)[0L, ]
M1.subset.h2 <- sqldf(c("insert into M0 (select * from csvread('M1.csv'))",
        "select a, d, g, h from M0"))

This is referred to as Alternative 3 in FAQ#10 Example 6a on the sqldf
home page:
http://sqldf.googlecode.com
Alternative 1 and Alternative 2 listed there could also be tried.

(Note that although sqldf has a read.csv.sql command we did not use it
here since that command only works with the sqlite back end and the
RSQLite driver has a max of 999 columns.)

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com