[R] A "subscript out of bounds" and "write.table" problem when manipulating a large dataset

Yong Wang wangyong1 at gmail.com
Mon May 21 22:28:41 CEST 2007

Dear all:

Described below is a problem with a large data set (more than 2 GB
after unzipping, tab delimited). I know R is not the appropriate tool
for such a task; anyway, I ran it on a server and ran into some
straightforward problems.

1. The first is that count.fields can count all the rows; however,
when I tried to remove rows beyond about 3/5 of the way through the
data, R says "subscript out of bounds". Is there any option that
limits the maximum size R can read in?

2. Due to careless coding I rewrote the original data to a new file,
and I find that the rewritten tab-delimited file does not match the
original file.
I tried the code on a small data set, attached at the end, and there
was no problem at all with such a small data set.

I would appreciate any tips and suggestions on how to remove the
unwanted rows in such a large dataset.

Finally, thanks to everyone who answered the tab-delimited problem I
raised yesterday.

### code as follows ###

data.mm <- read.table(file, header=TRUE, sep="\t", fill=TRUE)  # read in the large file
cf <- count.fields(file, sep="\t")  # count the fields in each line
n <- 23                 # the CORRECT number of fields per row, i.e. the number of variable names
del <- which(cf != n)   # lines whose number of fields is not equal to 23
del <- del - 1          # cf also counts the header line; -1 gives the data rows I want to remove

data.mm <- data.mm[-del, ]  # try to remove the rows whose number of fields is not 23
                            ### PROBLEM: R says "subscript out of bounds"

write.table(data.mm, "mm_0206.txt", sep="\t",
            quote=FALSE, row.names=FALSE)  # since data.mm <- data.mm[-del, ] aborted,
                                           # this writes the original data as mm_0206.txt
### PROBLEM: the following two calls should then give the same output

table(cf)                                     # the maximum number of fields is 23
table(count.fields("mm_0206.txt", sep="\t"))  # here the maximum number of fields is larger
        # than 23, and the other counts differ as well:
        # for example, the original data has x rows with 10 fields,
        # the rewritten data has y rows with 10 fields.
        # If the original file is not rewritten correctly, a file with
        # rows of equal length will probably not be written properly either,
        # supposing data.mm <- data.mm[-del, ] gets executed successfully.
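
One possible source of the mismatch between the two tables, offered as
a sketch rather than a diagnosis: with fill=TRUE, read.table pads short
rows with NA, so write.table emits every row with the full number of
columns, and the field counts of the copy cannot reproduce those of the
original. The toy file, its column names, and the choice of quote=""
and comment.char="" in every call are assumptions made for illustration
and are not part of the code above.

```r
# Toy tab-delimited file: the second data row is one field short.
orig <- tempfile(fileext = ".txt")
writeLines(c("V1\tV2\tV3",
             "1\ta\t10",
             "2\tb",
             "3\tc\t30"), orig)

# Use the same quote/comment settings in every call so that all three
# functions split lines in the same way (an assumption, see above).
dat <- read.table(orig, header = TRUE, sep = "\t", fill = TRUE,
                  quote = "", comment.char = "")
cf_orig <- count.fields(orig, sep = "\t", quote = "", comment.char = "")

# Sanity check before indexing rows by line number:
# cf_orig includes the header line, so it should be one longer than dat.
rows_match <- nrow(dat) == length(cf_orig) - 1L

# Write the padded data back out and count the fields again.
copy <- tempfile(fileext = ".txt")
write.table(dat, copy, sep = "\t", quote = FALSE, row.names = FALSE)
cf_copy <- count.fields(copy, sep = "\t", quote = "", comment.char = "")

table(cf_orig)  # one line has only 2 fields
table(cf_copy)  # every line now has 3 fields, because the gap became "NA"
```

If rows_match is FALSE on the real data, the line numbers in del do not
correspond to rows of data.mm, and the row removal should not be
trusted even when it runs without an error.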

#### experimental data set as follows ###

V1	V2	V3	v4	v5	v6	v7	v8	v9
11	1	desc	A	1	34	1-Sep-00	1	first mid last
12	2	desc	B	6	56	2-Sep-00	1	First last
13	3	desc	A	7	32	3-Sep-00	1	last
14	4	desc	4-Sep-00	0	first mid last
15	5	desc	A	2	.	5-Sep-00	1	first mid last
16	6	desc	B	9	3	6-Sep-00	0	last
17	7		A	6	65	7-Sep-00	first last
18	8	desc	B	2	.	8-Sep-00	0	last
19	9	desc	A	8	56	9-Sep-00	1	first last
20	10	desc	B	5	89	10-Sep-00	0	first last
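
A minimal sketch of one way to drop the malformed rows without loading
the whole file, assuming a field is simply whatever lies between two
tabs (no quoting, no comment characters). It filters the raw lines in
chunks, so no row index from count.fields has to be matched against a
data frame. The helper name filter_by_fields, its arguments, and the
chunk size are hypothetical, and the demonstration uses only a few rows
shaped like the data set above.

```r
# Hypothetical helper (not from the original post): copy the lines of `infile`
# that have exactly `n` tab-separated fields to `outfile`, reading `chunk`
# lines at a time so the whole file never has to fit in memory.
filter_by_fields <- function(infile, outfile, n, chunk = 100000L) {
  con_in  <- file(infile, open = "r")
  con_out <- file(outfile, open = "w")
  on.exit({ close(con_in); close(con_out) })
  kept <- 0L
  repeat {
    lines <- readLines(con_in, n = chunk)
    if (length(lines) == 0L) break
    # Fields per line = number of tabs + 1; quotes and '#' get no special treatment.
    tabs <- lengths(regmatches(lines, gregexpr("\t", lines, fixed = TRUE)))
    keep <- (tabs + 1L) == n
    writeLines(lines[keep], con_out)
    kept <- kept + sum(keep)
  }
  kept  # number of lines written, header line included
}

# Small demonstration: header plus four rows, two of them too short.
sample_lines <- c(
  "V1\tV2\tV3\tv4\tv5\tv6\tv7\tv8\tv9",
  "11\t1\tdesc\tA\t1\t34\t1-Sep-00\t1\tfirst mid last",
  "14\t4\tdesc\t4-Sep-00\t0\tfirst mid last",
  "17\t7\t\tA\t6\t65\t7-Sep-00\tfirst last",
  "20\t10\tdesc\tB\t5\t89\t10-Sep-00\t0\tfirst last"
)
infile  <- tempfile(fileext = ".txt")
outfile <- tempfile(fileext = ".txt")
writeLines(sample_lines, infile)

n_kept  <- filter_by_fields(infile, outfile, n = 9)
cleaned <- read.table(outfile, header = TRUE, sep = "\t",
                      quote = "", comment.char = "")
```

For the large file the same call would use n = 23; the cleaned file can
then be read with read.table without fill=TRUE.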
