[R] Need help to read the data file like this

David Winsemius dwinsemius at comcast.net
Sun Jul 14 20:34:35 CEST 2013


On Jul 14, 2013, at 10:57 AM, David Winsemius wrote:

> 
> On Jul 14, 2013, at 9:48 AM, Houhou Li wrote:
> 
>> Hi,
>> 
>> I have several really big data files in csv format like this: the first line is the header; the second through fourth lines contain information about the file and are the lines I need to skip (the data in lines 2-4 do not correspond to the variable names in the header); real data begins on the fifth line; and the last line is not a data line but the string "Done" instead of a normal EOF. All data are numeric. I tried read.table(), read.csv() with colClasses="numeric", and scan(), but could not make them work. Can anyone help me? How can I get rid of the last line "Done" automatically? I would like to do this with an R script, not by formatting in Excel and reading back into R. Thank you very much; here is an example of the data:
> 
> Deleting the last line in Excel would not make sense unless the data are already in Excel. Better would be to use a text editor; it is less likely to corrupt the data.
> 
>> 
>> Tag,X,Y,BlobRegion,swaths,fr_int_20,fr_int_60,i60,RawTothgt,RawHtlc,RawRad20,RawRad40,RawRad60,RawRad80,CCV,BlobPerim,n_pts,n_pts_i255,vts,vts2,vtg,home,sum_ht,sum_ht_sq,dcch,dcch2,nb_ccv,n_nb,nb_sum_hts,nb_sum_hts2,z_tip_dist,nb_MassLen,n_f_rtns20,n_f_rtns60,max_fl_pt_count,loreyrawht,p00ile_cm,p25ile_cm,p50ile_cm,p75ile_cm,iq25,iq50,iq75,mean_intns
>> 01_24_2013.001,SF12
>>        5413
>>   509627.82,  4869704.98,   509999.83,  4869999.98
>> 123,509692.55,4869856.64,18,0,80.53,81.03,84,36.2100,17.1521,4.0359,4.0359,3.8881,2.9217,1737.13,31.42,210,210,0.828,0.955,0.281,28.50,5746.46,163727.12,0.764,1.000,1147.23,33,769.16,19024.42,0.01,0.09,174,163,174,34.90,140,2369,2849,3157,33,81,110,71.59
>> 159,509679.19,4869855.54,18,0,77.62,78.97,75,30.4000,11.2000,2.5319,2.5129,2.3365,1.8315,3248.82,21.42,90,90,0.877,0.936,0.589,22.91,2000.74,46861.45,0.691,0.999,1772.06,14,365.47,10233.32,0.04,0.68,81,66,81,33.29,905,1869,2272,2633,55,82,98,71.62
> 
> Read the first line with readLines using n=1, saving it as 'colnams'.
> Read the data with dat <- read.table( ..., skip=4, sep=",", fill=TRUE ).
> Delete the last row, which holds "Done" and a large number of NAs.
> names(dat) <- scan(text=colnams, what=character(0), sep="," )
> 
> (Tested. Expected results achieved.)

 Lines <- "Tag,X,Y,BlobRegion,swaths,fr_int_20,fr_int_60,i60,RawTothgt,RawHtlc,RawRad20, RawRad40,RawRad60,RawRad80,CCV,BlobPerim,n_pts,n_pts_i255,vts,vts2,vtg,home,sum_ht, sum_ht_sq,dcch,dcch2,nb_ccv,n_nb,nb_sum_hts,nb_sum_hts2,z_tip_dist,nb_MassLen, n_f_rtns20,n_f_rtns60,max_fl_pt_count,loreyrawht,p00ile_cm,p25ile_cm,p50ile_cm, p75ile_cm,iq25,iq50,iq75,mean_intns
 01_24_2013.001,SF12
         5413
    509627.82,  4869704.98,   509999.83,  4869999.98
 123,509692.55,4869856.64,18,0,80.53,81.03,84,36.2100,17.1521,4.0359,4.0359,3.8881,2.9217,1737.13,31.42,210,210,0.828,0.955,0.281,28.50,5746.46,163727.12,0.764,1.000,1147.23,33,769.16,19024.42,0.01,0.09,174,163,174,34.90,140,2369,2849,3157,33,81,110,71.59
 159,509679.19,4869855.54,18,0,77.62,78.97,75,30.4000,11.2000,2.5319,2.5129,2.3365,1.8315,3248.82,21.42,90,90,0.877,0.936,0.589,22.91,2000.74,46861.45,0.691,0.999,1772.06,14,365.47,10233.32,0.04,0.68,81,66,81,33.29,905,1869,2272,2633,55,82,98,71.62
 Done"
 colnams <- readLines(textConnection(Lines), n=1)  # read just the header line
 scan(text=colnams, what=character(0), sep="," ) # check scan code
# snipped
 dat <- read.table( text=Lines, skip=4, sep=",", fill = TRUE)  # skip header + 3 info lines; fill pads the short "Done" line
 dat <- dat[-NROW(dat), ]                                      # drop the trailing "Done" row
 names(dat) <- scan(text=colnams, what=character(0), sep="," )
# Read 44 items
 dat
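
For one of the actual files on disk, here is a minimal sketch of the same idea (the file name "mydata.csv" is only a placeholder for one of your files, and the layout is assumed to match the example above); it drops the "Done" line before parsing, so colClasses="numeric" can also be used:

 fname <- "mydata.csv"                        # hypothetical file name
 txt <- readLines(fname)
 txt <- txt[ !grepl("^\\s*Done\\s*$", txt) ]  # discard the trailing "Done" line
 # line 1 is the header; lines 2-4 are file metadata and are dropped
 dat <- read.csv(text = paste(txt[-(2:4)], collapse = "\n"),
                 colClasses = "numeric")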

> -- 
> David
> 
> 
> David Winsemius
> Alameda, CA, USA

David Winsemius
Alameda, CA, USA


