[R] Read text file subsetting rows

Zev Ross zev at zevross.com
Fri Apr 11 22:35:50 CEST 2008


Chuck,

Thanks so much, these both work like a charm. The first method, though, 
is very, very slow for a large dataset (>100,000 rows), while the second is 
reasonable in terms of speed. If you or anyone else has ideas for 
speeding up the import, send them my way; otherwise the following:

con2 <- pipe( 'grep "^RD" tmp.dat' )
dat2 <- read.csv( con2, sep='|', header=FALSE)

works well!
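
One thing that might help with speed, offered as a sketch rather than something timed on the full file: ?read.table notes that supplying colClasses lets R skip guessing each column's type, which can make reading noticeably faster. The example writes a few of the sample rows to a temporary file so it runs on its own. It assumes 'grep' is on the path, and reading every column as character (then converting V13 by hand) is an assumption made for illustration, not a known layout of the real file.

```r
# Write a few of the sample rows to a temporary file so the sketch is self-contained.
rows <- c(
  "RD|I|01|073|0023|68103|5|7|017|810|20070103|00:00|0.6||3|||||||||||||",
  "RA|I|01|073|0023|A334|5|7|017|810|20070118|00:00|3.7||3|||||||||||||",
  "RD|I|01|073|0023|68103|5|7|017|810|20070121|00:00|6.9||3|||||||||||||"
)
tmp <- tempfile(fileext = ".dat")
writeLines(rows, tmp)

# Let grep do the row filtering, and tell read.csv the column types up front.
# colClasses = "character" is recycled to every column (an assumption for
# illustration); it also keeps codes such as "01" from being converted to 1.
con <- pipe(sprintf("grep '^RD' %s", shQuote(tmp)))
dat <- read.csv(con, sep = "|", header = FALSE, colClasses = "character")

# Convert only the columns that are actually needed as numbers.
dat$V13 <- as.numeric(dat$V13)
```

Whether this is faster in practice would need checking with system.time() on the real data.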

Thank you,

Zev

Charles C. Berry wrote:
> On Fri, 11 Apr 2008, Zev Ross wrote:
>
>> Hi All,
>>
>> Can anyone direct me to a read function in R that will allow me to only
>> read in rows of a text file that begin with a particular value such as
>> the data below. I would read the entire file in and then limit, but the
>> files were constructed such that the first two letters determine how
>> many variables are in the row (different letters mean different numbers
>> of columns and different column names/types).
>>
>> I can do this in SAS, but I'd prefer to use R. The approximate SAS code
>> is below with the key piece of code being "if rectype='RD'" then do.
>>
>> Thoughts?
>
> If your data are in 'tmp.dat':
>
>> txt <- readLines( "tmp.dat" )
>> con <- textConnection( grep( "^RD", txt, value=TRUE ) )
>> dat <- read.csv( con, sep='|', header=FALSE)
>> close(con)
>> summary( dat[ , 1:3 ] )
>   V1     V2          V3
>  RD:6   I:6   Min.   :1
>               1st Qu.:1
>               Median :1
>               Mean   :1
>               3rd Qu.:1
>               Max.   :1
>
> Alternatively, if you have 'grep' in your system and in the path:
>
>> con2 <- pipe( 'grep "^RD" tmp.dat' )
>> dat2 <- read.csv( con2, sep='|', header=FALSE)
>>
>
>
> See
> ?connection
> ?textConnection
> ?grep
>
> HTH,
>
> Chuck
>>
>> Zev
>>
>>
>> RD|I|01|073|0023|68103|5|7|017|810|20070103|00:00|0.6||3|||||||||||||
>> RD|I|01|073|0023|68103|5|7|017|810|20070106|00:00|9.5||3|||||||||||||
>> RD|I|01|073|0023|68103|5|7|017|810|20070109|00:00|2.5||3|||||||||||||
>> RD|I|01|073|0023|68103|5|7|017|810|20070112|00:00|13.7||3|||||||||||||
>> RD|I|01|073|0023|68103|5|7|017|810|20070115|00:00|7.3||3|||||||||||||
>> RA|I|01|073|0023|A334|5|7|017|810|20070118|00:00|3.7||3|||||||||||||
>> RD|I|01|073|0023|68103|5|7|017|810|20070121|00:00|6.9||3|||||||||||||
>> RC|I|01|073|0023|Quer|5|7|017|810|20070124|00:00|1.8||3|||||||||||||
>>
>>
>> infile 'C:\junk\RD_501_88101_2006-0.txt'
>> dlm='|' firstobs=3 missover;
>> rectype $2. @;
>> if rectype = 'RD' then do;
>>
>> -- 
>> Zev Ross
>> ZevRoss Spatial Analysis
>> 303 Fairmount Ave
>> Ithaca, NY 14850
>> 607-277-0004 (phone)
>> 866-877-3690 (fax, toll-free)
>> zev at zevross.com
>>
>
> Charles C. Berry (858) 534-2098
> Dept of Family/Preventive Medicine
> E mailto:cberry at tajo.ucsd.edu UC San Diego
> http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901
>
>
>
>
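
For the broader case in the original question, where the first two letters decide how many columns a row has, a possible sketch is to read the file once and build one data frame per record type. read_by_rectype is a hypothetical helper name, the toy rows are shortened stand-ins for the real records, and reading everything as character is an assumption for illustration.

```r
# Hypothetical helper: read the file once, then parse each record type separately.
read_by_rectype <- function(path, sep = "|") {
  txt <- readLines(path)
  rectype <- substr(txt, 1, 2)   # first two characters identify the record type
  # split() groups the raw lines by record type; each group is parsed on its own,
  # so groups with different numbers of columns do not interfere with each other.
  lapply(split(txt, rectype), function(lines) {
    read.table(text = lines, sep = sep, header = FALSE,
               colClasses = "character", quote = "", comment.char = "",
               fill = TRUE)
  })
}

# Usage with shortened, made-up rows: RA rows have one more field than RD rows.
tmp <- tempfile(fileext = ".dat")
writeLines(c("RD|I|01|0.6", "RA|I|01|A334|x", "RD|I|01|9.5"), tmp)
res <- read_by_rectype(tmp)
names(res)   # one list element per record type
```

This keeps everything inside R (no external grep), at the cost of holding all lines in memory via readLines, so it may be slow for very large files.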

-- 
Zev Ross
ZevRoss Spatial Analysis
303 Fairmount Ave
Ithaca, NY 14850
607-277-0004 (phone)
866-877-3690 (fax, toll-free)
zev at zevross.com


