[R] Preprocessing troublesome files in R - looking for some perl like functionality

Duncan Murdoch murdoch at stats.uwo.ca
Thu Jun 2 16:28:14 CEST 2005


Andy Bunn wrote:
> Hi all:
> 
> I have acquired 100s of data files that I need to preprocess to get them
> usable in R. The files are fixed width (to a point) and contain 1 to 3 lines
> of header, followed by a variable number of fixed width data lines (that I
> can read with read.fwf). I want to read through the files and remove every
> _line_ where the characters in columns 83-86 do not equal "STD". If I can do that
> and store it in a text file, then I can get the data I need using read.fwf.
> I can't figure out how to do this because of the irregularity of the header
> info buried in the file. It seems like the kind of thing perl or emacs would
> be good at but I'd like to do it all in R if possible. Any pointers
> appreciated.

Seems to me a couple of passes through read.fwf might work.  On the 
first pass, define three columns: one running from columns 1 to 82, 
another from 83 to 86, and another from 87 to the longest possible line 
width, all of class "character".  Read using this format, select rows 
based on the 2nd column, and write out the selected lines -- or use the 
result as input to a textConnection.
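
A rough sketch of that first pass, untested against the real files. The function name keep_tagged_lines, the max_width of 200 and the file names in the comments are made up for illustration; the 82/4 split follows the column positions given in the question. It assumes the files contain no tab characters and no blank lines, since read.fwf uses a tab as its internal separator.

```r
# Hypothetical helper: keep only the lines whose columns 83-86 hold `tag`.
# `max_width` is an assumed upper bound on line length, not a known value.
keep_tagged_lines <- function(infile, tag = "STD", max_width = 200) {
  # First pass: three character columns (1-82, 83-86, 87-max_width).
  fields <- read.fwf(infile,
                     widths = c(82, 4, max_width - 86),
                     col.names = c("left", "tag", "right"),
                     colClasses = "character",
                     comment.char = "", quote = "")
  # Fields that fall past the end of a short line come back as NA.
  fields[is.na(fields)] <- ""
  # Compare the tag column ignoring any padding spaces.
  keep <- trimws(fields$tag) == tag
  # Glue the three pieces back together to recover the original lines.
  paste0(fields$left, fields$tag, fields$right)[keep]
}

# Possible use (file names are placeholders):
#   std <- keep_tagged_lines("somefile.txt")
#   writeLines(std, "somefile_std.txt")   # then read.fwf() on that file
#   # or: read.fwf(textConnection(std), widths = ...)
```

Judging from the snippet in the question, the header lines seem to carry the same tag in those columns, so they would pass this filter too and may have to be dropped separately before the second read.fwf call.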

Duncan Murdoch
> 
> -Andy
> 
> R > version
>          _
> platform i386-pc-mingw32
> arch     i386
> os       mingw32
> system   i386, mingw32
> status
> major    2
> minor    1.0
> year     2005
> month    04
> day      18
> language R
> 
> 
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> This is a snippet of one of the data files:
> 
> 929    2 Russia   Dahurian larch 150  6946-11249 1830 1990 -
> RAW
> RUSS061830 568  11122  1 806  1 843  2 862  3 902  31244  3 986  31210
> 31074  3  RAW
> RUSS0618401369  4 937  41154  4 869  4 702  4 716  4 972  4 682  5 878  5
> 582  5  RAW
> 929    2 Russia   Dahurian larch 150  6946-11249 1830 1990 -
> STD
> RUSS061830 568  11122  1 806  1 843  2 862  3 902  31244  3 986  31210
> 31074  3  STD
> RUSS0618401369  4 937  41154  4 869  4 702  4 716  4 972  4 682  5 878  5
> 582  5  STD
> RUSS0619701158 26 906 26 954 26 746 26 629 26 858 261268 261345 261102
> 261298 26  STD
> RUSS061980 483 26 780 26 995 261273 261391 26 996 261621 26 878 261418 26
> 514 26  STD
> RUSS0619901071 269990  09990  09990  09990  09990  09990  09990  09990
> 09990  0  STD
> 929    2 Russia   Dahurian larch 150  6946-11249 1830 1990 -
> RES
> RUSS061830 604  11215  1 889  1 828  2 909  3 982  31294  3 947  31091
> 31030  3  RES
> RUSS0618401290  4 858  41057  4 917  4 712  4 824  41077  4 709  5 911  5
> 747  5  RES
> RUSS061850 873  5 994  51179  71040  71028  7 923  71120  7 846 101146 11
> 854 13  RES
> RUSS0618601609 141209 16 780 16 758 171238 171191 17 858 17 903 17 930 18
> 334 18  RES
> 929    2 Russia   Dahurian larch 150  6946-11249 1830 1990 -
> ARS
> RUSS061850 873  5 994  51179  71040  71028  7 923  71120  7 846 101146 11
> 854 13  ARS
> RUSS0618601609 141209 16 780 16 758 171238 171191 17 858 17 903 17 930 18
> 334 18  ARS



