[R] How to pre-process fwf or csv files to remove unexpected characters in R?

Jim Lemon drjimlemon at gmail.com
Sun Nov 6 22:56:40 CET 2016


Hi Lucas,
This is a rough outline of something I programmed in C years ago for
data cleaning. The basic idea is to read the file line by line and
check each line for a problem (in the initial application this was a
discrepancy between two lines that were supposed to be identical).
Here, if the line is the wrong length (optional) or contains an
unwanted character (specified either as the set of acceptable
characters or as the set of unacceptable ones), the line is displayed
in an editor in which the user can fix it manually. Each line,
corrected if necessary, is written to a new file, which replaces the
old one at the end. For files in which bad lines are uncommon, this
worked very well, as the user only had to deal with the errors. It is
only useful for files whose lines are meant to contain nothing but
printable characters. Note that this is only a sketch and I have not
tested it.

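cleanFile<-function(filename,llength=NA,goodchars="^[[:print:]]*$",badchars=NA) {
 # file() raises an error if the file cannot be opened, so trap it
 infile<-tryCatch(file(filename,open="r"),error=function(e) NULL)
 if(inherits(infile,"connection")) {
  # write next to the input file, prefixing the file name with "cF"
  outname<-file.path(dirname(filename),paste("cF",basename(filename),sep=""))
  outfile<-file(outname,"w")
  repeat {
   nextline<-readLines(infile,1)
   # readLines returns a zero-length vector at end of file
   if(length(nextline)==0) break
   # wrong length (only checked if llength is given)
   if(!is.na(llength) && nchar(nextline) != llength) nextline<-edit(nextline)
   # contains something outside the acceptable set
   if(!grepl(goodchars,nextline)) nextline<-edit(nextline)
   # contains an unacceptable character (only checked if badchars is given)
   if(!is.na(badchars) && grepl(badchars,nextline)) nextline<-edit(nextline)
   writeLines(nextline,outfile)
  }
  close(infile)
  close(outfile)
  # replace the old file with the cleaned one
  file.remove(filename)
  file.rename(outname,filename)
 } else {
  cat("Cannot open",filename,"\n")
 }
}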
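
If the files are too big to fix by hand, the same checks could be run
without the editor, sending the lines that pass to one file and the
lines that fail to another. The following is only a sketch under that
assumption: splitFile, goodname, badname and chunksize are names made
up for illustration, and it reads the file in chunks so that it is
never all in memory at once.

```r
# Sketch only: split a text file into valid and invalid lines, chunk by chunk.
# splitFile, goodname, badname and chunksize are hypothetical names.
splitFile<-function(filename,llength=NA,goodchars="^[[:print:]]*$",
 goodname=paste(filename,"good",sep="."),
 badname=paste(filename,"bad",sep="."),chunksize=10000) {
 infile<-file(filename,open="r")
 goodfile<-file(goodname,"w")
 badfile<-file(badname,"w")
 # make sure all connections are closed even if something fails
 on.exit({close(infile); close(goodfile); close(badfile)})
 nbad<-0
 repeat {
  # read at most chunksize lines at a time
  chunk<-readLines(infile,chunksize,warn=FALSE)
  if(length(chunk)==0) break
  # a line is valid if it only has acceptable characters ...
  ok<-grepl(goodchars,chunk)
  # ... and, if llength is given, has exactly that many characters
  # (allowNA=TRUE gives NA instead of an error for undecodable lines)
  if(!is.na(llength)) ok<-ok & nchar(chunk,allowNA=TRUE) %in% llength
  writeLines(chunk[ok],goodfile)
  writeLines(chunk[!ok],badfile)
  nbad<-nbad+sum(!ok)
 }
 # return the number of rejected lines invisibly
 invisible(nbad)
}
```

Something like splitFile("data.txt",llength=80) (file name and width
invented here) would leave the original untouched and write
data.txt.good and data.txt.bad beside it. The bad file should be small
enough to inspect, or to pass through the interactive function above.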

Jim


On Mon, Nov 7, 2016 at 12:36 AM, Lucas Ferreira Mation
<lucasmation at gmail.com> wrote:
> I have some large .txt files (~100GB) containing a dataset in fixed-width
> format. These contain some errors:
> - non-numeric characters in columns that are supposed to be numeric,
> - invalid characters
> - rows with too many characters, possibly due to invalid characters or some
> missing end of line character (so two rows in the original data become one
> row in the .txt file).
>
> The errors are not very frequent, but stop me from importing with
> readr::read_fwf()
>
>
> Is there some package, or workflow, in R to pre-process the files,
> separating the valid from the not-valid rows into different files? This can
> be done by ETL point-click tools, such as Pentaho PDI. Is there some
> equivalent code in R to do this?
>
> I googled it and could not find a solution. I also asked this in
> StackOverflow and got no answer (here
> <http://stackoverflow.com/questions/39414886/fix-errors-in-csv-and-fwf-files-corrupted-characters-when-importing-to-r>
> ).
>
> regards
> Lucas Mation
> IPEA - Brasil