[R] How to pre-process fwf or csv files to remove unexpected characters in R?

David Winsemius dwinsemius at comcast.net
Sun Nov 6 17:16:19 CET 2016


> On Nov 6, 2016, at 5:36 AM, Lucas Ferreira Mation <lucasmation at gmail.com> wrote:
> 
> I have some large .txt files (~100GB) containing a dataset in fixed-width
> format. These contain some errors:
> - character values in columns that are supposed to be numeric,
> - invalid characters
> - rows with too many characters, possibly due to invalid characters or a
> missing end-of-line character (so two rows in the original data become one
> row in the .txt file).
> 
> The errors are not very frequent, but they stop me from importing with
> readr::read_fwf().
> 
> 
> Is there some package, or workflow, in R to pre-process the files,
> separating the valid from the invalid rows into different files? This can
> be done with point-and-click ETL tools, such as Pentaho PDI. Is there some
> equivalent code in R to do this?
> 
> I googled it and could not find a solution. I also asked this on
> StackOverflow and got no answer (here
> <http://stackoverflow.com/questions/39414886/fix-errors-in-csv-and-fwf-files-corrupted-characters-when-importing-to-r>
> ).

Had I seen it there I would have voted to close that SO question (and just did) as too broad; it is also too vague, since "corrupted characters" is never defined, and it is basically a request for a package recommendation, which is likewise off-topic on SO.

For the csv part of a smaller-file task (which you didn't repeat here), I would have pointed you to this answer:

http://stackoverflow.com/questions/19082490/how-can-i-use-r-to-find-malformed-rows-and-fields-in-a-file-too-big-to-read-into/19083665#19083665
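
In the spirit of that answer, one common way to find malformed csv rows is to tabulate the number of fields per line and pull out the rows that deviate from the dominant count. A minimal sketch, assuming a comma-separated file that fits in RAM ("data.csv" is a placeholder):

n_fields <- count.fields("data.csv", sep = ",")  # fields per line; NA if a quote spans lines
table(n_fields)                                  # the dominant count is the valid one
valid_n <- as.integer(names(which.max(table(n_fields))))
suspect <- which(n_fields != valid_n | is.na(n_fields))
readLines("data.csv")[suspect]                   # inspect the malformed rows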

For the fwf part (in a file that fits into RAM), I would have suggested wrapping table(nchar( . )) around readLines(filename) (readLines takes a connection or filename as its first argument, `con`; there is no `file=` argument), and then drilling down with which( nchar( . ) == <chosen_line_length> ).
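
Spelled out, a minimal sketch (the file names are placeholders, and 160 stands in for whatever width table() shows as the dominant line length):

lines <- readLines("data.txt")             # one string per record
table(nchar(lines))                        # the dominant length is the valid width
width <- 160                               # hypothetical valid record width
ok <- nchar(lines) == width
writeLines(lines[!ok], "bad_rows.txt")     # inspect or repair these
writeLines(lines[ok], "good_rows.txt")     # then point read_fwf() at this file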

I believe searching the R-help archives will bring up examples of how to handle file input in chunks, which should allow you to cobble together a strategy if you insist on using R ... the wrong tool. If you need to narrow your R-help archive search, I suggest adding the names "Jim Holtman", "William Dunlap", or "Gabor Grothendieck", since they frequently have the most elegant strategies, in my opinion.

Here's a search strategy implemented via MarkMail:
http://markmail.org/search/?q=list%3Aorg.r-project.r-help+file+chunks+readlines
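
In case the archives don't turn up a ready-made example, the chunked version of the line-length filter above might look something like this (again, the file names and width are placeholders, and the chunk size should be tuned to your RAM):

infile <- file("data.txt", open = "r")
good <- file("good_rows.txt", open = "w")
bad <- file("bad_rows.txt", open = "w")
width <- 160                             # hypothetical valid record width
repeat {
  chunk <- readLines(infile, n = 1e6)    # one million lines per pass
  if (length(chunk) == 0) break          # end of file
  ok <- nchar(chunk) == width
  writeLines(chunk[ok], good)
  writeLines(chunk[!ok], bad)
}
close(infile); close(good); close(bad)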

But for files of the size you contemplate, I would suggest using a database, awk, or other software designed for streaming processing from disk. R is not so designed.

-- 
David.


> 
> regards
> Lucas Mation
> IPEA - Brasil

David Winsemius
Alameda, CA, USA


