[R] How to pre-process fwf or csv files to remove unexpected characters in R?

Lucas Ferreira Mation lucasmation at gmail.com
Sun Nov 6 23:07:21 CET 2016


Thank you Bert, Jeff and David for the great answers.
Let me provide more context to clarify the question:

- I am running this on a large server (512GB of RAM), so the data still
fits into memory (and I also know how to process it in chunks if
necessary).
- I agree that a DBMS or other software would be better suited, but our
team (part of a government body) is mostly comprised of statisticians
and economists who know R but not much of other languages, and who lack
admin permissions to set up a DBMS. IT support is available but much
scarcer. So a pure R solution may be useful, and I want to understand
how much worse or more impractical it is. Performance is a secondary
concern.
- I haven't provided more concrete examples of the problems because I
did not even know how to find the lines that contain errors. With the
suggestions given, I think I will now be able to do that.


A mockup of the ideal function I dream of having is something like:

clean_my_fwf <- function(infile, outfile, col_positions, col_types){
  - import the data from infile
  - in each row, check whether the content at each character position
    corresponds to the expected content of that variable, given
    col_positions and col_types. This must be true for all columns
  - write all rows that pass this test to outfile
  - return(vector_of_error_row_numbers): the row numbers (in the
    original dataset) of all rows that fail the test
}

where col_positions and col_types follow the syntax of
readr::read_fwf. Ideally this could be parallelized and C++ based; a
minimal pure-R sketch follows below.
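
A rough sketch of that idea (untested; it assumes the whole file fits
in memory, that col_positions is a data frame with begin/end columns,
and that col_types is a vector of "n"/"c" codes -- my own
simplification, not the readr spec):

clean_my_fwf <- function(infile, outfile, col_positions, col_types) {
  lines <- readLines(infile)
  # a well-formed row must have exactly the expected total width
  ok <- nchar(lines) == max(col_positions$end)
  for (j in seq_len(nrow(col_positions))) {
    field <- substr(lines, col_positions$begin[j], col_positions$end[j])
    if (col_types[j] == "n") {
      # numeric fields: digits, spaces, sign and decimal point only
      ok <- ok & grepl("^[0-9 .+-]*$", field)
    } else {
      # character fields: reject rows containing non-printable bytes
      ok <- ok & !grepl("[^[:print:]]", field)
    }
  }
  writeLines(lines[ok], outfile)
  which(!ok)  # row numbers (in the original file) of the failing rows
}

If the file contains invalid bytes, readLines(infile, skipNul = TRUE)
and grepl(..., useBytes = TRUE) may be needed. For the full ~100GB
files the same checks could run chunk by chunk, via readLines(con,
n = 1e6) on an open connection, instead of reading everything at once.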

2016-11-06 14:16 GMT-02:00 David Winsemius <dwinsemius at comcast.net>:
>
>> On Nov 6, 2016, at 5:36 AM, Lucas Ferreira Mation <lucasmation at gmail.com> wrote:
>>
>> I have some large .txt files (about 100GB) containing a dataset in
>> fixed-width format. They contain some errors:
>> - non-numeric characters in columns that are supposed to be numeric,
>> - invalid characters,
>> - rows with too many characters, possibly due to invalid characters or a
>> missing end-of-line character (so two rows in the original data become
>> one row in the .txt file).
>>
>> The errors are not very frequent, but they stop me from importing with
>> readr::read_fwf().
>>
>>
>> Is there some package, or workflow, in R to pre-process the files,
>> separating the valid from the invalid rows into different files? This
>> can be done with point-and-click ETL tools such as Pentaho PDI. Is
>> there equivalent code in R to do this?
>>
>> I googled it and could not find a solution. I also asked this on
>> StackOverflow and got no answer (here
>> <http://stackoverflow.com/questions/39414886/fix-errors-in-csv-and-fwf-files-corrupted-characters-when-importing-to-r>
>> ).
>
> Had I seen it there I would have voted to close that SO question (and just did) as too broad; it is also too vague, since "corrupted characters" is never defined, and it is basically a request for a package recommendation (which is likewise off-topic on SO).
>
> For the csv part (a smaller-file task, which you didn't repeat here) I would have pointed you to this answer:
>
> http://stackoverflow.com/questions/19082490/how-can-i-use-r-to-find-malformed-rows-and-fields-in-a-file-too-big-to-read-into/19083665#19083665
>
> For the fwf part (in a file that fits into RAM), I would have suggested wrapping table(nchar( . )) around readLines(filename), and then drilling down with which( nchar( . ) == <chosen_line_length> ).
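>
> Spelled out (a minimal sketch of those calls, assuming the whole file
> fits in RAM; chosen_line_length stands for whichever length the table
> reveals as suspicious):
>
>   lines <- readLines(filename)
>   table(nchar(lines))                        # distribution of line lengths
>   which(nchar(lines) == chosen_line_length)  # rows of that length, to inspect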
>
> I believe searching Rhelp will bring up examples of how to handle file input in chunks, which should allow you to cobble together a strategy if you insist on using R ... the wrong tool. If you need to narrow your Rhelp archive search, I suggest using the names "Jim Holtman", "William Dunlap", or "Gabor Grothendieck", since they frequently have the most elegant strategies, in my opinion.
>
> Here's a search strategy implemented via MarkMail:
> http://markmail.org/search/?q=list%3Aorg.r-project.r-help+file+chunks+readlines
>
> But for files of the size you contemplate I would suggest using databases, awk, or other editing software designed for stream processing from disk. R is not so designed.
>
> --
> David.
>
>
>>
>> regards
>> Lucas Mation
>> IPEA - Brasil
>>
>
> David Winsemius
> Alameda, CA, USA
>


