[R] reading a text file with a stray carriage return

Thu Mar 8 11:11:06 CET 2007

On 08-Mar-07 jim holtman wrote:
> How do you define a carriage return in the middle of a line
> if a carriage return is also used to delimit a line?  One of the
> things you can do is to use 'count.fields' to determine the
> number of fields in each line. For those lines that are not the
> right length, you could combine them together with a 'paste'
> command when you write them out.

For a moment I though, Walter, that you were simply suffering
from repeat carriage-returns (CR) and getting blank lines, but
then I saw that you refer to CRs "in the middle of input lines".
The "blank line" situation is quite easily handled by pe-processing
the file (before R sees it) if you have the tools available. For
eaxample. the file "test.txt"

  Line 1

  Line 2
  Line 3
  Line 4

  Line 5
  Line 6

(with 1 and then 2 extraneous blank lines), after being fed through

  awk '!/^$'

becomes

awk '!/^$/' < test.txt
  Line 1
  Line 2
  Line 3
  Line 4
  Line 5
  Line 6

since '!/^$/' cause awk to output those lines which do not ("!")
match ("/.../") the pattern "beginning of line" ("^") immediately
followed by "end of line" ("$") -- i.e. it filters out blank lines.

However, if your problem is extraneous CRs (or more generally End
of Line [EOL] which in DOS/Windows is CRLF [CR followed by Line Feed]
and in Unix/Linux is LF, or on a Mac is CR), then probably Jim
Holtman's suggestion, or something like it, is the approach to take.

But, as he implies, there may be more to your problem than meets
the eye. Clearly extraneous EOLs indicate that the path between
the original data and the preparation of the CSV file went wrong
somwhere, of which the extra EOLs are are symptom; but they may
not be the only symptom if you look carefully. Once people are
careless about preparing data (and it's very easy to be careless
when entring data into spreadsheets, for instance), then all sorts
of things can go wrong (and in the case of Excel, I've known Excel
iitself to invent content when the user accidentally strays outside
the boundaries of the spreadsheet). Hence Jim's suggestion of
counting fields is a useful start.

However, supposing that extraneous EOLs is your only problem,
the following strategy may solve it. Assume you have read in
the file, and can count the number of fields (NF) in each line
as read in. I'm assuming that (as described) you should have
3 fields per line. Then loop through the result, line by line:

1.  Read a line and count fields (NF)
2.  If NF=0, skip the line
3.  If NF>3, warning "Too many fields in Line XXX" and abort
4.  If NF=3, transfer the line to the "good" version
5.  If NF<3, make temporary copy "tcp" of line and then
5.1   Read next line and count fields (NF1)
5.2   If NF1=0 skip the line
5.3   If NF+NF1>3 then warning and abort as above
5.4   If NF+NF1=3 then paste line to "tcp" and transfer to "good"
5.5   If NF+NF1<3 paste to "tcp" and then
5.5.1    Read next line and count fields (NF2)
[and so on]

I've put "Warning and abort" cases in there because if "Too many
fields" occurs then you have indeed got more problems than simply
extraneous EOLs, and you have no option but to go into the file
by hand and see what is happening, and try to mend it.

Likewise, if you simply count fields for each line, you can see
how many instances there are of lines with too few fields. This
(if you are lucky) will indicate the magnitude of the task of
opening the file and editing out the errors by hand.

Again, this is the sort of thing with problem files which I have
usually approached using awk, since (in Linux/Usinx at least)
you can do

cat funnyfile.csv | awk 'BEGIN{FS=","}
  {nfields = NF; print nfields}' | sort -u

which (a) sets the Field Separator (FS) to "," (for CSV);
(b) counts the number of fields in a line (NF) and outputs it;
(c) pipes the output to "sort -u" which removes duplicates
and sorts the result. If the file is "clean" you will get only
one number in the result -- the number of fields that should be
in each line (in your case, 3). If you get more than one number,
then you can identify the lines with problems on the lines of

cat funnyfile.csv | awk 'BEGIN{FS=","}
  {nline = NR; nfields = NF}
  {if(nfields != 3){print nline " " nfields}'

which tracks the line count (Number of records [NR]), counts
the number of fields, checks whether this is wring and if so
outputs the number of the offending input line and the number
of fields it has.

One potential problem which would not be identified in this
way is the case where fields get omitted altogether and part
of the following record gets wrapped into the deficient
record, so that line breaks get out of step with the starts
of records. (I've known this happen too ... ) You can then
get the situation where records appear to have the right
numbers of fields, but they are the wrong fields.

I find I've been induced to go on at some length, because your
problem has reminded me of so many dreadfully mangled data files
I've met in the past, and have had to pick through in the kind
of way described above. You have my sympathy!

Good luck!
Ted.

> On 3/7/07, Walter R. Paczkowski <dataanalytics at earthlink.net> wrote:
>>
>>
>>   Hi,
>>   I'm  hoping someone has a suggestion for handling a simple problem. 
>>   A client  gave  me a comma separated value file (call it x.csv)
>>   that has an id and name and address for about 25,000 people
>>   (25,000 records).
>>   I used read.table to read it, but then discovered that there are
>>   stray carriage returns on several records. This plays havoc with
>>   read.table since it starts a new input line when it sees the
>>   carriage return. 
>>   In short, the read is all wrong.
>>   I thought I could write a simple function to parse a line and
>>   write it back  out,  character by character. If a carriage
>>   return is found, it  would  simply  be  ignored on the writing
>>   back out part. But how do I identify a carriage return? What is
>>   the code or symbol? Is there any easier way to rid the file
>>   of carriage returns in the middle of the input lines?
>>   Any help is appreciated.
>>   Walt Paczkowski

--------------------------------------------------------------------
E-Mail: (Ted Harding) <ted.harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 08-Mar-07                                       Time: 10:10:55
------------------------------ XFMail ------------------------------