[Rd] (HP-UX) scan: last line gets duplicated (PR#790)

Tue, 26 Dec 2000 00:24:58 +0100 (CET)

Prof Brian Ripley writes:

> Could you send us your example file for this problem?  I think the
> assumption is that a field will never be empty apart from white space.
> in which case there is a simpler patch, as fields only of white space
> don't put anything in the buffer.

I've done some more tests. Duplication doesn't only happen at
the end of the file. For each non-empty line followed by an
empty line or followed by a line with only white space, a copy
of the last non-empty line gets inserted. And in addition, at the end
of the file the last non-empty line gets duplicated.

Any file of a few lines will do: with or without empty lines,
with or without lines that contain only white space, and whether
the last line holds data, is empty, or contains only white space.

Here is an example data file; the beginning and end of each line
are marked with |

  ||
  | |
  |0|
  ||
  | 1 1 |
  | |
  |  2 2 2  |
  |  |
  |   3 3 3   |
  |   |
  |    4    |
  |    |
  |  |
  | |
  | |
  ||

All lines, including the last one, end with \n
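
For anyone who wants to recreate the file, here is a minimal sketch
in R. The vector name "lines" is only for illustration, and I am
assuming the file is called "tst1" as in the second scan call
below. On a platform that writes \r\n line endings the bytes will
differ, but the line contents should be the same.

```r
# Rebuild the example file; each element is one line, without the | markers.
lines <- c("", " ", "0", "", " 1 1 ", " ", "  2 2 2  ", "  ",
           "   3 3 3   ", "   ", "    4    ", "    ", "  ", " ", " ", "")

# writeLines() terminates every line, including the last one.
writeLines(lines, "tst1")

# Sanity check: 16 lines in total, 3 of them completely empty.
n_lines <- length(readLines("tst1"))
n_empty <- sum(readLines("tst1") == "")
print(c(n_lines, n_empty))
```

The 13 lines that are not completely empty correspond to the 13
items read without strip.white.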

Doing this:

  scan(file="tst1", what="", sep="\n", strip.white=c(TRUE))

gives:

  Read 15 items
   [1] "0"     "0"     "1 1"   "1 1"   "2 2 2" "2 2 2" "3 3 3" "3 3 3" "4"
  [10] "4"     "4"     "4"     "4"     "4"     "4"

Doing this:

  scan(file="tst1",  what="", sep="\n")

correctly gives:

  Read 13 items
   [1] " "           "0"           " 1 1 "       " "           "  2 2 2  "
   [6] "  "          "   3 3 3   " "   "         "    4    "   "    "
  [11] "  "          " "           " "

The patch changes only one line of code, so I don't see how a
patch could be simpler. It is possible, though, that my patch
hides some underlying problem.

After applying my patch, the first example gives the correct result:

  Read 5 items
  [1] "0"     "1 1"   "2 2 2" "3 3 3" "4" 

I have done even more tests, using a more complex data file:

  ||
  | |
  |0|
  ||
  |, 1 1 |
  | |
  |  2 , 2,2  |
  |  |
  |   ,3, 3 3 ,  |
  |   |
  |    4    |
  |    |
  |  |
  | |
  | |
  ||

And this time, I define "," as separator:

  scan(file="tst2",  what="", sep=",", strip.white=c(TRUE))

The original version gives:

  Read 20 items
   [1] "0"   "0"   "0"   "1 1" "1 1" "2"   "2"   "2"   "2"   "2"   "3"   "3 3"
  [13] ""    "4"   "4"   "4"   "4"   "4"   "4"   "4"   

Again, non-empty fields are copied into empty fields in a similar way.

The patched version gives:

  Read 11 items
   [1] "0"   ""    "1 1" "2"   "2"   "2"   ""    "3"   "3 3" ""    "4"

The last result is the same as with my non-patched version on Linux.

I guess this is what you would want: the commas at the
beginning and at the end of a line seem to imply an empty
field, and that is what you get. I guess the same happens with
two consecutive separators on one line with only white space in
between. "strip.white" doesn't imply "skip.empty.fields".
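
The guess about two consecutive separators can be checked with a
small sketch like the one below. It assumes a recent R in which
scan() accepts a text= argument, so no file is needed; the input
string is made up for the test. If the reasoning above holds, the
middle field should come back as an empty string rather than
being dropped.

```r
# Assumption: scan() in this R version supports text= and quiet=.
# One line, two separators, only white space between them.
x <- scan(text = "a, ,b", what = "", sep = ",",
          strip.white = TRUE, quiet = TRUE)

# Show the fields and how many were read.
print(x)
print(length(x))
```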

As a side note: the current definition of scan sets strip.white
to FALSE by default, and blank.lines.skip to TRUE. However, if
sep is set to "\n", the logical thing to do would be not to skip
blank lines by default, but to include them as empty fields.
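
Until the default changes, the behaviour can be requested
explicitly. A minimal sketch, using a throwaway temporary file
and made-up variable names ("skipped", "kept"), and assuming a
scan() that accepts quiet=:

```r
# Hypothetical three-line file with an empty line in the middle.
f <- tempfile("blank-lines-")
writeLines(c("0", "", "1 1"), f)

# Default: blank.lines.skip = TRUE, so the empty line is skipped.
skipped <- scan(f, what = "", sep = "\n", quiet = TRUE)

# Explicitly ask scan() not to skip blank lines.
kept <- scan(f, what = "", sep = "\n", blank.lines.skip = FALSE, quiet = TRUE)

print(length(skipped))
print(length(kept))
```

Comparing the two lengths shows whether the empty line was
retained as a field.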

-- 
Peter Kleiweg
http://www.let.rug.nl/~kleiweg/
