[R] more on paste and bug

Wed Oct 10 21:14:33 CEST 2001

>>>>> "Peter" == Peter Dalgaard BSA <p.dalgaard at biostat.ku.dk> writes:

  Peter> Ott Toomet <siim at obs.ee> writes:
  >> Hi,
  >> 
  >> dput( ce0) gives a correct answer:
  >> > dput( ce0)
  >> c("1985", "9", "2", "2", "1", "A", "1", "", "NA", "5", "1999" )
  >> 
  >> The same does just print( ce0):
  >> > print( ce0)
  >> [1] "1985" "9" "2" "2" "1" "A" "1" "" "NA" "5"
  >> [11] "1999"
  >> 
  >> However, if I make a new similar vector ce0a:
  >> > ce0a <- c( 1985,9,2,2,1,"A",1,"",NA,5,1999)
  >> 
  >> Then the paste works correctly:
  >> > paste( ce0a, m, sep="", collapse="")
  >> [1] "1985<1>9<2>2<3>2<4>1<5>A<6>1<7><8>NA<9>5<0>1999END"
  >> 
  >> I had M as
  >> > m
  >> [1] "<1>" "<2>" "<3>" "<4>" "<5>" "<6>" "<7>" "<8>" "<9>" "<0>" "END"
  >> 
  >> So I have two apparently similar vectors which behave differently
  >> with paste:
  >> > paste( ce0a, m, sep="", collapse="")
  >> [1] "1985<1>9<2>2<3>2<4>1<5>A<6>1<7><8>NA<9>5<0>1999END"
  >> > paste( ce0, m, sep="", collapse="")
  >> [1] "1985<1>9<2>2<3>2<4>1<5>A1<7>NA<9>5<0>1999END"
  >> > ce0a
  >> [1] "1985" "9" "2" "2" "1" "A" "1" "" "NA" "5"
  >> [11] "1999"
  >> > ce0
  >> [1] "1985" "9" "2" "2" "1" "A" "1" "" "NA" "5"
  >> [11] "1999"
  >> 
  >> I suggest there can be some hidden attributes somewhere in ce0
  >> which I have not noticed (there seem not to be factors), the
  >> problem seems to arise with the non-numerical columns (ce0 is
  >> just part of one row of the big dataframe).  Is it possible to
  >> figure it out, and possible change?  At least attributes() do
  >> show nothing:
  >> > attributes(ce0)
  >> NULL
  >> > attributes(ce0a)
  >> NULL

  Peter> Hmmm. The plot would seem to thicken around the entries in
  Peter> ce0 corresponding to <6> and <8>. If these accidentally
  Peter> contain \0 characters, much would be explained. Maybe also
  Peter> other weird characters.

As it happens, I think the problem is in the read.dta code. The relevant
piece of code is in foreign/src/stataread.c (lines 317-324):

    default:
        charlen=INTEGER(types)[j]-STATA_STRINGOFFSET;
        PROTECT(tmp=allocString(charlen+1));
        InStringBinary(fp,charlen,CHAR(tmp));
        CHAR(tmp)[charlen]=0;
        SET_STRING_ELT(VECTOR_ELT(df,j),i,tmp);
        UNPROTECT(1);
        break;

In this case the string "A" is stored in the file as two bytes (I do
not know why), the second byte being '\0'. So charlen is 2, and the
above code creates a CHARSXP of length 3 whose last two bytes are
'\0'.
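
To make the byte counts concrete, here is a minimal standalone C
sketch. It is not the foreign code: read_field_padded and
read_field_trimmed are hypothetical helpers, plain malloc stands in
for allocString, and the field is assumed to be a fixed-width run of
bytes padded with '\0'. The first helper mirrors the excerpt above;
the second shows one possible way to size the result by the visible
text instead of the field width.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the reader: copy a fixed-width field of
 * `width` bytes into a buffer of width + 1 bytes and terminate it,
 * as the excerpt does. Any '\0' padding in the field is kept. */
char *read_field_padded(const char *field, size_t width)
{
    assert(field != NULL);
    char *buf = malloc(width + 1);
    if (buf == NULL)
        return NULL;
    memcpy(buf, field, width);
    buf[width] = '\0';
    return buf;
}

/* One possible alternative (an assumption, not the package's fix):
 * stop at the first '\0' inside the field, so the stored length
 * equals the visible length. The length is returned via len_out. */
char *read_field_trimmed(const char *field, size_t width, size_t *len_out)
{
    assert(field != NULL && len_out != NULL);
    const char *nul = memchr(field, '\0', width);
    size_t len = nul ? (size_t)(nul - field) : width;
    char *buf = malloc(len + 1);
    if (buf == NULL)
        return NULL;
    memcpy(buf, field, len);
    buf[len] = '\0';
    *len_out = len;
    return buf;
}
```

For the two-byte field {'A', '\0'}, the padded version yields three
bytes ('A', '\0', '\0') of which strlen sees one, while the trimmed
version stores exactly "A" with length 1.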

  Peter> What happens if you do nchar(ce0) ?  What if you omit the
  Peter> collapse= argument?

nchar uses strlen, so it would report a length of 1 for that element
even though the CHARSXP itself has length 3.

By the way, looking at the code for mkChar and paste, it seems that
R is _not_ storing null-terminated strings: mkChar only allocates
storage for strlen(name), not strlen(name)+1, and paste uses LENGTH
to get the string length. At the same time, do_nchar uses strlen.
Could there be a potential problem here? Maybe do_nchar should use
strnlen instead?
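
The mismatch between length-based and strlen-based handling is easy to
reproduce outside R. The following standalone C sketch is not R's
code: counted_str, bounded_len and join_by_length are hypothetical
names, and bounded_len is a memchr-based equivalent of strnlen (which
is POSIX rather than ISO C). It only illustrates that a join driven by
stored lengths keeps embedded '\0' bytes, while any strlen-style scan
of the result stops at the first one.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* A counted string: `len` bytes of data, which may contain '\0'. */
typedef struct {
    const char *data;
    size_t len;
} counted_str;

/* Like strnlen: number of bytes before the first '\0', but never
 * more than `max`, so it cannot run past the stored length. */
size_t bounded_len(const char *s, size_t max)
{
    const char *nul = memchr(s, '\0', max);
    return nul ? (size_t)(nul - s) : max;
}

/* Join a and b using their stored lengths only. The result holds
 * a.len + b.len bytes plus a terminator; the byte count is returned
 * through len_out. Returns NULL if allocation fails. */
char *join_by_length(counted_str a, counted_str b, size_t *len_out)
{
    assert(len_out != NULL);
    char *out = malloc(a.len + b.len + 1);
    if (out == NULL)
        return NULL;
    memcpy(out, a.data, a.len);
    memcpy(out + a.len, b.data, b.len);
    out[a.len + b.len] = '\0';
    *len_out = a.len + b.len;
    return out;
}
```

Joining the three bytes 'A', '\0', '\0' with "<6>" gives a six-byte
buffer in which "<6>" is present at offset 3, yet strlen on the result
reports 1. That is consistent with the separator going missing in the
collapsed output quoted above, though I have not traced the exact path
through paste.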

Saikat
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._