[R] create new variable with ifelse? (reproducible example)

(Ted Harding) Ted.Harding at wlandres.net
Sun Sep 16 00:02:20 CEST 2012


[See at end]
On 15-Sep-2012 20:36:49 Niklas Fischer wrote:
> Dear R users,
> 
> I have a reproducible data and try to create new variable "clo" is 1  if
> know variable is equal to "very well" or "fairly well" and getalong is 4 or
> 5
> otherwise it is 0.

>[A]
rep_data<- read.table(header=TRUE, text="
           id1        id2        know getalong
   100000016_a1 100000016_a2   very well        4
   100000035_a1 100000035_a2 fairly well       NA
   100000036_a1 100000036_a2   very well        3
   100000039_a1 100000039_a2   very well        5
   100000067_a1 100000067_a2   very well        5
   100000076_a1 100000076_a2 fairly well        5
")
 
rep_data$clo<- ifelse((rep_data$know==c("fairly well","very well") &
rep_data$getalong==c(4,5)),1,0)

> For sure, something must be wrong, I couldn't find it out.

rep_data
                      id1    id2 know getalong clo
100000016_a1 100000016_a2   very well        4   0
100000035_a1 100000035_a2 fairly well       NA   0
100000036_a1 100000036_a2   very well        3   0
100000039_a1 100000039_a2   very well        5   0
100000067_a1 100000067_a2   very well        5   0
100000076_a1 100000076_a2 fairly well        5   0

> Any help is appreciated..
> Bests,
> Niklas

There are several things wrong with the way you are trying to do it,
and indeed it is a bit complicated!

First: if the above table (at >[A] above) is the format in which
you input the data, then you should either comma-separate your
data fields (and use sep="," in read.table(), or else just use
read.csv(), in either case with strip.white=TRUE so that the
padding spaces are not kept as part of the character fields),
or else enclose the two-word fields within "...", i.e. EITHER:
>[B]
           id1,       id2,       know,   getalong
   100000016_a1, 100000016_a2,   very well,        4
   100000035_a1, 100000035_a2, fairly well,       NA
   100000036_a1, 100000036_a2,   very well,        3
   100000039_a1, 100000039_a2,   very well,        5
   100000067_a1, 100000067_a2,   very well,        5
   100000076_a1, 100000076_a2, fairly well,        5

OR:
>[C]
           id1        id2        know getalong
   100000016_a1 100000016_a2   "very well"        4
   100000035_a1 100000035_a2 "fairly well"       NA
   100000036_a1 100000036_a2   "very well"        3
   100000039_a1 100000039_a2   "very well"        5
   100000067_a1 100000067_a2   "very well"        5
   100000076_a1 100000076_a2 "fairly well"        5

Otherwise, in your original format, read.table() will read in
FIVE fields, since it will treat "very" and "well" as separate,
and will treat "fairly" and "well" as separate. Furthermore,
it will match the header "getalong" with the 5th field (4,NA,etc),
the header "know" with the 4th field ("well","well",...,"well"),
header "id2" with the 3rd field ("very","fairly","very",...,"fairly"),
and header "id1" with the 2nd field ("100000016_a2").

And furthermore, the first field will become the row-names
of the dataframe and will no longer be data!
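
The misparse described above can be checked on a two-row excerpt of the data. What follows is only a sketch: the object names bad, good and csv are made up for illustration, and as.character() is used so that the checks behave the same whether or not strings are read in as factors.

```r
# Original layout: unquoted two-word values, whitespace-separated.
bad <- read.table(header = TRUE, text = "
           id1        id2        know getalong
   100000016_a1 100000016_a2   very well        4
   100000035_a1 100000035_a2 fairly well       NA
")
rownames(bad)            # the id1 values have become row names
as.character(bad$id2)    # "very" "fairly"
as.character(bad$know)   # "well" "well"

# Layout [C]: quoting the two-word values keeps each in one field.
good <- read.table(header = TRUE, text = '
           id1        id2        know getalong
   100000016_a1 100000016_a2   "very well"        4
   100000035_a1 100000035_a2 "fairly well"       NA
')
as.character(good$know)  # "very well" "fairly well"

# Layout [B]: comma-separated; strip.white = TRUE drops the padding
# spaces that would otherwise stay inside the character fields.
csv <- read.csv(strip.white = TRUE, text = "
           id1,       id2,       know,   getalong
   100000016_a1, 100000016_a2,   very well,        4
   100000035_a1, 100000035_a2, fairly well,       NA
")
as.character(csv$know)   # "very well" "fairly well"
```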

Second: Use of "==" to compare $know with "very well" and
"fairly well" will not work as you expect. In your comparison

  rep_data$know==c("fairly well","very well")

you will get the result:

  # [1] FALSE FALSE FALSE  TRUE FALSE FALSE

rather than your expected

  # [1] TRUE TRUE TRUE TRUE TRUE TRUE

This is because "==" works element by element: it compares each
element of $know with just ONE ELEMENT of c("fairly well","very well"),
and will recycle these elements to the length of $know, so it will
compare $know successively with

"fairly well","very well","fairly well","very well","fairly well","very well"

and since $know is

"very well","fairly well","very well","very well","very well","fairly well"

the only match is in the 4th instance, which is why you get

  # [1] FALSE FALSE FALSE  TRUE FALSE FALSE
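
The recycling can be made explicit with rep(). This is a small sketch in which know stands in for rep_data$know, and targets and recycled are names chosen only for illustration:

```r
know <- c("very well", "fairly well", "very well",
          "very well", "very well", "fairly well")
targets <- c("fairly well", "very well")

# "==" recycles the shorter vector to the length of the longer one ...
recycled <- rep(targets, length.out = length(know))
recycled

# ... and then compares position by position, so only the 4th
# element (where both sides are "very well") is TRUE.
know == targets
identical(know == targets, know == recycled)
```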

A better comparison is to use the "%in%" operator, as in:

  rep_data$know %in% c("fairly well","very well")
  # [1] TRUE TRUE TRUE TRUE TRUE TRUE

so you can in the end do:

  rep_data$clo<-
    ifelse((rep_data$know %in% c("fairly well","very well")) &
           (rep_data$getalong %in% c(4,5)),1,0)

which results in:

  rep_data
  #            id1          id2        know getalong clo
  # 1 100000016_a1 100000016_a2   very well        4   1
  # 2 100000035_a1 100000035_a2 fairly well       NA   0
  # 3 100000036_a1 100000036_a2   very well        3   0
  # 4 100000039_a1 100000039_a2   very well        5   1
  # 5 100000067_a1 100000067_a2   very well        5   1
  # 6 100000076_a1 100000076_a2 fairly well        5   1
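
Putting the pieces together, a self-contained version might look like the following. It is a sketch that assumes the data are entered in the quoted layout [C]; the intermediate name cond is introduced here only for readability.

```r
rep_data <- read.table(header = TRUE, text = '
           id1        id2        know getalong
   100000016_a1 100000016_a2   "very well"        4
   100000035_a1 100000035_a2 "fairly well"       NA
   100000036_a1 100000036_a2   "very well"        3
   100000039_a1 100000039_a2   "very well"        5
   100000067_a1 100000067_a2   "very well"        5
   100000076_a1 100000076_a2 "fairly well"        5
')

# TRUE where know is one of the two answers AND getalong is 4 or 5.
cond <- rep_data$know %in% c("fairly well", "very well") &
        rep_data$getalong %in% c(4, 5)

rep_data$clo <- ifelse(cond, 1, 0)
rep_data
```

Since cond is a logical vector with no NA in it, as.integer(cond) would give the same 0/1 coding without ifelse().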

Finally, I suppose it is a happy coincidence that

  NA %in% c(4,5)

yields FALSE rather than what R might have been written to yield,
i.e. NA -- since NA is basically a synonym for "something that we
do not know the value of", strictly speaking we do not know the
value of NA %in% c(4,5). It is possible that the "something that
we do not know the value of" could be either 4 or 5, in which case
NA %in% c(4,5) would be TRUE; but it is also possible that the
"something that we do not know the value of" could be neither
4 nor 5, in which case NA %in% c(4,5) would be FALSE; but since
we do not know which of these possibilities is the case, we do
not know whether it should be TRUE or FALSE, so one can argue
that the result should itself be NA. But, as it happens,

  3 %in% c(4,5)
  # [1] FALSE
  4 %in% c(4,5)
  # [1] TRUE
  5 %in% c(4,5)
  # [1] TRUE
  NA %in% c(4,5)
  # [1] FALSE

so all is well!
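
For what it is worth, ?match describes x %in% table as match(x, table, nomatch = 0) > 0, so it never returns NA. The sketch below contrasts this with "==", which does propagate NA; getalong stands in for rep_data$getalong, and with_eq and with_in are names chosen only for illustration.

```r
NA == 4                                  # NA: "==" propagates the unknown
match(NA, c(4, 5))                       # NA: no match found ...
match(NA, c(4, 5), nomatch = 0L) > 0L    # ... which %in% reports as FALSE
NA %in% c(4, NA)                         # TRUE: NA does match an NA in the table

getalong <- c(4, NA, 3, 5, 5, 5)
with_eq <- ifelse(getalong == 4 | getalong == 5, 1, 0)
with_in <- ifelse(getalong %in% c(4, 5), 1, 0)
with_eq    # the 2nd element is NA
with_in    # the 2nd element is 0
```

So if a missing getalong should lead to a missing clo rather than 0, the "==" form (or an explicit is.na() check) would be the one to use.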

Hoping this helps,
Ted.

-------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at wlandres.net>
Date: 15-Sep-2012  Time: 23:02:14
This message was sent by XFMail



