[R] Strange result when subsetting a data frame based on a character variable

Duncan Murdoch murdoch.duncan at gmail.com
Tue Nov 17 21:27:12 CET 2015


On 17/11/2015 2:25 PM, Duncan Murdoch wrote:
> On 17/11/2015 2:14 PM, Karl Schilling wrote:
> > Dear all,
> >
> > I have one observation that I do not quite understand. Maybe someone
> > can clarify this issue for me.
> >
> > I have a data frame which I want to subset based on a grouping variable,
> > say "group". Actually, "group" is a numeric value, but it is saved as a
> > character. I give some code to generate an example data frame below.
> >
> > Now, if I use
> >
> > MySubset <- subset(Data, Data$group == "..")
> >
> > everything works fine, as expected. ".." stands here for the value of
> > group given as a character string.
> >
> > Surprisingly, I also get a correct subsetting if I simply give the plain
> > numeric value of group (like MySubset <- subset(Data, Data$group == ..),
> > AS LONG AS this numeric value is less than 100000.
> >
> > If the numeric value is 100000 or larger, I get an empty subset.
> >
> > OK, I know how to avoid this situation, but I wonder what the
> > explanation for this behavior, which seems rather strange to me, might be.
> >
> > Thank you so much for your suggestions.
>
> If you are comparing a character value to a numeric value, the numeric
> value is converted to character using as.character() for the
> comparison.  The result of as.character(100000), or of a larger number,
> is likely not "100000"; try it.  (With the options I have on my
> computer, I get "1e+05".)
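
A minimal sketch of that coercion, for anyone who wants to try it. The vector x and the result names are made up for illustration; the strings printed by as.character() for doubles may depend on R version and options, so no output is claimed here beyond what is reported above.

```r
# Hypothetical stand-in for Data$group: numbers stored as character strings.
x <- c("20000", "99999", "100000")

# Inspect the strings the numeric values are converted to.
as.character(99999)     # double
as.character(100000)    # double; "1e+05" is the result reported above
as.character(100000L)   # integer literal, shown for comparison

# Character vs. numeric: the number is converted to a string, then the
# two strings are compared.
chr_match_99999  <- x == 99999
chr_match_100000 <- x == 100000

# Explicit numeric comparison: the strings are converted to numbers instead.
num_match_100000 <- as.numeric(x) == 100000

print(chr_match_99999)
print(chr_match_100000)
print(num_match_100000)
```

Comparing the second and third printed vectors shows whether the implicit string comparison and the explicit numeric comparison agree on a given setup.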
>
> If you want a numeric comparison, be explicit:
>
> subset(Data, as.numeric(Data$group) == ..)

This might be bad advice.  If Data$group is a factor (as it tends to be 
when character data is put in a dataframe), this will use the underlying 
factor code, not the visible one.  You need to use

as.numeric(as.character(Data$group))

to do the conversion you probably want.
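
A small sketch of the difference, assuming the column ended up as a factor. The vector g and the result names are illustrative and not taken from the original data frame.

```r
# Hypothetical factor holding the same group labels as in the example.
g <- factor(c("20000", "99999", "100000"))

# Levels are sorted as strings, so "100000" comes first.
levels(g)

# as.numeric() on a factor returns the internal level codes.
codes <- as.numeric(g)

# Going through character first recovers the numbers that are printed.
values <- as.numeric(as.character(g))

# An equivalent form that converts only the levels, then indexes by code.
values_via_levels <- as.numeric(levels(g))[g]

print(codes)   # 2 3 1
print(values)
```

Here a comparison such as codes == 100000 would never match, while values == 100000 picks out the third element.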

Duncan Murdoch
>
>
> Duncan Murdoch
>
> >
> >
> > Karl Schilling
> >
> >
> > #####
> > Example code for reproducing the problem described above:
> >
> > options(stringsAsFactors = FALSE)
> >
> > # set up some data frame
> > value <- c(1:6)
> > group <- rep(c("20000", "99999", "100000"), each = 2)
> > Data <- data.frame(value = value, group = group)
> > str(Data)
> >
> > # subset data frame based on the value of the variable "group",
> > # treating this value once as a character, and once as a number:
> >
> > Data20 <- subset(Data, Data$group == "20000")
> > str(Data20)
> > Data20N <- subset(Data, Data$group == 20000)
> > str(Data20N)
> >
> >
> > Data99 <- subset(Data, Data$group == "99999")
> > str(Data99)
> > Data99N <- subset(Data, Data$group == 99999)
> > str(Data99N)
> > Data100 <- subset(Data, Data$group == "100000")
> > str(Data100)
> > Data100N <- subset(Data, Data$group == 100000)
> > str(Data100N)
> >
>


