[R] Why Numeric Values Become Factors in Data Frame

Joshua Wiley jwiley.psych at gmail.com
Tue Nov 29 20:35:53 CET 2011


Hi Rich,

Try looking at:

levels(waterchem$SC)

There must be something in that column that is triggering R to read it
as character.  Potential examples include using "." to indicate
missing values or anything else that is not itself directly numeric.
You might also get some mileage out of attempting to coerce the
factor labels to numeric and seeing what errors/warnings arise and if
any new values are missing.  For instance:

x <- factor(c("1", "2", "NA", "3e5", "."))

> levels(x)
[1] "."   "1"   "2"   "3e5" "NA"
> as.numeric(levels(x))
[1]    NA 1e+00 2e+00 3e+05    NA
Warning message:
NAs introduced by coercion
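
To go one step further, here is a rough sketch that lists the labels
that fail to convert and the positions where they occur. It reuses the
toy factor from above; with your data you would substitute
waterchem$SC for x. The names lv, num, bad, and x_num are just
placeholders I made up for illustration.

```r
# Toy factor from the example above; in practice this would be waterchem$SC.
x <- factor(c("1", "2", "NA", "3e5", "."))

lv  <- levels(x)
num <- suppressWarnings(as.numeric(lv))  # NA wherever a label is not a number
bad <- lv[is.na(num)]                    # the labels that failed to convert
bad

# Positions of the observations that carry one of those labels
which(as.character(x) %in% bad)

# Convert the factor's values (not its integer codes) to numeric
x_num <- num[x]
x_num
```

Note that calling as.numeric() directly on a factor returns the
internal integer codes rather than the values, which is why the
conversion goes through the levels.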

Nothing else comes to mind off the top of my head to try.  Once you
determine what is doing it, you can force the class in read.table
using the colClasses argument.
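
As a sketch of what that might look like, assuming the culprit turns
out to be a "." used for missing values (that is only a guess; I have
not seen your file): the txt string below is a made-up miniature of
your file, and wc is a hypothetical name. Forcing the class alone
would make read.table stop with an error at the first non-numeric
entry, so the offending string also has to be declared in na.strings.

```r
# Hypothetical miniature of the source file: pipe-separated, single-quoted
# strings, and a "." standing in for a missing SC value (an assumption).
txt <- "site|sampdate|'pH'|'SC'
'D-1'|'2007-12-12'|7.800|630.000
'D-1'|'2008-03-15'|7.940|.
'D-2'|'2008-03-15'|NA|1.090"

# Treat "." as missing and force SC to be read as numeric.
# Columns not named in colClasses are still type-guessed as usual.
wc <- read.table(text = txt, header = TRUE, sep = "|",
                 na.strings = c("NA", "."),
                 colClasses = c(SC = "numeric"))
str(wc)
```

With the real file you would pass 'wqR.txt' in place of the text
argument and keep the rest of the call the same.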

Cheers,

Josh

On Tue, Nov 29, 2011 at 11:18 AM, Rich Shepard <rshepard at appl-ecosys.com> wrote:
>  I have a data frame with 1 factor, one date, and 37 numeric values:
> str(waterchem)
> 'data.frame':   3525 obs. of  39 variables:
>  $ site      : Factor w/ 64 levels "D-1","D-2","D-3",..: 1 1 1 1 1 ...
>  $ sampdate  : Date, format: "2007-12-12" "2008-03-15" ...
>  $ CO3       : num  1 1 6.7 1 1 1 1 1 1 1 ...
>  $ HCO3      : num  231 228 118 246 157 208 338 285 260 240 ...
>  $ Ca        : num  100 88.4 63.4 123 78.2 103 265 213 178 166 ...
>  $ DO        : num  4.96 9.91 4.32 2.58 1.81 5.09 3.98 5.46 1.9 2.52 ...
>  ...
>  $ SC        : Factor w/ 841 levels "1.090","10.000",..: 635 638 363
>
>  All the numeric categories are read in as numbers except for some of those
> in column 'SC'. I have been looking in the source file for a couple of hours
> trying to learn why values such as 1.090 and 10.000 are seen as characters
> rather than numbers. I've not seen the reason.
>
>  The source file is 860K and looks like this:
>
> site|sampdate|'Ag'|'Al'|'CO3'|'HCO3'|'Alk-Tot'|'As'|'Ba'|'Be'|'Bi'|'Ca'|'Cd'|'Cl'|'Co'|'Cr'|'Cu'|'DO'|'Fe'|'Hg'|'K'|'Mg'|'Mn'|'Mo'|'Na'|'NH4'|'NO3-NO2'|'Oil-grease'|'Pb'|'pH'|'Sb'|'SC'|'Se'|'SO4'|'Sr'|'TDS'|'Tl'|'V'|'Zn'
> 'D-1'|'2007-12-12'|0.000|0.106|1.000|231.000|231.000|0.011|0.000|0.002|0.000|100.000|0.000|1.430|0.000|0.006|0.024|4.960|4.110|NA|0.000|9.560|0.035|0.000|0.970|0.010|0.293|NA|0.025|7.800|0.001|630.000|0.001|65.800|0.000|320.000|0.001|0.000|11.400
> 'D-1'|'2008-03-15'|0.000|0.080|1.000|228.000|228.000|0.001|0.000|0.002|0.000|88.400|0.000|1.340|0.000|0.006|0.014|9.910|0.309|0.000|0.000|9.150|0.047|0.000|0.820|0.224|0.020|NA|0.025|7.940|0.001|633.000|0.001|75.400|0.000|300.000|0.001|0.000|12.400
>
>  The R command used to create the data frame is:
>        waterchem <- read.table('wqR.txt', header = TRUE, sep = '|')
>
>  Pointers on how to determine why this one variable has some values read as
> characters rather than as numerics are needed.
>
> Rich
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



-- 
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, ATS Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/


