[R] Working with data frames
wdunlap at tibco.com
Thu Dec 11 17:06:44 CET 2014
Here is a reproducible example:
> d <- read.csv(text="Name,Age\nBob,2\nXavier,25\nAdam,1")
> str(d)
'data.frame': 3 obs. of 2 variables:
$ Name: Factor w/ 3 levels "Adam","Bob","Xavier": 2 3 1
$ Age : int 2 25 1
Do you get something similar? If not, show us what you have (you
could trim it down to a few columns).
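The "numbers instead of names" you describe are most likely the integer codes that a factor stores internally. Here is a small sketch of how to look at both sides of a factor, using the toy data above. One assumption: stringsAsFactors=TRUE is spelled out because R 4.0.0 and later no longer convert strings to factors by default.

```r
# Rebuild the toy data; stringsAsFactors=TRUE is an explicit assumption so
# that Name is a factor regardless of the R version in use.
d <- read.csv(text="Name,Age\nBob,2\nXavier,25\nAdam,1", stringsAsFactors=TRUE)

codes <- as.integer(d$Name)    # the integer codes stored in the factor
labs  <- levels(d$Name)        # the labels, sorted alphabetically by default
nms   <- as.character(d$Name)  # the text names, in row order

print(codes)        # 2 3 1
print(labs)         # "Adam" "Bob" "Xavier"
print(nms)          # "Bob" "Xavier" "Adam"
print(labs[codes])  # indexing the labels by the codes recovers the names
```

So as.character() (or levels()) gets you the text, and as.integer() gets you the codes.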
Let's try some plots.
> plot(d$Age)
This shows a plot of d$Age (on the y axis) vs "Index", where Index is
1:length(d$Age). The points are at (1,2), (2,25), and (3,1). You gave
plot() no information about what should be on the x axis, so it gave
you the index numbers.
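If you want to check those coordinates without looking at a picture, you can ask xy.coords(), the helper that plot() uses to work out its x and y values, what it would do with a single vector. This is only a sketch on the toy data; stringsAsFactors=TRUE is again my assumption to match the session above.

```r
# Toy data as before (stringsAsFactors=TRUE is an assumption, see above).
d <- read.csv(text="Name,Age\nBob,2\nXavier,25\nAdam,1", stringsAsFactors=TRUE)

# With only one vector supplied, the values become y and the positions
# 1, 2, 3 become x.
xy <- xy.coords(d$Age)
print(xy$x)  # the index numbers
print(xy$y)  # the ages
```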
Now ask for d$Name on the x axis and d$Age on the y:
> plot(d$Name, d$Age)
This puts the names, in alphabetical order, on the x axis. The y axis
ranges from about 0 to 25 and neither axis is labelled. There are
thick horizontal line segments where you expect the points to
be. These are degenerate boxplots (each name has only one observation):
when you ask to plot a 'factor' variable on the x axis and numbers on
the y, you get boxplots, one per factor level.
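To see why the boxes collapse to a line, you can ask boxplot() for its summary statistics without drawing anything. This is a sketch on the toy data, with stringsAsFactors=TRUE as my assumption so that Name is a factor.

```r
# Toy data as before (stringsAsFactors=TRUE is an assumption).
d <- read.csv(text="Name,Age\nBob,2\nXavier,25\nAdam,1", stringsAsFactors=TRUE)

# plot=FALSE returns the statistics instead of drawing the boxes.
bp <- boxplot(Age ~ Name, data=d, plot=FALSE)

print(bp$names)  # one box per level: "Adam" "Bob" "Xavier"
print(bp$stats)  # 5 rows (whisker, hinge, median, hinge, whisker) per box;
                 # with one observation per name every row is the same value
```

With more than one observation per name the five statistics would differ and you would see ordinary boxes.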
Some folks suggested you avoid factors by adding stringsAsFactors=FALSE
(or as.is=TRUE) to your call to read.csv. Let's try that:
> d2 <- read.csv(stringsAsFactors=FALSE,
+     text="Name,Age\nBob,2\nXavier,25\nAdam,1")
> plot(d2$Name, d2$Age)
Error in plot.window(...) : need finite 'xlim' values
In addition: Warning messages:
1: In xy.coords(x, y, xlabel, ylabel, log) : NAs introduced by coercion
2: In min(x) : no non-missing arguments to min; returning Inf
3: In max(x) : no non-missing arguments to max; returning -Inf
You get no plot at all: plot() tried to coerce the character names to
numbers and got only NAs.
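The first warning is the key one. Roughly, this is the coercion that fails, and one way back to something plot() can use (a sketch, not a transcript of what plot() does internally):

```r
# Names read as plain character strings this time.
d2 <- read.csv(text="Name,Age\nBob,2\nXavier,25\nAdam,1", stringsAsFactors=FALSE)

str(d2$Name)                                # a character vector, not a factor
x <- suppressWarnings(as.numeric(d2$Name))  # the coercion the warning refers to
print(x)                                    # NA NA NA, so no finite x limits

f <- factor(d2$Name)                        # converting back to a factor
print(levels(f))                            # "Adam" "Bob" "Xavier"
```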
You can get closer to what I think you want by drawing the axes yourself:
with(d, {
  plot(as.integer(Name), Age, axes=FALSE, xlab="Name")
  axis(side=2) # draw the usual y axis
  axis(side=1, at=seq_along(levels(Name)), labels=levels(Name)) # names on the x axis
})
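If you do this often, the same steps could be wrapped in a small function. The name plot_by_name and its arguments are hypothetical, made up for this sketch; it accepts either a factor or a character vector for the names.

```r
# Hypothetical helper (not part of R or any package): plot y against the
# categories in x, labelling the x axis with the category names.
plot_by_name <- function(x, y, xlab="Name", ylab="Value") {
  f <- if (is.factor(x)) x else factor(x)   # character input becomes a factor
  at <- seq_along(levels(f))                # tick positions 1, 2, ..., nlevels
  plot(as.integer(f), y, axes=FALSE, xlab=xlab, ylab=ylab)
  axis(side=2)                              # the usual y axis
  axis(side=1, at=at, labels=levels(f))     # level names instead of numbers
  box()                                     # frame around the plot region
  invisible(list(at=at, labels=levels(f)))  # returned so you can inspect it
}
```

Something like plot_by_name(d2$Name, d2$Age, ylab="Age") should then work whether or not the names were read as factors.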
If you want the names in a different order on the x axis, then reconstruct
the factor object d$Name with a different order of levels. E.g.,
d$Name <- factor(d$Name, levels=c("Xavier", "Bob", "Adam"))
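Here is a sketch of what that reordering changes and what it leaves alone, plus one way to derive the level order from the data instead of typing it. The by_age variable is just an illustration, and stringsAsFactors=TRUE is my assumption as before.

```r
# Toy data as before (stringsAsFactors=TRUE is an assumption).
d <- read.csv(text="Name,Age\nBob,2\nXavier,25\nAdam,1", stringsAsFactors=TRUE)

d$Name <- factor(d$Name, levels=c("Xavier", "Bob", "Adam"))
print(levels(d$Name))        # "Xavier" "Bob" "Adam": the new axis order
print(as.integer(d$Name))    # 2 1 3: the codes follow the new level order
print(as.character(d$Name))  # "Bob" "Xavier" "Adam": the rows are unchanged

# Order the levels by decreasing Age rather than by hand.
by_age <- factor(d$Name,
                 levels=as.character(d$Name[order(d$Age, decreasing=TRUE)]))
print(levels(by_age))        # "Xavier" "Bob" "Adam"
```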
There are various plotting packages, e.g., ggplot2, that can make this
sort of thing easier, but I think the recommendation not to use factors
is wrong. You do need to learn how to use them to your advantage.
On Thu, Dec 11, 2014 at 5:00 AM, Sun Shine <phaedrusv at gmail.com> wrote:
> I am struggling with data frames and would appreciate some help please.
> I have a data set of 13 observations and 80 variables. The first column is
> the names of different political area boundaries (e.g. MHad, LBNW, etc),
> the first row is a vector of variable names concerning various census data
> (e.g. age.T, hse.Unk, etc.). The first cell [1,1] is blank.
> I have loaded this via read.csv('path.to/data.set.csv'), and now want to
> run some analyses on this data frame. If I want to get a list of the names
> of the political areas (i.e. the first column), the result is a vector of
> numbers which appear to correlate with the factors, but I don't get the
> text names, just the corresponding number. So, if I want to plot something
> basic, like the area that uses the most gas for central heating, for
> example:
> > plot(data.set$ch.Gas)
> The result is the y-axis gives the gas usage for the areas, but the x-axis
> gives only the numbers of the areas, not the names of the areas (which is
> what I want).
> So, two questions:
> (1) have I set up my csv file correctly to be read as a data frame as the
> first row of all of the remaining columns with the values for that
> political area in the corresponding row in the column with the specific
> variable name? So far, looking through tutorials and books seems to suggest
> yes, but at this point I'm no longer sure.
> (2) How can I access the names of the political areas when plotting so
> that these are given on the x-axis instead of the numbers?
> Thanks for any help.