[R] removed data is still there!

Greg Snow Greg.Snow at imail.org
Tue Sep 21 23:22:48 CEST 2010


This comes up every now and then.  The fact that R does not throw away information unless explicitly told to is a feature, and one that I don't want to see go away.

Yes, in your example a table or plot based on iris1$Species gives meaningless results, but anything you do with that column is now meaningless.  Why do you care if there is extra information in a column that you should not be doing anything further with anyway?  Does it really make sense to use that column for anything now?  It is a bit like a teacher bemoaning the fact that half of his/her students scored below the class median.
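
For reference, here is a minimal sketch of the behavior under discussion, using the built-in iris data and the iris1 subset from the original post. The variable names before and counts are only illustrative.

```r
# Subset to one species; the factor still carries all three levels.
iris1 <- iris[iris$Species == 'setosa', ]
before <- levels(iris1$Species)   # setosa, versicolor, virginica
counts <- table(iris1$Species)    # zero counts for the two absent species
print(counts)

# Dropping the unused levels is an explicit, separate step.
iris1$Species <- factor(iris1$Species)
print(levels(iris1$Species))

# In current versions of base R, droplevels() does the same for
# every factor column of a data frame in one call.
iris2 <- droplevels(iris[iris$Species == 'setosa', ])
print(levels(iris2$Species))
```

Nothing is lost by the subset itself; the levels go away only when asked.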

Now some propose that all factors should have their levels dropped after subsetting.  This is worse than useless; consider the following made-up example:

tmp1 <- rep( c(1:5,1:5), c(10,20,30,20,0,0,10,20,30,20) )
result <- factor(tmp1, levels=1:5, labels=c('Strongly Disagree', 
	'Disagree', 'No Opinion', 'Agree', 'Strongly Agree') )

my.df <- data.frame( result=result, sex = rep( c('M','F'), each=80 ) )

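# The .1 copies get their unused levels dropped; the .2 copies keep all five.
df.m.2 <- df.m.1 <- my.df[ my.df$sex=='M', ]
df.f.2 <- df.f.1 <- my.df[ my.df$sex=='F', ]

# Re-applying factor() to every column drops the unused levels.
df.m.1[] <- lapply( df.m.1, factor )
df.f.1[] <- lapply( df.f.1, factor )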


dev.new()
par(mfrow=c(2,1))
barplot(table(df.m.1$result), main='Males')
barplot(table(df.f.1$result), main='Females')

dev.new()
par(mfrow=c(2,1))
barplot(table(df.m.2$result), main='Males')
barplot(table(df.f.2$result), main='Females')


Which pair of plots is more meaningful? Easier to read? Not misleading?
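
The same comparison can be made without plots. The following is a minimal sketch that rebuilds the made-up data so it runs on its own. The names lev, m, f and kept are illustrative, and rbind is just one way to line the counts up.

```r
# Rebuild the made-up survey data from the example.
lev <- c('Strongly Disagree', 'Disagree', 'No Opinion', 'Agree', 'Strongly Agree')
tmp1 <- rep(c(1:5, 1:5), c(10, 20, 30, 20, 0, 0, 10, 20, 30, 20))
my.df <- data.frame(result = factor(tmp1, levels = 1:5, labels = lev),
                    sex = rep(c('M', 'F'), each = 80))

m <- my.df[my.df$sex == 'M', ]
f <- my.df[my.df$sex == 'F', ]

# Levels kept: both tables share the same five categories, so they
# stack into one matrix with matching columns.
kept <- rbind(Males = table(m$result), Females = table(f$result))
print(kept)

# Levels dropped: each table has only four categories, and not the
# same four, so the columns no longer correspond.
print(names(table(factor(m$result))))
print(names(table(factor(f$result))))
```

With the levels kept, the zero counts are visible and the two groups can be compared column by column. With the levels dropped, the first column means 'Strongly Disagree' for the males and 'Disagree' for the females.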



-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111


> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of Ivan Calandra
> Sent: Tuesday, September 21, 2010 7:23 AM
> To: r-help at r-project.org
> Subject: Re: [R] removed data is still there!
> 
>   Hi,
> 
> I knew about that way already, with factor(). Isn't there another
> possibility, directly at the subsetting step? That would be of great
> help
> iris1 <- iris[iris$Species == 'setosa',]  ## I mean here
> 
> Ivan
> 
> 
> 
> On 9/21/2010 15:14, David Winsemius wrote:
> >
> > On Sep 21, 2010, at 9:04 AM, David Winsemius wrote:
> >
> >>
> >> On Sep 21, 2010, at 8:39 AM, pdb wrote:
> >>
> >>>
> >>> Thanks, but that was what I just discovered myself the hard way.
> >>>
> >>> What I really wanted to know was how to solve this issue.
> >>
> >> Although that was _not_ what you requested in your first post.
> >>
> >> 2 options:
> >>
> >> ?table
> >>
> >> ?factor
> >>
> >> iris1$Species <-factor(iris$Species) # removes "extraneous" levels
> >
> > And that was not what I meant to type. Meant for factor to be applied
> > to second dataframe.:
> >
> > iris1$Species <-factor(iris1$Species) # removes "extraneous" levels
> >
> >
> >>
> >>> --
> >
> > David Winsemius, MD
> > West Hartford, CT
> >
> 
> --
> Ivan CALANDRA
> PhD Student
> University of Hamburg
> Biozentrum Grindel und Zoologisches Museum
> Abt. Säugetiere
> Martin-Luther-King-Platz 3
> D-20146 Hamburg, GERMANY
> +49(0)40 42838 6231
> ivan.calandra at uni-hamburg.de
> 
> **********
> http://www.for771.uni-bonn.de
> http://webapp5.rrz.uni-hamburg.de/mammals/eng/mitarbeiter.php
> 
> 


