[R] perform t.test by rows and columns in data frame

Brian Diggs diggsb at ohsu.edu
Fri Feb 24 01:18:18 CET 2012


See comments inline

On 2/23/2012 3:27 PM, Kara Przeczek wrote:
> Sorry. I forgot to note that I am using R version 2.8.0.

That's a rather old version (2.8.0 came out in October 2008), though
maybe you don't have control over that.

> ________________________________________
> From: r-help-bounces at r-project.org [r-help-bounces at r-project.org] on behalf of Kara Przeczek [przeczek at unbc.ca]
> Sent: February 23, 2012 3:13 PM
> To: r-help at r-project.org
> Subject: [R] perform t.test by rows and columns in data frame
>
> Dear R Help,
>
> I have been struggling with this problem without making much headway.
> I am attempting to avoid using a loop, and would appreciate any
> suggestions you may have. I am not well versed in R and apologize in
> advance if I have missed something obvious.
>
> I have a data set with multiple sites along a river where metal
> concentrations were measured. Three sites are located upstream of a
> mine and three sites are located downstream of the mine. I would like
> to compare the upstream and downstream metal levels using a t-test.
>
>
> The data set looks something like this (but with more metals (25) and sites (6)):
>
> TotalMetals    Mean    Site    Location
> Al    6000    1    us
> Sb    0.6    1    us
> Ba    150    1    us
> Al    6500    2    us
> Sb    0.7    2    us
> Ba    160    2    us
> Al    5600    3    ds
> Sb    0.8    3    ds
> Ba    180    3    ds
> Al    170    4    ds
> Sb    0.8    4    ds
> Ba    175    4    ds

Pasting a printed table isn't a very useful way to send the data; it
can hide important details, such as whether a column is a factor or a
character vector. Better to give a programmatic version, which will
ensure everyone is looking at the same thing. The dput function can
give you this.

mr2 <-
structure(list(TotalMetals = structure(c(1L, 3L, 2L, 1L, 3L,
     2L, 1L, 3L, 2L, 1L, 3L, 2L), .Label = c("Al", "Ba", "Sb"),
     class = "factor"),
     Mean = c(6000, 0.6, 150, 6500, 0.7, 160, 5600, 0.8, 180,
     170, 0.8, 175), Site = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L,
     3L, 4L, 4L, 4L), Location = structure(c(2L, 2L, 2L, 2L, 2L,
     2L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("ds", "us"),
     class = "factor")), .Names = c("TotalMetals", "Mean", "Site",
     "Location"), class = "data.frame", row.names = c(NA, -12L))
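
As a small sketch of how text like this is produced and used: the data
frame d below is a made-up two-row stand-in (not the real data), and the
round trip through capture.output() and parse() only imitates what a
reader does when pasting the dput output into their own session.

```r
# Hypothetical two-row stand-in for the real data frame
d <- data.frame(TotalMetals = factor(c("Al", "Sb")),
                Mean = c(6000, 0.6))

# dput() prints R code that rebuilds the object, column types included
dput(d)

# Imitate a reader pasting that text back into R
txt <- paste(capture.output(dput(d)), collapse = "\n")
d2 <- eval(parse(text = txt))

# The rebuilt copy still has a factor column and a numeric column
str(d2)
```

The point of the exercise is that column types (factor versus character,
integer versus numeric) survive the trip, which a pasted table cannot
guarantee.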


> I have tried several variations of by() and aggregate() and tapply()
> without much luck. I thought I had finally got what I wanted with:
>
> by(mr2$Mean, mr2$TotalMetals, function (x) t.test(mr2$Mean[mr2$Location=="us"], mr2$Mean[mr2$Location=="ds"]))
>
> However, the output, although grouped by metal, had identical results
> for each metal with means for "x and y" equivalent to the mean of all
> metals within each site.

That is because the function you run on each group does not depend on x, 
the set of data for that subset.  You need to reference x, not mr2, 
inside that function.

by(mr2, mr2$TotalMetals, function(x) t.test(x$Mean[x$Location=="us"], 
x$Mean[x$Location=="ds"]))

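Note also that the first argument is now the whole data frame mr2, not
just mr2$Mean, so that x is a data frame containing both the Mean and
Location columns for one metal.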

You can also make use of the formula interface for t.test to simplify 
this even more.

by(mr2, mr2$TotalMetals, function(x) t.test(Mean~Location, data=x))
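
To see the difference on the example data, here is a self-contained
sketch. The data.frame() call is a compact rebuild of mr2 that I am
assuming matches the dput output; bad, good, and pvals are names made up
for the illustration.

```r
# Compact rebuild of the example data (assumed equivalent to the dput output)
mr2 <- data.frame(
  TotalMetals = factor(rep(c("Al", "Sb", "Ba"), times = 4)),
  Mean = c(6000, 0.6, 150, 6500, 0.7, 160, 5600, 0.8, 180, 170, 0.8, 175),
  Site = rep(1:4, each = 3),
  Location = factor(rep(c("us", "ds"), each = 6))
)

# Original call: the function never uses x, so every metal gets the same test
bad <- by(mr2$Mean, mr2$TotalMetals, function(x)
  t.test(mr2$Mean[mr2$Location == "us"], mr2$Mean[mr2$Location == "ds"]))

# Corrected call: the function works on x, the rows for one metal
good <- by(mr2, mr2$TotalMetals, function(x)
  t.test(Mean ~ Location, data = x))

# Pull the p-value out of each t.test result
pvals <- function(res) sapply(res, function(tt) tt$p.value)
pvals(bad)   # one value repeated for every metal
pvals(good)  # a separate value for each metal
```

One thing to watch with the formula interface: the groups are ordered by
the factor levels of Location ("ds" before "us"), so the estimated
difference has the opposite sign from the t.test(us, ds) form.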

> mean(mr2$Mean[mr2$Location=="us"]) #gave the x mean from the output and,
>
> mean(mr2$Mean[mr2$Location=="ds"]) #gave the same y mean from the output
>
> I can get the answer I want by performing the t-test for each metal
> individually with:
>
> y=mr2[mr2$TotalMetals=="Al",]
>
> t.test(y$Mean[y$Location=="us"], y$Mean[y$Location=="ds"])

Note that here you are using y, not mr2, because you wanted just the
subset. The by() call needed the same thing: its function has to work
on x, the subset it is handed, rather than on mr2.

> But it would be painstaking to do this for each metal. In addition
> the data set will be getting larger in the future.
>
> It would also be nice to collect the output in a table or similar
> format for easy output, if possible.

The result of by is a list (length equal to the number of metals), each
element of which is the result of a t.test (which is itself a list,
with components such as estimate, statistic, parameter, and p.value).
To get a table, you need to decide which of those components you want
as columns. Look at the lapply and sapply functions for iterating over
the list and pulling those pieces out.
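
One possible sketch of that last step, certainly not the only way to do
it. The data.frame() call is a compact rebuild of mr2 that I am assuming
matches the dput output; extract and tab are made-up names, and the
choice of columns is just an example. Because the levels of Location
sort as "ds", "us", the first estimate is the downstream mean.

```r
# Compact rebuild of the example data (assumed equivalent to the dput output)
mr2 <- data.frame(
  TotalMetals = factor(rep(c("Al", "Sb", "Ba"), times = 4)),
  Mean = c(6000, 0.6, 150, 6500, 0.7, 160, 5600, 0.8, 180, 170, 0.8, 175),
  Site = rep(1:4, each = 3),
  Location = factor(rep(c("us", "ds"), each = 6))
)

# One t.test per metal
res <- by(mr2, mr2$TotalMetals, function(x)
  t.test(Mean ~ Location, data = x))

# Hypothetical helper: choose which pieces of one t.test result to keep
extract <- function(tt) {
  c(mean.ds = unname(tt$estimate[1]),
    mean.us = unname(tt$estimate[2]),
    t       = unname(tt$statistic),
    df      = unname(tt$parameter),
    p.value = tt$p.value)
}

# sapply() gives one column per metal; t() turns that into one row per metal
tab <- t(sapply(res, extract))
print(tab)
```

The result is a numeric matrix with one row per metal, which
as.data.frame() or write.csv() can take from there.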

> I would greatly appreciate any help that you could provide!
> Thank you,
>
> Kara
> Natural Resources and Environmental Studies, MSc
> University of Northern B.C.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.


-- 
Brian S. Diggs, PhD
Senior Research Associate, Department of Surgery
Oregon Health & Science University


