[R] Mean/SD of Each Position in Table

Dennis Murphy djmuser at gmail.com
Sun May 1 20:53:12 CEST 2011


Hi:

I would do something like the following:

(1) Create a vector of the file names.
(2) Use lapply() to read the files into a list.
(3) Use the reshape or reshape2 package to melt the individual files
into 'long' form.
(4) rbind together the resulting data frames.
(5) Use a summarization function to generate the means and standard deviations.

I created three data frames that have the structure you provided below
and wrote them out to csv files. The following code creates a vector
of file names, then uses lapply() to read the data files consecutively
and assign them to components of a list. Next, I create a small
utility function that uses the reshape2 package to melt the data into
'long form'. The ldply function from package plyr is then called to
apply the function to each file and then to bind them all together
into a single data frame. Finally, the ddply() function in plyr is
used to get the mean and standard deviation for each time/substance
combination.

#### Code to create test files for the example
# File creation for test files:
# 'name' is the file name stem supplied by sapply() below
ds_create <- function(name) {
   times <- paste('Time', 1:10, sep = '')
   cnames <- paste('Substance', 1:5, sep = '')
   m <- matrix(rpois(50, 7), nrow = 10)
   colnames(m) <- cnames
   m <- as.data.frame(m)
   m$Time <- times
   write.csv(m, file = paste(name, '.csv', sep = ''),
                  quote = FALSE, row.names = FALSE)
  }
nms <- paste('m', 1:3, sep = '')
sapply(nms, ds_create)
####

# Vector of file names
files <- paste('m', 1:3, '.csv', sep = '')
# Read the data frames into a list, where each data frame is a
# separate component
filelst <- lapply(files, read.csv, header = TRUE)

library(plyr)
library(reshape2)
# Function to melt a generic data frame
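f <- function(df) {
     # melt() dispatches to the data frame method; current reshape2
     # spells the naming arguments variable.name and value.name
     melt(df, id.vars = 'Time', variable.name = 'Substance',
          value.name = 'y')
   }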
# Apply the function to each component of the list and rbind the
# results together
bigdf <- ldply(filelst, f)
# Obtain the mean and sd for each Time/Substance combination
bigsumm <- ddply(bigdf, .(Time, Substance), summarise, mean = mean(y),
sd = sd(y))

# ----
Caveat: If you have the reshape package loaded, then at present the
value-name argument will not go through and the name of the last
variable will be 'value'. In that event, you can either rename 'value'
to 'y' with
names(bigdf)[3] <- 'y'
or change 'y' to 'value' in the ddply() call on bigdf. Check bigdf
with
head(bigdf)
to verify that the names are 'Time', 'Substance' and 'y' before
running the last command.
# ----
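
If you would rather not depend on the column position, here is a small sketch that renames by name instead. The toy bigdf below is only a stand-in I made up to show the idea; it is not the real melted data.

```r
# Toy stand-in for bigdf as it would look if the value column kept
# its default name (illustrative data only)
bigdf <- data.frame(Time = c("Time1", "Time2"),
                    Substance = c("Substance1", "Substance1"),
                    value = c(10, 9))

# Rename the column called 'value' to 'y'; if no such column exists,
# the assignment selects nothing and the names are left alone
names(bigdf)[names(bigdf) == "value"] <- "y"

# Stop early if the names ddply() relies on are not all present
stopifnot(all(c("Time", "Substance", "y") %in% names(bigdf)))
```

The stopifnot() line serves the same purpose as eyeballing head(bigdf), but it fails loudly when run in a script.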
The result I get is
> dim(bigsumm)
[1] 50  4
> head(bigsumm)
    Time  Substance      mean       sd
1  Time1 Substance1 10.333333 2.516611
2  Time1 Substance2 10.666667 1.154701
3  Time1 Substance3  6.000000 2.645751
4  Time1 Substance4  6.333333 1.154701
5  Time1 Substance5  5.333333 1.527525
6 Time10 Substance1  4.666667 3.055050
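
As a cross-check, the same per-cell means and SDs can probably be had in base R alone by stacking the tables into a 3-d array. This is a different technique from the plyr/reshape2 route, sketched on simulated matrices; with data frames read from files you would first drop the time column and convert with as.matrix().

```r
set.seed(1)
# Three simulated 10 x 5 tables standing in for the csv contents
mats <- lapply(1:3, function(i) {
  matrix(rpois(50, 7), nrow = 10,
         dimnames = list(paste("Time", 1:10, sep = ""),
                         paste("Substance", 1:5, sep = "")))
})

# Stack the matrices into a 10 x 5 x 3 array
arr <- simplify2array(mats)

# Summarise over the third dimension, i.e. across files, cell by cell
cell_mean <- apply(arr, c(1, 2), mean)
cell_sd   <- apply(arr, c(1, 2), sd)
```

The results come back as 10 x 5 matrices laid out like the original tables, which may be easier to read than the long form when you only need the two summaries.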

The structure is what matters. You should be able to extend this
template to your 100 data frames.
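
For 100+ files you likely will not want to type or paste() the names. One way to build the vector, assuming all the files live in a single directory, is list.files(). The temporary directory and the three tiny files below are made up so that the sketch runs on its own; substitute your own path.

```r
# Assumption: all the .csv files sit in one directory. A temporary
# directory with three small files stands in for it here.
data_dir <- file.path(tempdir(), "csv_demo")
dir.create(data_dir, showWarnings = FALSE)
for (i in 1:3) {
  d <- data.frame(Time = paste("Time", 1:2, sep = ""),
                  Substance1 = c(i, 2 * i),
                  Substance2 = c(0, i))
  write.csv(d, file.path(data_dir, paste("m", i, ".csv", sep = "")),
            row.names = FALSE)
}

# Collect every file ending in .csv, with its full path
files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)
# Read them all into a list, one data frame per component
filelst <- lapply(files, read.csv, header = TRUE)
# Label each component with its file name for easier debugging
names(filelst) <- basename(files)
```

From there the list goes through the melt and summary steps unchanged.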

HTH,
Dennis

On Sun, May 1, 2011 at 8:48 AM, Nemergut, Edward *HS
<EN3X at hscmail.mcc.virginia.edu> wrote:
> I have 100+ .csv files which have the basic format:
>
>> test
>        X Substance1 Substance2 Substance3 Substance4 Substance5
> 1   Time1         10          0          0          0          0
> 2   Time2          9          5          0          0          0
> 3   Time3          8         10          1          0          0
> 4   Time4          7         20          2          1          0
> 5   Time5          6         25          3          2          1
> 6   Time6          5         30          4          2          2
> 7   Time7          4         25          5          3          3
> 8   Time8          3         20          6          3          4
> 9   Time9          2         15          5          3          5
> 10 Time10          1         10          4          4          6
>
> Each table is of exactly the same dimensions.  After reading each of the
> 100+ .csv files into R, I want to determine the mean and SD of each and every
> cell.  That is, I want to calculate the mean and SD for (Time1,Substance1)
> and every other cell across the 100+ .csv files.
>
> I imagine this is a fairly basic question, but my search has been
> unsuccessful.
>
> Thanks in advance,
> ECN
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


