[R] mean for subset

Jerry Floren jerry.floren at state.mn.us
Thu Jan 7 16:16:13 CET 2010


As a novice R user, I face a similar challenge. I am almost afraid to share
with this group how I solved it. About 65 labs in our proficiency program
submit data on individual Excel spreadsheets with triple replicates. There
always are a few labs that do not complete the full set of three replicates,
and I do not want their data included in my analysis.

First, I combine all the individual spreadsheets into one large Excel
spreadsheet. The replicates are in three columns: rep1, rep2, and rep3. I
sort on each individual rep column in Excel. Then I go to both the top and
the bottom of the list. 

For example, I sort on rep1 and go to the top of the list to delete any rows
where a value for rep1 was not recorded. Then I go to the bottom of the list
and delete any rows where rep1 is text instead of a number, for example,
<0.001. I should say that the labs are instructed that they must complete
all three replicates, and they must not enter results as text. Next I repeat
the process for rep2 and rep3. 

I'll do a little more work in Excel on the large, combined table with all
the lab data. I calculate in Excel the mean, standard deviation, and
coefficient of variation for each of the three reps. Finally, I filter all
the data and delete duplicate rows. This is necessary as I sometimes
accidentally copy the same spreadsheet two times from a lab into my large
table. Finally, I save the cleaned up table in *.csv format that is easily
read into R. 

I know that R can do all of these things, but if you are just learning how
to use R it might be easier to do some initial work in Excel, or a similar
spreadsheet, before running your data through R.

I also use MS-Word's mail merge feature to generate my code. I'll get three
or four pages of code doing what I want for a single analytical test, for
example, calcium. Then I'll use the mail merge feature to generate hundreds
of pages of code with the other analytical tests (nitrogen, phosphorus,
potassium, etc.). I just copy and paste the large, merged Word document into
R. R cranks away for 30 minutes and I end up with several large tables (and
these get additional editing in Ecel) and hundreds of beautiful graphs that
would take weeks to create in Excel.

I was amazed that Word would work. I expected all of Word's special print
control codes would mess things up. I just recently received a new laptop
computer, and now I have an occassional problem with Word's "pretty print
quotes," but if you know about that problem, it is easy to fix.

Jerry Floren
Minnesota Department of Agriculture 





Matthew Dowle-3 wrote:
> 
> 
> As can data.table (i.e. do 'having' in one statement) :
> 
>> DT = data.table(DF)
>> DT[,list(n=length(NAME),mean(SCORE)),by="NAME"][n==3]
>       NAME n       V2
> [1,] James 3 64.00000
> [2,]   Tom 3 78.66667
>>
> 
> but data.table isn't restricted to SQL functions (such as avg),  any R 
> functions can be used, sometimes for their side effects (such as plotting) 
> rather than just returning data.
> 
> Further data.table has a thing called 'join inherited scoping'.   Say we 
> knew the specific groups,  we can go directly to them (without even
> looking 
> at the rest of the data in the table) in very short and convenient syntax, 
> which also happens to run quickly on large data sets (but can be useful
> just 
> for the syntax alone) :
> 
>> setkey(DT,NAME)
>> DT[c("James","Tom"),mean(SCORE),mult="all"]
>       NAME       V1
> [1,] James 64.00000
> [2,]   Tom 78.66667
>>
> 
> Notice there is no "group by" or even a "by" in the above.  It inherits
> the 
> scope from the join because mult="all" means that "James" matches to 
> multiple rows, as does "Tom", creating two groups.  It does it by binary 
> search to the beginning of each group,  binary search to the end of the 
> group,  and runs the R expression inside the scope of that group.
> 
> An example of join inherited scoping for the side effects only :
> 
>> pdf("out.pdf")
>> DT[c("James","Tom"),plot(SCORE),mult="all"]
> NULL data table
>> dev.off()
> # out.pdf now contains 2 plots
> 
> which you couldn't do in SQL because SQL has no plotting (or any of R's 
> other packages).
> 
> It aims to do this quickly.  Where 'quickly' means 1) shorter code is 
> quicker to write, read, debug and maintain  and also  2) quicker to
> compute, 
> and its 1 that often dominates 2.
> 
> Finally, consider the following two statements which are both equivalent :
> 
>> sqldf("select NAME, avg(SCORE) from DF group by NAME having count(*) =
>> 3")
>    NAME avg(SCORE)
> 1 James   64.00000
> 2   Tom   78.66667
>> DT[ J(DT[,length(NAME),by="NAME"][V1==3,NAME]), mean(SCORE), mult="all"]
>    NAME avg(SCORE)
> 1 James   64.00000
> 2   Tom   78.66667
> 
> Now ok I hear you groaning (!) that the 2nd looks (on first glance) ugly, 
> but bear with me ... in the SQL solution do you know for sure that 
> avg(SCORE) isn't computed wastefully for the all the groups that don't
> have 
> count(*)=3 ?  It might well do the 'group by' first for all the groups,
> then 
> do the 'having' afterwards as a 'where' on the result.  It might depend on 
> the particular SQL database being used (mySQL, sqllite, etc) or the 
> installation parameters, any indexes etc.  Some investigation would be 
> required (taking time) if someone doesn't already know.  In the data.table 
> however,  the syntax explictly makes it clear than mean(SCORE) is only 
> computed for the particular groups.  For certain, always.   Maybe this 
> particular example is not a good one, but I'm trying to demonstrate an 
> overall syntax which is scalable (i.e. this syntax can do more complicated 
> things that SQL can't, or can't do well).   Notice that the method earlier 
> on i.e. "DT[,list(n=length(NAME),mean(SCORE)),by="NAME"][n==3]"  is
> simpler 
> but wasteful as it does compute mean(SCORE) for all the groups.  But the 
> syntax explicity conveys what is being done, and the user has the choice.
> 
> 
> "Gabor Grothendieck" <ggrothendieck at gmail.com> wrote in message 
> news:971536df1001051122l58389037p4e16288aedfdeb07 at mail.gmail.com...
> Here is the solution using sqldf which can do it in one statement:
> 
>> # read in data
>> Lines <- "OBS     NAME   SCORE
> + 1          Tom       92
> + 2          Tom       88
> + 3          Tom       56
> + 4          James    85
> + 5          James    75
> + 6          James    32
> + 7          Dawn     56
> + 8          Dawn     91
> + 9          Clara     95
> + 10        Clara     84"
>>
>> DF <- read.table(textConnection(Lines), header = TRUE)
>>
>> # run
>> library(sqldf)
>> sqldf("select NAME, avg(SCORE) from DF group by NAME having count(*) =
>> 3")
>    NAME avg(SCORE)
> 1 James   64.00000
> 2   Tom   78.66667
> 
> 
> On Tue, Jan 5, 2010 at 2:03 PM, Gabor Grothendieck
> <ggrothendieck at gmail.com> wrote:
>> Have a look at this post and the rest of that thread:
>>
>> https://stat.ethz.ch/pipermail/r-help/2010-January/223420.html
>>
>> On Tue, Jan 5, 2010 at 1:29 PM, Geoffrey Smith <gps at asu.edu> wrote:
>>> Hello, does anyone know how to take the mean for a subset of 
>>> observations?
>>> For example, suppose my data looks like this:
>>>
>>> OBS NAME SCORE
>>> 1 Tom 92
>>> 2 Tom 88
>>> 3 Tom 56
>>> 4 James 85
>>> 5 James 75
>>> 6 James 32
>>> 7 Dawn 56
>>> 8 Dawn 91
>>> 9 Clara 95
>>> 10 Clara 84
>>>
>>> Is there a way to get the mean of the SCORE variable by NAME but only 
>>> when
>>> the number of observations is equal to 3? In other words, is there a way 
>>> to
>>> get the mean of the SCORE variable for Tom and James, but not for Dawn 
>>> and
>>> Clara? Thank you.
>>>
>>> --
>>> Geoffrey Smith
>>> Visiting Assistant Professor
>>> Department of Finance
>>> W. P. Carey School of Business
>>> Arizona State University
>>>
>>> [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide 
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 
> 

-- 
View this message in context: http://n4.nabble.com/mean-for-subset-tp999254p1008892.html
Sent from the R help mailing list archive at Nabble.com.



More information about the R-help mailing list