[R] meaning of formula in aggregate function

P Ehlers ehlers at ucalgary.ca
Sun Jan 23 14:38:54 CET 2011


Den wrote:
> Dear Dennis
> Thank you very much for your comprehensive reply and for the time you
> spent dealing with my e-mail.
> Your kind explanation made things clearer for me.
> After your explanation it looks simple:
> lapply with the chosen options takes the small part of cycle<n> with
> the same id (e.g. df[df$id==3,"cycle2"]) and turns it into just a
> bunch of characters.
> The only thing I still don't get is how this code gets rid of the
> NAs, but this is a rather minor technical issue. My main question was
> about the formula. You helped me indeed.

Okay, now I see what you're asking regarding the NAs.
I should have realized it before. Anyway, the answer
is in the function sort(). Have a look at its help
page and note what sort does when 'na.last=NA', the
default. You'll see where the NAs went.
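
To make this concrete, here is a minimal sketch in R. The vector x is
made up for illustration and stands in for the values of one cycle
column within one id. With the default na.last = NA, sort() removes the
NAs, so paste() never sees them:

```r
# Hypothetical stand-in for one cycle column within one id
x <- c("a", NA, "f", "c")

sorted <- sort(x)                  # default na.last = NA: NAs are removed
kept   <- sort(x, na.last = TRUE)  # NAs are kept and placed last

joined     <- paste(sorted, collapse = "")  # no NA left to paste
joined_raw <- paste(x, collapse = "")       # paste() turns NA into "NA"

print(sorted)      # "a" "c" "f"
print(kept)        # "a" "c" "f" NA
print(joined)      # "acf"
print(joined_raw)  # "aNAfc"
```

So the NAs are dropped by sort() before paste() runs; na.pass only
keeps aggregate() from discarding the rows that contain them.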

Peter Ehlers

> Thank you again
> Have a nice day
> Denis
> From bending but not broken Belarus
> On Sat, 22/01/2011 at 17:55 -0800, Dennis Murphy wrote:
>> Hi:
>>
>> I wouldn't pretend to speak for Henrique, but I'll give it a shot.
>>
>> On Sat, Jan 22, 2011 at 4:44 AM, Den <d.kazakiewicz at gmail.com> wrote:
>>         Dear R community
>>         Recently, Henrique Dallazuanna literally saved me by solving
>>         the following data transformation problem:
>>         
>>         (n_, _n, j_, k_ signify numbers)
>>         
>>         SOURCE DATA:
>>         id      cycle1  cycle2  cycle3  …       cycle_n
>>         1       c       c       c               c
>>         1       m       m       m               m
>>         1       f       f       f               f
>>         2       m       m       m               NA
>>         2       f       f       f               NA
>>         2       c       c       c               NA
>>         3       a       a       NA              NA
>>         3       c       c       c               NA
>>         3       f       f       f               NA
>>         3       NA      NA      m               NA
>>         ...........................................
>>         
>>         
>>         Q: How to transform source data to:
>>         RESULT DATA:
>>         id      cyc1    cyc2    cyc3    …       cyc_n
>>         1       cfm     cfm     cfm             cfm
>>         2       cfm     cfm     cfm
>>         3       acf     acf     cfm
>>         ...........................................
>>         
>>         
>>         
>>         Henrique's solution is:
>>         
>>         aggregate(. ~ id, lapply(df, as.character),
>>                   FUN = function(x) paste(sort(x), collapse = ''),
>>                   na.action = na.pass)
>>
>> The first part, . ~ id, is the formula. It's using every available
>> variable in the input data on the left hand side of the formula except
>> for id, which is the grouping variable.
>>
>> The data object is lapply(df, as.character), which is a list object
>> that translates every element to character. I'm guessing that each
>> element of the list is a character string or list of character
>> strings, but I'm not sure. It looks like the individual characters of
>> each cycle comprise a list component within id. (??)  [My guess: the
>> result of lapply() is a list of lists. The top-level list components
>> correspond to the id's, while the second-level components are the
>> cycle variables, whose elements are the characters in each cycle
>> variable for each row with the same id.]
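
The structure of the data object can be checked directly. A minimal
sketch, assuming a small made-up data frame (the name df and its values
are illustrative only): lapply() over a data frame visits its columns,
so the result is a flat list with one character vector per column, id
included. Nothing is grouped by id at this stage; the grouping is done
later by aggregate().

```r
# Hypothetical miniature of the poster's data
df <- data.frame(
  id     = c(1, 1, 2),
  cycle1 = c("c", "m", NA),
  stringsAsFactors = FALSE
)

chr <- lapply(df, as.character)

# One list component per column, each a character vector of length nrow(df)
str(chr)
print(names(chr))
print(vapply(chr, is.character, logical(1)))
```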
>>
>> The function to be applied to each id is described in FUN. As Peter
>> mentioned, it's an 'anonymous' function, which means it is defined
>> in-line. In this case, the elements of a generic input object x are
>> sorted in increasing order and then combined into a single string
>> (the purpose of collapse = ); NA values are skipped over. Thus, if my
>> hypothesis about the structure of the list is correct, the three
>> characters in each cycle/id combination are first sorted and then
>> combined into a single string, which is then output as the result.
>> Because of the way Henrique used the formula, the aggregate()
>> function will march through each cycle variable within id and execute
>> the function, and then iterate the process over all id's.
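
What aggregate() does for a single cycle variable can be imitated by
hand with split() and sapply(). This is only a sketch of the idea, not
the internal code of aggregate(); the vectors id and cycle3 and the
helper name collapse_sorted are made up for illustration.

```r
# Named version of the anonymous function from the solution
collapse_sorted <- function(x) paste(sort(x), collapse = "")

# Hypothetical values for one cycle column and its grouping variable
id     <- c(1, 1, 1, 3, 3, 3, 3)
cycle3 <- c("c", "m", "f", NA, "c", "f", "m")

# Break the column into one piece per id, then apply the function to each
pieces <- split(cycle3, id)
by_id  <- sapply(pieces, collapse_sorted)

print(pieces)
print(by_id)  # one string per id
```

aggregate() repeats this for every cycle column and binds the results
to the id column.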
>>
>>         
>>         
>>         Could somebody EXPLAIN HOW IT WORKS?
>>         Henrique really did save my investigation.
>>         However, considering that I am about to perform an
>>         investigation of cancer chemotherapy in 500 patients, it would
>>         be nice to know what I am actually doing.
>>
>> Henrique's R knowledge is on a different level from most of us, so I
>> understand your question :) 
>>
>>         
>>         1. All the help says about the LHS in formulas like '.~id' is
>>         that its name is "dot notation", and not a single word more.
>>         Thus, I have no clue what the dot in that formula really means.
>>
>> . is shorthand for 'everything not otherwise specified in the model
>> formula'. In this case, it represents the entire set of cycle
>> variables.
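
One way to see what the dot stands for is to write the left-hand side
out by hand and compare. A minimal sketch, assuming a made-up data frame
with two cycle columns (all names and values are illustrative): here
'. ~ id' should give the same result as 'cbind(cycle1, cycle2) ~ id'.

```r
# Hypothetical miniature of the poster's data
df <- data.frame(
  id     = c(1, 1, 2, 2),
  cycle1 = c("m", "c", "f", NA),
  cycle2 = c("f", "c", NA, NA),
  stringsAsFactors = FALSE
)
chr <- lapply(df, as.character)
collapse_sorted <- function(x) paste(sort(x), collapse = "")

# Dot on the left-hand side: every variable not named on the right
res_dot <- aggregate(. ~ id, chr, FUN = collapse_sorted,
                     na.action = na.pass)

# The same left-hand side spelled out explicitly
res_explicit <- aggregate(cbind(cycle1, cycle2) ~ id, chr,
                          FUN = collapse_sorted, na.action = na.pass)

print(res_dot)
print(isTRUE(all.equal(res_dot, res_explicit)))
```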
>>  
>>
>>         2. help says:
>>          Note that ‘paste()’ coerces ‘NA_character_’, the character
>>         missing value, to ‘"NA"’
>>         And at the same time:
>>          ‘na.pass’ returns the object unchanged.
>>         I am happy that I don't have NAs in my data. I just don't
>>         understand how it happened.
>>         3. I can't see the real difference between 'FUN = function(x)
>>         paste(x)' and 'FUN = paste'. However, the former works
>>         perfectly while the latter simply does not.
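
On question 3, the difference that matters is probably the collapse
argument rather than the function wrapper itself. A small sketch with a
made-up vector: paste() without collapse returns one string per element,
while the function used in the solution returns a single string per
group, which is what a one-row-per-id summary needs.

```r
# Hypothetical values for one id/cycle group
x <- c("c", "m", "f")

plain     <- paste(x)                 # still three separate strings
collapsed <- paste(x, collapse = "")  # a single string

# The anonymous function from the solution, given a name here
collapse_sorted <- function(x) paste(sort(x), collapse = "")
sorted_collapsed <- collapse_sorted(x)

print(plain)
print(collapsed)
print(sorted_collapsed)
```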
>>         
>>         
>>         All I can follow from the code above is that R breaks the data
>>         into groups with the same id, then tears each little 'cycle'
>>         piece into separate characters, then sorts them and puts these
>>         characters together within the same id for each 'cycle'. I
>>         don't see how R puts all this mess back together into a nice
>>         data frame in long format. The NAs are also a question, as I
>>         said before.
>>
>> By default, aggregate() will try to return a data frame. For each id,
>> it will output the id and the result of the function applied to each
>> cycle variable, so there should be one row for each id, and n + 1
>> columns for the n cycle variables + id.
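
The shape of the result can be checked on a small reconstruction of the
source table. This is a sketch only: the data frame below is rebuilt by
hand from the example in the original post, with cycle4 standing in for
cycle_n, so its exact values are an assumption.

```r
# Hand-built stand-in for the SOURCE DATA table (cycle4 plays cycle_n)
df <- data.frame(
  id     = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  cycle1 = c("c", "m", "f", "m", "f", "c", "a", "c", "f", NA),
  cycle2 = c("c", "m", "f", "m", "f", "c", "a", "c", "f", NA),
  cycle3 = c("c", "m", "f", "m", "f", "c", NA, "c", "f", "m"),
  cycle4 = c("c", "m", "f", NA, NA, NA, NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

res <- aggregate(. ~ id, lapply(df, as.character),
                 FUN = function(x) paste(sort(x), collapse = ""),
                 na.action = na.pass)

print(res)
print(dim(res))  # one row per id, one column per cycle variable plus id
```

Groups that contain only NAs come back as empty strings, which matches
the blank cells in the RESULT DATA table.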
>>
>> Does that help?
>>
>> Cheers,
>> Dennis 
>>
>>         
>>         Could you please shed some light on this, if you don't mind
>>         answering these naive questions?
>>         
>>         ______________________________________________
>>         R-help at r-project.org mailing list
>>         https://stat.ethz.ch/mailman/listinfo/r-help
>>         PLEASE do read the posting guide
>>         http://www.R-project.org/posting-guide.html
>>         and provide commented, minimal, self-contained, reproducible
>>         code.
>>
> 


