[R] Normalizing grouped data in a data frame

Fri Nov 9 16:41:09 CET 2007

On 11/9/2007 10:22 AM, Sandy Small wrote:
> Thank you very much.
> That works nicely.
> The trick I particularly needed was "within", which I didn't know about.

within() is new in R 2.6.0; it's a nice addition.  There's also
transform(), which could be used in this situation, replacing

within(subset, { Norm_LVEF <- LVEF/max(LVEF)
                  Norm_ES_Time <- ES_Time/max(ES_Time)})

by (what I think is) the equivalent

transform(subset, Norm_LVEF = LVEF/max(LVEF),
                   Norm_ES_Time = ES_Time/max(ES_Time))
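
One way to check the claimed equivalence is to run both forms on the
Base = B rows from the original question.  This is only a minimal sketch:
the name grp is a hypothetical stand-in for subset, and the comparison is
done by column name because the two functions may not append the new
columns in the same order.

```r
# Base = B rows taken from the example data in the original question
grp <- data.frame(Base = "B", Image = c(1, 3, 4),
                  LVEF = c(7.4, 7.2, 7.8), ES_Time = c(0.7, 0.8, 0.6))

a <- within(grp, { Norm_LVEF <- LVEF/max(LVEF)
                   Norm_ES_Time <- ES_Time/max(ES_Time) })
b <- transform(grp, Norm_LVEF = LVEF/max(LVEF),
                    Norm_ES_Time = ES_Time/max(ES_Time))

# Compare the new columns by name rather than by position
same_lvef <- isTRUE(all.equal(a$Norm_LVEF, b$Norm_LVEF))
same_time <- isTRUE(all.equal(a$Norm_ES_Time, b$Norm_ES_Time))
print(c(same_lvef, same_time))
```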

The within() function is somewhat more flexible, because you can execute
any block of code at all, rather than just simple assignments to columns.
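
To illustrate that extra flexibility, here is a small sketch (again on the
Base = B rows, with the hypothetical names grp, top and Is_Peak) in which
the block computes an intermediate value, uses it twice, and then removes
it so that it does not end up as a column:

```r
grp <- data.frame(Base = "B", Image = c(1, 3, 4),
                  LVEF = c(7.4, 7.2, 7.8), ES_Time = c(0.7, 0.8, 0.6))

res <- within(grp, {
  top <- max(LVEF)            # helper value, not wanted as a column
  Norm_LVEF <- LVEF / top     # normalised column
  Is_Peak <- LVEF == top      # flags the row holding the group maximum
  rm(top)                     # drop the helper before within() returns
})
print(res)
```

Doing the same with transform() would mean repeating max(LVEF) in each
argument, since its arguments are plain column assignments.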

If anyone with more experience with these functions (e.g. the author) 
notices any errors above, please correct me!
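
For anyone who wants to try the per-group step without installing the
package that provides sparseby(), the following is a rough base R sketch.
It swaps in split(), lapply() and rbind() in place of sparseby(), groups
by Base only, and uses just the first few A and B rows of the example
data; pieces and grp are hypothetical names.

```r
# A few rows of the example data from the original question
olddf <- data.frame(
  Base    = c("A", "A", "A", "B", "B", "B"),
  Image   = c(1, 2, 3, 1, 3, 4),
  LVEF    = c(4.32, 4.98, 3.7, 7.4, 7.2, 7.8),
  ES_Time = c(0.89, 0.67, 0.5, 0.7, 0.8, 0.6))

# Split by Base, normalise within each piece, then stack the pieces again
pieces <- lapply(split(olddf, olddf$Base), function(grp)
  transform(grp, Norm_LVEF = LVEF/max(LVEF),
                 Norm_ES_Time = ES_Time/max(ES_Time)))
newdf <- do.call(rbind, pieces)
rownames(newdf) <- NULL     # discard the "A.1", "B.4" style row names
print(newdf)
```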

Duncan Murdoch

> Also nice to get a data frame out with "sparseby" instead of just a 
> multi-array with "by".
> Sandy
> 
> Duncan Murdoch wrote:
>> Sandy Small wrote:
>>> Hi
>>> I am a newbie to R but have tried a number of ways in R to do this 
>>> and can't find a good solution. (I could do it out of R in perl or 
>>> awk but would like to know how to do this in R).
>>>
>>> I have a large data frame (49 variables and 7000 observations); however, 
>>> for simplicity I can express it in the following data frame
>>>
>>> Base, Image, LVEF, ES_Time
>>> A, 1,  4.32, 0.89
>>> A, 2, 4.98, 0.67
>>> A, 3, 3.7, 0.5
>>> A, 3, 4.1, 0.8
>>> B, 1, 7.4, 0.7
>>> B, 3, 7.2, 0.8
>>> B, 4, 7.8, 0.6
>>> C, 1, 5.6, 1.1
>>> C, 4, 5.2, 1.3
>>> C, 5, 5.9, 1.2
>>> C, 6, 6.1, 1.2
>>> C, 7, 3.2, 1.1
>>>
>>> For each value of LVEF and ES_Time I would like to normalise the 
>>> value to the maximum for that factor grouped by Base or Image number, 
>>> adding an extra column to the data frame with the normalised value in 
>>> it.
>>>
>>> So for the Base = B group in the data frame (the data frame should 
>>> have the same length; I'm just showing the B part) I would get a 
>>> modified data frame as follows.
>>>
>>> Base, Image, LVEF, ES_Time, Norm_LVEF, Norm_ES_Time
>>> ...
>>> B, 1, 7.4, 0.7, 7.4/7.8, 0.7/0.8
>>> B, 3, 7.2, 0.8, 7.2/7.8, 0.8/0.8
>>> B, 4, 7.8, 0.6, 7.8/7.8, 0.6/0.8
>>> ...
>>>
>>> Where the results of the division would replace the division shown here.
>>> I hope this makes sense.
>>> If anyone can help I would be very grateful.
>>>   
>> You want to look at the by(), tapply() or sparseby() functions (the 
>> latter in the reshape package, the others are in base R).
>>
>> For example, I think this untested code does what you want:
>>
>> newdf <- sparseby(olddf, c("Base", "Image"),
>>                   function(subset)
>>                       within(subset,
>>                              { Norm_LVEF <- LVEF/max(LVEF)
>>                                Norm_ES_Time <- ES_Time/max(ES_Time)
>>                              }))
>>
>> where olddf is the old dataframe, and newdf is newly created.
>>
>> Duncan Murdoch