[R] R - Populate Another Variable Based on Multiple Conditions | For a Large Dataset

Jeff Newmiller jdnewmil at dcn.davis.ca.us
Sun Jul 3 20:34:21 CEST 2016


I still get the impression, from the way you mix information types, that you are treating this as if it were Excel.

Perhaps something like

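# Running count of "Y" rows within each ID (0 before the first administration)
drug_study$admin_period <- ave( "Y" == drug_study$drug_admin, drug_study$ID, FUN=cumsum )
library(dplyr)
# One row per ID and administration period, holding the period's first date
result0 <- (   drug_study
          %>% filter( 0 != admin_period )
          %>% group_by( ID, admin_period )
          %>% summarise( start = min( date ) )
          %>% mutate( admin_period1 = admin_period - 1 )
          )
# Pair each period with the start of the following period, used as its end.
# The left side keeps admin_period, because the join key refers to it.
result <- (   result0
          %>% select( -admin_period1 )
          %>% inner_join( result0 %>% select( ID, admin_period1, end=start )
                        , by = c( ID="ID", admin_period="admin_period1" )
                        )
          %>% mutate( ddays = end - start )
          )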
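
If the per-row layout has to be kept (each record carrying the start date of its own administration period), the same cumsum idea can be applied without collapsing the data. The following is only a minimal base-R sketch on made-up rows: the toy values, the choice to leave study_id as NA before the first "Y", the choice to let each "Y" row start a new period with its own date, and the ddays column are assumptions for illustration, not something taken from the real dataset. It also assumes the rows are already sorted by date within each ID.

```r
# Toy data (hypothetical values, laid out like the quoted sample)
drug_study <- data.frame(
  ID         = c(rep("R1/3", 5), rep("J1/3", 4)),
  date       = c("5/11/07", "5/16/07", "1/28/08", "2/4/08", "2/11/08",
                 "12/29/08", "1/5/09", "1/12/09", "1/4/10"),
  drug_admin = c("Y", "", "", "Y", "", "", "Y", "", "Y"),
  stringsAsFactors = FALSE
)

# Keep dates as Date rather than text so min() and subtraction work on them
drug_study$date <- as.Date(drug_study$date, format = "%m/%d/%y")

# Running count of "Y" rows within each ID: 0 before the first "Y",
# 1 from the first "Y" onward, 2 from the second, and so on
drug_study$admin_period <- ave(as.integer(drug_study$drug_admin == "Y"),
                               drug_study$ID, FUN = cumsum)

# Earliest date within each (ID, admin_period) group, computed on the
# numeric day count and converted back to Date afterwards
key       <- paste(drug_study$ID, drug_study$admin_period)
start_num <- ave(as.numeric(drug_study$date), key, FUN = min)
drug_study$study_id <- as.Date(start_num, origin = "1970-01-01")

# Rows before the first administration have no period start (assumption)
drug_study$study_id[drug_study$admin_period == 0] <- NA

# Days elapsed since the start of the current administration period
drug_study$ddays <- as.integer(drug_study$date - drug_study$study_id)

print(drug_study)
```

Because study_id stays a Date column, it can still be used later for other date arithmetic.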
-- 
Sent from my phone. Please excuse my brevity.

On July 3, 2016 10:24:51 AM PDT, Kevin Wamae <KWamae at kemri-wellcome.org> wrote:
>Hi Jeff, it’s been an uphill task working with the dataset and I am not
>the first to complain. Nonetheless, data-cleaning is ongoing and since
>I cannot wait for that to get done, I decided to make the most of what
>the dataset looks like at this time. It appears the process may take a
>while.
>
>Thanks for the script. From the output, I noticed that “result”
>contains the first and last date for each individual and does not
>take into account the variable “drug_admin”.
>
>ID     start    end
>J1/3   1/5/09   12/25/10
>R1/3   1/4/07   12/15/08
>R10/1  1/4/07   3/5/12
>
>My aim is to pick the date, for example in 2007, where drug_admin ==
>“Y” as my start and the date in the subsequent year (2008 in this case)
>where drug_admin == “Y” as my end. Then, I should populate the variable
>“study_id” with “start” up to the entry just above the one whose date
>matches “end”, as the output below shows (I hope its structure is
>maintained, as I have copied it from RStudio). The goal for now is to
>then get the difference in days between “date” and “study_id” and still
>keep the “study_id” column, as I might use it later.
>
>From the output, it can be seen that for this individual, the dates run
>from 2007 to 2008. However, for some individuals, the dates run from
>2008 to 2009, 2009 to 2010 and so on. Therefore, I need the script to
>deal with all the years, as the dates range from 2001 to 2016.
>
>ID    date      drug_admin  year  month  study_id
>R1/3  5/11/07   Y           2007  5      5/11/07
>R1/3  5/16/07               2007  5      5/11/07
>R1/3  5/22/07               2007  5      5/11/07
>R1/3  5/28/07               2007  5      5/11/07
>R1/3  6/5/07                2007  6      5/11/07
>R1/3  6/11/07               2007  6      5/11/07
>R1/3  6/18/07               2007  6      5/11/07
>R1/3  6/25/07               2007  6      5/11/07
>R1/3  7/2/07                2007  7      5/11/07
>R1/3  7/16/07               2007  7      5/11/07
>R1/3  7/29/07               2007  7      5/11/07
>R1/3  8/2/07                2007  8      5/11/07
>R1/3  8/7/07                2007  8      5/11/07
>R1/3  8/13/07               2007  8      5/11/07
>R1/3  9/18/07               2007  9      5/11/07
>R1/3  9/24/07               2007  9      5/11/07
>R1/3  10/6/07               2007  10     5/11/07
>R1/3  10/8/07               2007  10     5/11/07
>R1/3  10/15/07              2007  10     5/11/07
>R1/3  10/22/07              2007  10     5/11/07
>R1/3  10/29/07              2007  10     5/11/07
>R1/3  11/8/07               2007  11     5/11/07
>R1/3  11/12/07              2007  11     5/11/07
>R1/3  11/19/07              2007  11     5/11/07
>R1/3  11/29/07              2007  11     5/11/07
>R1/3  12/6/07               2007  12     5/11/07
>R1/3  12/10/07              2007  12     5/11/07
>R1/3  12/21/07              2007  12     5/11/07
>R1/3  1/7/08                2008  1      5/11/07
>R1/3  1/14/08               2008  1      5/11/07
>R1/3  1/21/08               2008  1      5/11/07
>R1/3  1/28/08               2008  1      5/11/07
>R1/3  2/4/08    Y           2008  2
>
>
>Regards
>-------------------------------------------------------------------------------
>Kevin Wamae
>
>On 7/3/16, 7:05 PM, "Jeff Newmiller" <jdnewmil at dcn.davis.ca.us> wrote:
>
>result <- setNames( data.frame( aggregate( date~ID, data=drug_study, FUN=min )
>                              , aggregate( date~ID, data=drug_study, FUN=max )[2] )
>                  , c( "ID", "start", "end" ) )