[R] r-data partitioning considering two variables (character and numeric)

MacQueen, Don m@cqueen1 @end|ng |rom ||n|@gov
Tue Aug 28 01:14:45 CEST 2018


And yes, I ignored Genotype, but for the example data none of the stand_ID values are present in more than one Genotype, so it doesn't matter. If that's not true in general, then constructing the grp variable is a little more complex, but the principle is the same.
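
If stand_ID values can repeat across genotypes, one way to build grp is from a key that combines both columns. The following is only a minimal sketch under assumptions: the toy mydata below is made up so that stand 7 occurs under both H13 and H15 (it is not the posted data), and key and training_keys are illustrative names of my own.

```r
# Toy data (hypothetical): stand_ID 7 deliberately appears under two genotypes.
mydata <- data.frame(
  Genotype = c("H13", "H13", "H15", "H15", "U21"),
  stand_ID = c(7, 18, 7, 9, 67),
  stringsAsFactors = FALSE
)

# One key per Genotype/stand_ID combination, e.g. "H13.7".
key <- paste(mydata$Genotype, mydata$stand_ID, sep = ".")

# Hypothetical, hand-picked combinations for the training set.
training_keys <- c("H13.7", "H15.9", "U21.67")

# Same principle as before, but matching on the combined key.
grp <- ifelse(key %in% training_keys, "A-training", "B-testing")
parts <- split(mydata, grp)
```

With this key, stand 7 of H13 goes to training while stand 7 of H15 goes to testing, which matching on stand_ID alone could not express.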

--
Don MacQueen
Lawrence Livermore National Laboratory
7000 East Ave., L-627
Livermore, CA 94550
925-423-1062
Lab cell 925-724-7509
 
 

On 8/27/18, 4:10 PM, "R-help on behalf of MacQueen, Don via R-help" <r-help-bounces using r-project.org on behalf of r-help using r-project.org> wrote:

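    You could start with split():
    
    # label each row by its stand_ID; any stand not listed keeps the '' label
    grp <- rep('', nrow(mydata) )
    grp[mydata$stand_ID %in% c(7,9,67)] <- 'A-training'
    grp[mydata$stand_ID %in% c(3,18,20,21,32)] <- 'B-testing'
    
    # returns a list with one data frame per label
    split(mydata, grp)
    
    or perhaps
    
    # everything that is not a training stand goes to testing
    grp <- ifelse(  mydata$stand_ID %in% c(7,9,67) , 'A-training', 'B-testing' )
    split(mydata, grp)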
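    
    If the training stands should be drawn at random rather than listed by hand, the same grp idea still applies. A rough sketch, assuming one training stand per Genotype is wanted and that stand_ID values are unique across genotypes (true for the posted data); the seed and the names stands and train_ids are illustrative choices, and only the two grouping columns of the posted data are reproduced:

    ```r
    # Genotype and stand_ID columns of the posted data (other columns omitted).
    mydata <- data.frame(
      Genotype = rep(c("H13", "H15", "U21"), times = c(8, 7, 5)),
      stand_ID = c(7, 7, 7, 18, 18, 18, 21, 21,
                   9, 9, 9, 20, 20, 20, 32,
                   3, 3, 3, 67, 67)
    )

    set.seed(1)  # arbitrary seed, only for reproducibility

    # One row per Genotype/stand_ID combination.
    stands <- unique(mydata[, c("Genotype", "stand_ID")])

    # Draw one stand per genotype; indexing with sample.int() avoids the
    # sample(x, 1) surprise when a genotype has a single stand.
    train_ids <- vapply(
      split(stands$stand_ID, stands$Genotype),
      function(ids) ids[sample.int(length(ids), 1)],
      numeric(1)
    )

    # Whole stands go to one side or the other, never both.
    grp <- ifelse(mydata$stand_ID %in% train_ids, "A-training", "B-testing")
    parts <- split(mydata, grp)
    ```

    Because rows are assigned by stand, all inventory dates of a stand end up in the same part.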
    
    -Don
    
    On 8/27/18, 3:54 PM, "R-help on behalf of Ahmed Attia" <r-help-bounces using r-project.org on behalf of ahmedatia80 using gmail.com> wrote:
    
        I would like to partition the following dataset (dataGenotype) based
        on two variables, Genotype and stand_ID. For example, for Genotype
        H13, stand_ID 7 may go to training and stand_IDs 18 and 21 may go
        to testing.
        
        Genotype  stand_ID  Inventory_date  stemC       mheight
        H13       7         5/18/2006       1940.1075   11.33995
        H13       7         11/1/2008       10898.9597  23.20395
        H13       7         4/14/2009       12830.1284  23.77395
        H13       18        11/3/2005       2726.42     13.4432
        H13       18        6/30/2008       12226.1554  24.091967
        H13       18        4/14/2009       14141.68    25.0922
        H13       21        5/18/2006       4981.7158   15.7173
        H13       21        4/14/2009       20327.0667  27.9155
        H15       9         3/31/2006       3570.06     14.7898
        H15       9         11/1/2008       15138.8383  26.2088
        H15       9         4/14/2009       17035.4688  26.8778
        H15       20        1/18/2005       3016.881    14.1886
        H15       20        10/4/2006       8330.4688   20.19425
        H15       20        6/30/2008       13576.5     25.4774
        H15       32        2/1/2006        3426.2525   14.31815
        U21       3         1/9/2006        3660.416    15.09925
        U21       3         6/30/2008       13236.29    24.27634
        U21       3         4/14/2009       16124.192   25.79562
        U21       67        11/4/2005       2812.8425   13.60485
        U21       67        4/14/2009       13468.455   24.6203
        
        The desired output is the following:
        
        A-training
        
        Genotype  stand_ID  Inventory_date  stemC       mheight
        H13       7         5/18/2006       1940.1075   11.33995
        H13       7         11/1/2008       10898.9597  23.20395
        H13       7         4/14/2009       12830.1284  23.77395
        H15       9         3/31/2006       3570.06     14.7898
        H15       9         11/1/2008       15138.8383  26.2088
        H15       9         4/14/2009       17035.4688  26.8778
        U21       67        11/4/2005       2812.8425   13.60485
        U21       67        4/14/2009       13468.455   24.6203
        
        B-testing
        
        Genotype  stand_ID  Inventory_date  stemC       mheight
        H13       18        11/3/2005       2726.42     13.4432
        H13       18        6/30/2008       12226.1554  24.091967
        H13       18        4/14/2009       14141.68    25.0922
        H13       21        5/18/2006       4981.7158   15.7173
        H13       21        4/14/2009       20327.0667  27.9155
        H15       20        1/18/2005       3016.881    14.1886
        H15       20        10/4/2006       8330.4688   20.19425
        H15       20        6/30/2008       13576.5     25.4774
        H15       32        2/1/2006        3426.2525   14.31815
        U21       3         1/9/2006        3660.416    15.09925
        U21       3         6/30/2008       13236.29    24.27634
        U21       3         4/14/2009       16124.192   25.79562
        
        I tried the following code:
        
        library(caret)
        dataPartitioning <- createDataPartition(dataGenotype$stand_ID, times = 1, p = 0.2, list = FALSE)
        train <- dataGenotype[dataPartitioning, ]
        test <- dataGenotype[-dataPartitioning, ]
        
        I also tried:
        
        createDataPartition(unique(dataGenotype$stand_ID), times = 1, p = 0.2, list = FALSE)
        
        Neither attempt produced the desired output: the data are partitioned
        within stand_ID. For example, one row of stand_ID 7 goes to training and
        two rows of stand_ID 7 go to testing. How can I partition the data by
        Genotype and stand_ID together?
        
        
        
        Ahmed Attia
        
        ______________________________________________
        R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
        https://stat.ethz.ch/mailman/listinfo/r-help
        PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
        and provide commented, minimal, self-contained, reproducible code.
        
    