--- title: "Introduction to maxRgain" subtitle: "An integer-programming method of maximizing genetic gains in polyclonal selection" author: "Sónia Surgy, Jorge Cadima, Elsa Gonçalves" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to maxRgain} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup, include=FALSE} library(maxRgain) ``` # Introduction Polyclonal selection is a breeding strategy developed for ancient grapevine varieties, but of more widespread interest. It seeks to select a group of different genotypes that collectively meet predefined desirable criteria for quantitative traits. Polyclonal selection, as opposed to the selection of a single genotype, helps to preserve genetic diversity and provides greater environmental stability, while still securing high genetic gains. This approach relies on the predictors of the genetic effects obtained from the fitting of a mixed model. Rather than focusing on individual genotypes, polyclonal selection considers the group as the unit of selection, with the goal of maximizing the overall predicted genetic gain across multiple traits of interest. Identifying such a superior group is an optimization problem, in which the objective is to select a subset of genotypes that jointly maximize the expected genetic value, subject to constraints which satisfy agronomic and enological interests. Most existing methods for genetic selection based on several traits, such as selection indices, are designed to maximize the genetic gain of selection of a single genotype. When the objective is to select a group of genotypes, rather than a single genotype, the conventional approach has been to simply choose the top-ranking genotypes. However, this strategy does not ensure that the selected group, as a whole, satisfies predefined breeding objectives when multiple traits are considered simultaneously. This package implements a novel Integer Programming-based method developed by Surgy et al. (2025), designed specifically for group-based selection under multiple trait criteria. While the methodology is demonstrated using data from grapevine breeding trials, it is broadly applicable across different species and breeding programs, especially in scenarios where group performance is more relevant than individual merit alone. For a complete description of the methodology, please refer to: Surgy, S., Cadima, J. & Gonçalves, E. Integer programming as a powerful tool for polyclonal selection in ancient grapevine varieties. Theor Appl Genet 138, 122 (2025). [https://doi.org/10.1007/s00122-025-04885-0](https://doi.org/10.1007/s00122-025-04885-0) Each function in the package includes an example that reproduces scenarios described in the original publication, thereby facilitating both practical understanding and a direct application of the method. This document provides a more detailed explanation of the outputs generated by the various functions and offers guidance on how to interpret them effectively. # Data The package includes a real dataset - Gouveio - to ensure the reproducibility of the examples and to allow users to explore additional scenarios. This dataset contains Empirical Best Linear Unbiased Predictors (EBLUPs) of genotypic effects, obtained from the fitting of a linear mixed model to phenotypic data from a selection field trial involving 150 genotypes of the grapevine variety Gouveio, established according to a resolvable row-column design. ```{r} head(Gouveio) ``` The traits evaluated were: * Yield (yd) * Potential Alcohol (pa) * Total Acidity (ta) * pH (ph) * Berry Weight (bw) The selection criteria are defined by the breeder. In this example, they are: to increase yield, potential alcohol, and total acidity, while decreasing pH and berry weight. # Common arguments Below is a table describing the common arguments used across the functions in this package: | Argument | Description | Used in functions | |--------------|------------------------------|-------------| | `traits` | A vector with the names of the traits to be optimized, i.e., those included in the objective function| All | | `ref` | Name of the reference column (e.g., genotype ID). Defaults to the first column. | All | | `clmin` | Integer, indicating the minimum group size (default is 1). | All | | `clmax` | Integer, indicating the maximum group size. If omitted, equal to `clmin` | All | | `dmg` | Can be of one of two types. In the standard form, it will be a dataframe with three columns defining constraints: trait names, constraint relations (`">="`, `"<="`), and *right-hand side* values. Alternatively, it can be given the text value `dmg = "base"` which sets the *right-hand side* of the constraints of all traits to zero. |`polyclonal()` | | `constraints`| Named numeric vector with traits to which constraints apply. If omitted, all except `ref` are used. |`rmaxa()` | | `meanvec` | Named numeric vector of the means for each trait. If omitted, data are assumed to be normalized (divided) by the mean. |All | | `criteria` | Named numeric vector with the criterion for each trait: 1 to increase, -1 to decrease. If omitted, all traits are assumed to be increasing. |All | | `data` | A dataframe with the predictors of genotypic effects for the several traits. Rows are genotypes; columns are traits. | All | # Important notes * If the gain for a group of a certain size is not displayed, it means that it was not possible to satisfy all the constraints for that group size. It is possible that constraints are not feasible for some group sizes but are feasible for others. * The genotypes selected in a group of a given size may differ substantially from those selected in a group of another size. Changing any condition — including the group size — results in a new optimisation problem. * The dataframe used for the `dmg` argument can have any column names, but the order must be respected: first the trait, then the relation (e.g., ">="), and finally the *right-hand side* value, that is, the value for the minimum desired gain in percentage of the overall mean of the trait. It should be noted that positive values are always understood as *improvements*, even for a criterion value of -1. Thus, >= 3 with criterion value -1 is understood as a decrease of at least 3% Consequently, minimum desired gains should be defined consistently, regardless of the selection criterion used. * If a loss in a trait is admissible, the maximum admissible loss must be indicated as a negative number on the right-hand side of the `dmg` argument. On the other hand, if the goal is to establish an upper bound on the increase of the gain of a trait, the relation in the `dmg` argument must be specified as "<=". * If no group of any requested size satisfies the specified conditions, the message 'No possible solution!' is returned. # Functions ## `polyclonal()` This is the main function of this package. ```{r, eval = FALSE} polyclonal(traits, ref = NULL, clmin = 2, clmax, dmg = NULL, meanvec = NULL, criteria = NULL, data) ``` In the example for the `polyclonal()` function, the objective is to maximize the genetic gain of groups with sizes ranging from 7 to 20 genotypes, while enforcing desired minimum gains for the five traits (expressed as percentages of the overall mean of each trait in the field trial): 20% for yield (yd), 3% for potential alcohol (pa), 3% for total acidity (ta), 1% for pH, and 2% for berry weight (bw). ```{r} # Named numeric vector with the trait means mymeanvec <- c(yd = 3.517, pa = 12.760, ta = 4.495, ph = 3.927, bw = 1.653) # Vector with the traits to be optimized mytraits <- c("yd", "pa", "ta", "ph", "bw") # Named numeric vector with the selection criteria mycriteria <- c(yd = 1, pa = 1, ta = 1, ph = -1, bw = -1) # Name of the reference column with the genotype labels myref = "Clone" ``` If `meanvec` is omitted, it is assumed that all trait values have already been standardized by dividing by their respective overall means. If `criteria` is omitted, the default assumption is that an increase is desired for all traits. If `ref` is omitted, the first column in the dataset is assumed to contain the genotype labels. In the `polyclonal()` function the constraints are indicated by argument `dmg` (desired minimum gains). The standard form of this argument is a data frame with three columns: the first specifying the traits to be constrained; the second with the constraint relation; the second with the relation of the constraint (">=" or "<="); and the third with the *right-hand side* values of the constraints. Column names are arbitrary, but their order must be preserved. ```{r} # dataframe with the desired minimum gains mydmg <- data.frame( lhs = c("yd", "pa", "ta", "ph", "bw"), rel = c(">=", ">=", ">=", ">=", ">="), rhs = c(20, 3, 3, 1, 2) ) ``` Note that the positive values of 1 and 2 for pH (ph) and berry weight (bw), which have selection criteria -1, mean a decrease of at least 1% in pH and a decrease of at least 3% in berry weight. ```{r} mydmg ``` Now, the `polyclonal` function is called to perform the selection based on the specified constraints: ```{r} # Using polyclonal() function with_dmg <- polyclonal( traits = mytraits, clmin = 7, clmax = 20, dmg = mydmg, meanvec = mymeanvec, criteria = mycriteria, data = Gouveio ) ``` Groups ranging from 7 to 20 genotypes were requested, as these cover the most appropriate group sizes for practical applications of the polyclonal methodology for grapevine selection. However, other group sizes can also be specified. In fact, it is possible to request a single group size by assigning the same value to both `clmin` and `clmax` or by omitting `clmax`. Analysing the output reveals how the imposed constraints influence the selection results. ```{r} # Results with_dmg ``` The output displays the genetic gain achieved for each trait across the different group sizes. All values are expressed as percentages of the trait mean. For example, in the 10-genotype group, the gains obtained were approximately 21.4%, 3.0%, 3.3%, 1.1% and 2.3% for yield (yd), potential alcohol (pa), total acidity (ta), pH (ph), and berry weight (bw), respectively. Positive values indicate gains in accordance with the selection criteria. In this case, although group sizes between 7 and 20 genotypes were requested, the object `gain` only provides results for groups of 7, 8, and 10 genotypes. This indicates that the desired minimum gains could not be achieved for other group sizes within the specified range. Examining the lists of genotypes within each group reveals that genotype groups of different sizes are not necessarily nested. For example, genotype GV060 is not included in the group of 7 genotypes, appears in the group of 8 genotypes, and is then excluded from the group of 10 genotypes. The `summary()` method can be used with `polyclonal()` object to retrieve a summary of the initial conditions and results. ```{r} # Summary results summary(with_dmg) ``` The `summary()` indicates how many groups were achieved for those conditions and the related dimensions. In the table, the first column gives the trait names, followed by the means, the criteria and the desired minimum gain (DMG). The Last two columns, "MaxGain" and "MaxGroup" show the maximum gain achieved for each trait and the group size where that gain occurs, respectively. The so-called *base situation*, as described in the original publication of the method (Surgy et al., 2025), consists in maximizing the predicted genetic gain of the target traits while ensuring that no trait decreases. To implement this, the right-hand side of all constraints is set to zero. To simplify the computation of this scenario, the argument `dmg` can be set to `"base"`. In that case, all traits present in the dataset are included as constraints, each required to be greater than or equal to zero, except for the reference trait specified in `ref`. If `ref` is omitted, the first column of the dataset is assumed to correspond to the reference trait. It is important to note that when using the "base" mode, the user must provide values for both `meanvec` and `criteria` for *all* traits in the dataset. If the user wishes to exclude any traits from the constraint set, the full specification of `dmg` must be provided instead. The following example illustrates the application of the base situation. Groups of 7 to 12 clones are selected under these conditions. The same values for `traits`, `meanvec`, and `criteria` are used throughout. ```{r} # Named numeric vector with the trait means mymeanvec <- c(yd = 3.517, pa = 12.760, ta = 4.495, ph = 3.927, bw = 1.653) # Vector with the traits to be optimized mytraits <- c("yd", "pa","ta", "ph", "bw") # Named numeric vector with the selection criteria mycriteria <- c(yd = 1, pa = 1, ta = 1, ph = -1, bw = -1) base_sit <- polyclonal( traits = mytraits, clmin = 7, clmax = 12, dmg = "base", meanvec = mymeanvec, criteria = mycriteria, data = Gouveio ) ``` Unlike what happens with a selection index, where increasing the group size typically involves the sequential inclusion of genotypes according to a fixed ranking, the method implemented here allows each group size to be independently optimized. As a result, groups of different sizes are not necessarily nested. ```{r} base_sit ``` For example, consider the groups of 7 and 8 genotypes. In the 8-genotype group, not just one but two genotypes differ from the groups of 7: genotypes GV089 and GV137. On the other hand, genotype GV132, which is included in the group of 7 genotypes, is not present in the group of 8. This illustrates how the inclusion of constraints can lead to changes in group composition, rather than simply adding the next best-ranked individual. In the base situation, `summary()` show all DMG constraints as 0. ```{r} summary(base_sit) ``` ## `rmaxp()` ```{r, eval = FALSE} rmaxp(traits, ref = NULL, clmin = 2 , clmax, meanvec = NULL, criteria = NULL, data) ``` The package includes a function `rmaxp()`, which computes the maximum possible gain for each trait. The maximum possible gain corresponds to the value obtained by selecting the top genotypes for each trait individually, without applying any constraints. The function allows the user to request the gain for one or multiple traits simultaneously. The output displays one column per trait; however, the values in different columns are not linked to the same solution and should be interpreted independently. In the following example, it is asked the maximum possible gain for `yd` and `pa` for group sizes from 9 to 20 genotypes. ```{r} # Named numeric vector with the trait means mymeanvec <- c(yd = 3.517, pa = 12.760) # Vector with the traits to be optimized mytraits <- c("yd", "pa") ``` The vector specifying the selection criteria can be omitted, as the default value of 1 is used for both traits. With this function, no constraints are applied. ```{r} # Using rmaxp() max_pos_gain <- rmaxp( traits = mytraits, clmin = 9, clmax = 20, meanvec = mymeanvec, data = Gouveio ) # Results max_pos_gain ``` As in previous outputs, one column per trait is shown. However, in this case, the gain reported in each column refers to an independent solution. For example, the group of 20 genotypes achieving a yield gain of 39.5% is not composed of the same genotypes as the group achieving a potential alcohol gain of 5.9%. This can be verified by examining the output stored in the selected object. Since no constraints are applied in this scenario, the output represents a ranking of genotypes for each trait. Increasing the group size simply adds the next genotype in the ranking. Thus, each column corresponds to the top genotypes for a given trait. The final column, Entry, indicates the smallest group size at which each genotype is first included. It means that, for example, for yield, the group size of 9 is composed by genotypes from GV084 to GV145. Genotype GV137 appears for the first time in yield when the group size reaches 10 genotypes. When the group size increases to 11, all the previous 10 genotypes remain in the group and genotype GV079 is added, and so on. No customized `summary()` is provided, given that the conditions do not vary and no constraints are applied. ## `rmaxa()` ```{r , eval=FALSE} rmaxa(traits, ref = NULL, clmin = 2, clmax, constraints = NULL, meanvec = NULL, criteria = NULL, data) ``` The package includes a function `rmaxa()`, which computes the maximum admissible gains. The maximum admissible gain corresponds to the value obtained by maximizing one trait while avoiding losses in all others, enforced through constraints—predefined in this case as *>= 0*. As with `rmaxp()`, the user may request gains for one or multiple traits simultaneously. The output displays one column per trait; however, the values in different columns correspond to independent solutions and should be interpreted separately. This distinguishes the output of the gain object from this function from that of the `polyclonal()` function with `dmg = "base"`, where the sum of the values related to all traits are optimized. ```{r} # Named numeric vector with the trait means mymeanvec <- c(yd = 3.517, pa = 12.760, ta = 4.495, ph = 3.927, bw = 1.653) # Vector with the traits to be optimized mytraits <- c("yd", "pa") # Named numeric vector with the selection criteria mycriteria <- c(yd = 1, pa = 1, ta = 1, ph = -1, bw = -1) ``` In this function, if the traits to which the constraints apply are not specified, all the traits are assumed to be greater or equal than zero. If this vector is omitted, all columns in the dataset are considered as constrained traits, except the one indicated in `ref` or the first column if `ref` is omitted. The `criteria` as well as the `meanvec` arguments must indicate the criteria for all the traits involved in both maximization and the constraints. This example requests groups with 12 to 20 genotypes. ```{r} # Using rmaxa() max_adm_gain <- rmaxa( traits = mytraits, clmin = 12, clmax = 20, meanvec = mymeanvec, data = Gouveio ) # Results max_adm_gain ``` As in the `rmaxp()` gain output, one column per trait is shown; however, the gain reported in each column corresponds to an independent solution. For example, the group of 20 genotypes achieving a gain of 26.2% for *yd* is not composed of the same genotypes as the group achieving a gain of 5.7% for *pa*. Furthermore, the gains obtained with `rmaxa()` for any trait are always less than or equal to those obtained with `rmaxp()`, since `rmaxa()` incorporates constraints during optimization. Regarding the `selected` object, due to the constraints applied in this function, the genotypes composing one group differ from those in other groups. Consequently, there is one selected object per trait, named as `selected_`. # Conclusion This package offers a ready-to-use implementation of the integer programming method for maximizing the predicted genetic gains in polyclonal selection developed by Surgy et al. (2025). It allowing breeders to efficiently and consistently perform group-based multi-trait selection, which supports a balanced genetic gain across multiple traits in breeding programs. A dataset and examples are provided to assist in understanding and reproducing results. # References Surgy, S., Cadima, J. & Gonçalves, E., 2025. Integer programming as a powerful tool for polyclonal selection in ancient grapevine varieties. Theor Appl Genet 138, 122. [https://doi.org/10.1007/s00122-025-04885-0](https://doi.org/10.1007/s00122-025-04885-0)