Title: | Multi-Environment Genomic Prediction with Penalized Factorial Regression |
---|---|
Description: | Multi-environment genomic prediction for training and test environments using penalized factorial regression. Predictions are made using genotype-specific environmental sensitivities as in Millet et al. (2019) <doi:10.1038/s41588-019-0414-y>. |
Authors: | Kruijer Willem [aut] |
Maintainer: | Bart-Jan van Rossum <[email protected]> |
License: | GPL-3 |
Version: | 1.0.0 |
Built: | 2025-02-17 02:43:25 UTC |
Source: | https://github.com/cran/factReg |
These datasets come from the European Union project DROPS (DROught-tolerant
yielding PlantS). A panel of 256 maize hybrids was grown with two water
regimes (irrigated or rainfed), in seven fields in 2012 and 2013,
respectively, spread along a climatic transect from western to eastern
Europe, plus one site in Chile in 2013. This resulted in 28 experiments
defined as the combination of one year, one site and one water regime, with
two and three repetitions for rainfed and irrigated treatments, respectively.
A detailed environmental characterisation was carried out, with hourly
records of micrometeorological data and soil water status, and associated
with precise measurement of phenology. Grain yield and its components were
measured at the end of the experiment.
The data sets contain the genotypic BLUEs for eight traits for 246 genotypes
in 37 environments. Additionally, information on 11 environmental indices
is included. The environments are split into three data sets, for training
(drops_GE) and testing (drops_GnE, drops_nGnE) purposes. drops_K contains
the kinship matrix for the 246 genotypes.
drops_GE drops_GnE drops_nGnE drops_K
data.frames with 24 variables.
experiments ID described by the three first letters of the city’s name followed by the year of experiment and the water regime with W for watered and R for rain-fed.
identifier of the genotype
genotypic mean for yield adjusted at 15% grain moisture, in ton per hectare (t ha^-1)
genotypic mean for number of grain per square meter
genotypic mean for individual grain weight in milligram (mg)
genotypic mean for male flowering (pollen shed), in thermal time cumulated since emergence (d_20°C)
genotypic mean for female flowering (silking emergence), in thermal time cumulated since emergence (d_20°C)
genotypic mean for plant height, from ground level to the base of the flag (highest) leaf, in centimeter (cm)
genotypic mean for plant height including tassel, from ground level to the highest point of the tassel in centimeter (cm)
genotypic mean for ear insertion height, from ground level to ligule of the highest ear leaf in centimeter (cm)
Tnight.Early
night temperature averaged between the floral transition and the silk initiation
Tnight.Flo
night temperature averaged between the silk initiation and the end of grain abortion
Tnight.Fill
night temperature averaged between the end of grain abortion and the physiological maturity of the grain
Ri.Early
intercepted radiation cumulated between the floral transition and the silk initiation
Ri.Flo
intercepted radiation cumulated between the silk initiation and the end of grain abortion
Ri.Fill
intercepted radiation cumulated between the end of grain abortion and the physiological maturity of the grain
Psi.Flo
soil water potential averaged between the silk initiation and the end of grain abortion. The soil water potential used here was the median between 30 and 60cm depth.
Psi.Fill
soil water potential averaged between the end of grain abortion and the physiological maturity of the grain
Tmax.Early
maximum temperature averaged between the floral transition and the silk initiation
Tmax.Flo
maximum temperature averaged between the silk initiation and the end of grain abortion
Tmax.Fill
maximum temperature averaged between the end of grain abortion and the physiological maturity of the grain
type
code corresponding to the data set. GE for drops_GE, GnE for drops_GnE and nGnE for drops_nGnE.
An object of class data.frame with 384 rows and 24 columns.
An object of class data.frame with 224 rows and 24 columns.
An object of class matrix (inherits from array) with 302 rows and 302 columns.
Millet, E. J., Pommier, C., et al. (2019). A multi-site experiment in a network of European fields for assessing the maize yield response to environmental scenarios (Data set). doi:10.15454/IASSTN
Based on multi-environment field trials, fits the factorial regression model \(Y_{ij} = \mu + e_j + g_i + \sum_{k=1}^s \beta_{ik} x_{ijk} + \epsilon_{ij},\) with environmental main effects \(e_j\), genotypic main effects \(g_{i}\) and genotype-specific environmental sensitivities \(\beta_{ik}\). See e.g. Millet et al. (2019) and Bustos-Korts et al. (2019). There are \(s\) environmental indices, with \(x_{ijk}\) the value of index \(k\) for genotype \(i\) in environment \(j\). Optionally, predictions can be made for a set of test environments for which environmental indices are available. The new environments must contain the same set of genotypes, or a subset.
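The model equation can be made concrete with a small simulation. The sketch below is not the package's implementation: it uses base R `lm` without any penalty, it assumes the indices are constant within an environment, and all object names and simulated values (`nG`, `nE`, `beta`, ...) are made up for illustration. It builds noise-free data from the model and checks that a least-squares fit with genotype-specific slopes reproduces them.

```r
set.seed(1)
nG <- 5   # number of genotypes (hypothetical)
nE <- 6   # number of environments (hypothetical)
s  <- 2   # number of environmental indices (hypothetical)

geno <- paste0("G", seq_len(nG))
env  <- paste0("E", seq_len(nE))

# One row per environment, one column per index.
x <- matrix(rnorm(nE * s), nE, s, dimnames = list(env, c("x1", "x2")))

mu   <- 10                             # overall mean
e    <- rnorm(nE)                      # environmental main effects e_j
g    <- rnorm(nG)                      # genotypic main effects g_i
beta <- matrix(rnorm(nG * s), nG, s)   # sensitivities beta_ik

dat <- expand.grid(G = geno, E = env, stringsAsFactors = TRUE)
i <- as.integer(dat$G)
j <- as.integer(dat$E)
dat$x1 <- x[j, "x1"]
dat$x2 <- x[j, "x2"]

# Noise-free response: mu + e_j + g_i + sum_k beta_ik * x_jk.
dat$Y <- mu + e[j] + g[i] +
  rowSums(beta[i, , drop = FALSE] * x[j, , drop = FALSE])

# Unpenalized fit with genotype-specific slopes for each index.
# The design is rank deficient (the average slope is absorbed by E),
# so some coefficients are NA, but the fitted values are still unique.
fit <- lm(Y ~ E + G + G:x1 + G:x2, data = dat)
maxAbsResidual <- max(abs(residuals(fit)))
```

Because the simulated data contain no noise, `maxAbsResidual` is zero up to numerical precision. The individual parameters are only identified up to the usual constraints, which is one reason penalization and the choice of `penG` and `penE` matter in practice.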
Penalization: the model above is fitted using glmnet, simultaneously
penalizing \(e_j\), \(g_i\) and \(\beta_{ik}\). If penG = 0 and penE = 0,
the main effects \(g_i\) and \(e_j\) are not penalized. If these
parameters are 1, the main effects are penalized to the same degree as
the sensitivities. Any non-negative values are allowed. Cross-validation is
performed with each fold containing a number of environments (details below).
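One way to picture the role of penG and penE is as per-column penalty weights. The sketch below is an assumption about how such weights could be set up, not a description of the package internals: it builds a design matrix in base R and a weight vector with value `penE` for environment columns, `penG` for genotype columns and 1 for the sensitivities. The toy data and the names `envIndex`, `term` and `penaltyFactor` are hypothetical.

```r
nG <- 4
nE <- 3
dat <- expand.grid(G = factor(paste0("G", seq_len(nG))),
                   E = factor(paste0("E", seq_len(nE))))

# Hypothetical index values, one per environment.
envIndex <- data.frame(x1 = c(0.2, 1.1, -0.4), x2 = c(3, 1, 2))
dat$x1 <- envIndex$x1[as.integer(dat$E)]
dat$x2 <- envIndex$x2[as.integer(dat$E)]

# Design matrix; "assign" records which model term each column belongs to
# (1 = E, 2 = G, 3 = G:x1, 4 = G:x2). The intercept column is dropped.
Xfull <- model.matrix(~ E + G + G:x1 + G:x2, data = dat)
term  <- attr(Xfull, "assign")[-1]
X     <- Xfull[, -1, drop = FALSE]

penE <- 0   # environmental main effects unpenalized
penG <- 0   # genotypic main effects unpenalized

# Weight 1 for sensitivities, penE / penG for the main effects.
penaltyFactor <- ifelse(term == 1, penE, ifelse(term == 2, penG, 1))
names(penaltyFactor) <- colnames(X)
```

A vector like `penaltyFactor` could then be supplied as the `penalty.factor` argument of `glmnet::glmnet`, together with `alpha` and `lambda`. With both values set to 1, every column receives the same weight, which matches the statement that main effects are then penalized to the same degree as the sensitivities.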
After fitting the model, it is possible to replace the estimated
genotypic main effects and sensitivities by their predicted genetic values.
Specifically, if a kinship matrix K is assigned, the function performs
genomic prediction with g-BLUP for the genotypic main effect and each of
the sensitivities in turn.
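The g-BLUP step can be sketched in a few lines of base R. This is a simplified illustration under stated assumptions, not the package's code: the heritability `h2` is treated as known rather than estimated, the mean is estimated by a simple average, and the function name `gblup` and its arguments are hypothetical. The input `y` stands for one vector of estimated effects (the genotypic main effects, or the sensitivities for one index), ordered like the rows of `K`.

```r
# Minimal g-BLUP sketch with a fixed, assumed heritability h2.
gblup <- function(y, K, h2 = 0.5) {
  stopifnot(length(y) == nrow(K), nrow(K) == ncol(K), h2 > 0, h2 < 1)
  n <- length(y)
  # Covariance of y up to a constant: h2 * K + (1 - h2) * I.
  V <- h2 * K + (1 - h2) * diag(n)
  # Predicted genetic values: h2 * K %*% V^{-1} %*% (y - mean(y)).
  drop(h2 * K %*% solve(V, y - mean(y)))
}

# Toy example: four unrelated genotypes (identity kinship).
K    <- diag(4)
gHat <- c(1.0, -0.5, 0.3, -0.8)
gPred <- gblup(gHat, K, h2 = 0.5)
```

With an identity kinship matrix the predictions are simply the centered estimates shrunk by the factor `h2`. With a real kinship matrix, information is also borrowed from related genotypes.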
Predictions for the test environments are first constructed using the estimated genotypic main effects and sensitivities; next, predicted environmental main effects are added. The latter are obtained by regressing the estimated environmental main effects for the training environments on the average values of the indices in these environments, as in Millet et al. 2019.
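The last step, predicting environmental main effects for new environments, can be illustrated as follows. The sketch uses ordinary least squares via `lm` as a stand-in (the returned value described further down mentions penalized regression), and the data frames `trainIdx` and `testIdx` with their values are invented for the example.

```r
# Average index values per training environment (hypothetical numbers).
trainIdx <- data.frame(x1 = c(-1, 0, 1, 2, 0.5),
                       x2 = c(0.3, -0.2, 1.0, 0.1, -0.7))

# Estimated environmental main effects for the training environments.
# Generated here from a known linear rule so the result can be checked.
trainIdx$eHat <- 2 + 1.5 * trainIdx$x1 - 0.5 * trainIdx$x2

# Regress the estimated main effects on the averaged indices.
fitE <- lm(eHat ~ x1 + x2, data = trainIdx)

# Average index values in the test environments.
testIdx <- data.frame(x1 = c(0.2, 1.5), x2 = c(0.4, -0.1))

# Predicted environmental main effects for the test environments.
eTestPred <- predict(fitE, newdata = testIdx)
```

These predicted main effects would then be added to the genotype-specific part of the prediction (genotypic main effect plus sensitivities times indices) to obtain predictions on the original scale.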
GnE( dat, Y, G, E, K = NULL, indices = NULL, indicesData = NULL, testEnv = NULL, weight = NULL, outputFile = NULL, corType = c("pearson", "spearman"), partition = data.frame(), nfolds = 10, alpha = 1, lambda = NULL, penG = 0, penE = 0, scaling = c("train", "all", "no"), quadratic = FALSE, verbose = FALSE )
dat |
A |
Y |
The trait to be analyzed: either of type character, in which case
it should be one of the column names in |
G |
The column in |
E |
The column in |
K |
A kinship matrix. Used for replacing the estimated genotypic main
effect and each of the sensitivities by genomic prediction from a g-BLUP
model for each of them. If |
indices |
The columns in |
indicesData |
An optional |
testEnv |
vector (character). Data from these environments are not used
for fitting the model. Accuracy is evaluated for training and test
environments separately. The default is |
weight |
Numeric vector of length |
outputFile |
The file name of the output files, without .csv extension
which is added by the function. If not |
corType |
type of correlation: Pearson (default) or Spearman rank sum. |
partition |
|
nfolds |
Default |
alpha |
Type of penalty, as in glmnet (1 = LASSO, 0 = ridge; in between = elastic net). Default is 1. |
lambda |
Numeric vector; defines the grid over which the penalty lambda is optimized in cross validation. Default: NULL (defined by glmnet). Important special case: lambda = 0 (no penalty). |
penG |
numeric; default 0. If 1, genotypic main effects are penalized. If 0, they are not. Any non negative real number is allowed. |
penE |
numeric; default 0. If 1, environmental main effects are penalized. If 0, they are not. Any non negative real number is allowed. |
scaling |
determines how the environmental variables are scaled. "train": all data (test and training environments) are scaled using the mean and standard deviation in the training environments. "all": using the mean and standard deviation of all environments. "no": no scaling. |
quadratic |
boolean; default |
verbose |
boolean; default |
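The "train" option of scaling can be illustrated with base R. The sketch below shows the idea only, assuming one row of index values per environment; the data frame `indices`, its values and the helper names are hypothetical, and the package may organize this step differently.

```r
# Hypothetical index values, one row per environment.
indices <- data.frame(
  E  = c("E1", "E2", "E3", "E4", "E5"),
  x1 = c(10, 12, 14, 20, 8),
  x2 = c(1.0, 0.5, 1.5, 2.0, 0.0)
)
testEnv <- c("E4", "E5")
isTrain <- !(indices$E %in% testEnv)
cols    <- c("x1", "x2")

# Mean and standard deviation computed on the training environments only.
ctr <- colMeans(indices[isTrain, cols])
sdv <- apply(indices[isTrain, cols], 2, sd)

# Apply the training-based centering and scaling to all environments.
scaled <- indices
scaled[cols] <- as.data.frame(scale(indices[, cols], center = ctr, scale = sdv))
```

Scaling with training statistics keeps information from the test environments out of the fitted model, at the cost of test values that need not have mean zero and unit variance.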
A list with the following elements:
A data.frame with predictions for the training set
A data.frame with predictions for the test set
A data.frame with residuals for the training set
A data.frame with residuals for the test set
the estimated overall mean
The estimated environmental main effects, and the predicted effects, obtained when the former are regressed on the averaged indices, using penalized regression
The predicted environmental main effects for the test environments, obtained from penalized regression using the estimated main effects for the training environments and the averaged indices
data.frame containing the estimated genotypic main effects (first column) and sensitivities (subsequent columns)
a data.frame with the accuracy (r) for each training environment, as well as the root mean square error (RMSE), mean absolute deviation (MAD) and rank (the latter is a proportion: how many of the best 5 genotypes are in the top 10). To be removed or further developed. All these quantities are also evaluated for a model with only genotypic and environmental main effects (columns rMain, RMSEMain and rankMain)
A data.frame with the accuracy for each test environment, with the same columns as trainAccuracyEnv
a data.frame with the accuracy (r) for each genotype, averaged over the training environments
a data.frame with the accuracy (r) for each genotype, averaged over the test environments
The value of lambda selected using cross validation
The values of lambda used in the fits of glmnet. If lambda was provided as input, the value of lambda is returned.
The root mean squared error on the training environments
The root mean squared error on the test environments
The name of the trait that was predicted, i.e. the column name in dat that was used
The genotype label that was used, i.e. the argument G that was used
The environment label that was used, i.e. the argument E that was used
The indices that were used, i.e. the argument indices that was used
The quadratic option that was used
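The per-environment accuracy measures listed above (r, RMSE and MAD) can be computed from observed and predicted values as in the following sketch. It is an illustration in base R with invented numbers, not the package's code, and the column names `obs` and `pred` are hypothetical.

```r
# Hypothetical observed and predicted values in two environments.
res <- data.frame(
  E    = rep(c("E1", "E2"), each = 4),
  obs  = c(1, 2, 3, 4, 5, 6, 7, 8),
  pred = c(2, 3, 4, 5, 8, 7, 6, 5)
)

# One row per environment with correlation, RMSE and MAD.
accuracyByEnv <- do.call(rbind, lapply(split(res, res$E), function(d) {
  data.frame(
    E    = d$E[1],
    r    = cor(d$obs, d$pred, method = "pearson"),
    RMSE = sqrt(mean((d$obs - d$pred)^2)),
    MAD  = mean(abs(d$obs - d$pred))
  )
}))
```

In environment E1 the predictions are offset by a constant, so r equals 1 while RMSE and MAD equal 1, which shows why the correlation alone does not capture bias in the predicted level.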
Millet, E.J., Kruijer, W., Coupel-Ledru, A. et al. Genomic prediction of maize yield across European environmental conditions. Nat Genet 51, 952–956 (2019). doi:10.1038/s41588-019-0414-y
    ## Load the data, which are contained in the package.
    data(drops_GE)
    data(drops_GnE)

    ## We remove identifiers that we don't need.
    drops_GE_GnE <- rbind(drops_GE[, -c(2, 4)], drops_GnE[, -c(2, 4)])

    ## Define indices.
    ind <- colnames(drops_GE)[13:23]

    ## Define test environments.
    testenv <- levels(drops_GnE$Experiment)

    ## Additive model, only main effects (set the penalty parameter to a large value).
    Additive_model <- GnE(drops_GE_GnE, Y = "grain.yield", lambda = 100000,
                          G = "Variety_ID", E = "Experiment", testEnv = testenv,
                          indices = ind, penG = FALSE, penE = FALSE,
                          alpha = 0.5, scaling = "train")

    ## Full model, no penalization (set the penalty parameter to zero).
    Full_model <- GnE(drops_GE_GnE, Y = "grain.yield", lambda = 0,
                      G = "Variety_ID", E = "Experiment", testEnv = testenv,
                      indices = ind, penG = FALSE, penE = FALSE,
                      alpha = 0.5, scaling = "train")

    ## Elastic Net model, set alpha parameter to 0.5.
    Elnet_model <- GnE(drops_GE_GnE, Y = "grain.yield", lambda = NULL,
                       G = "Variety_ID", E = "Experiment", testEnv = testenv,
                       indices = ind, penG = FALSE, penE = FALSE,
                       alpha = 0.5, scaling = "train")

    ## Lasso model, set alpha parameter to 1.
    Lasso_model <- GnE(drops_GE_GnE, Y = "grain.yield", lambda = NULL,
                       G = "Variety_ID", E = "Experiment", testEnv = testenv,
                       indices = ind, penG = FALSE, penE = FALSE,
                       alpha = 1, scaling = "train")

    ## Ridge model, set alpha parameter to 0.
    Ridge_model <- GnE(drops_GE_GnE, Y = "grain.yield", lambda = NULL,
                       G = "Variety_ID", E = "Experiment", testEnv = testenv,
                       indices = ind, penG = FALSE, penE = FALSE,
                       alpha = 0, scaling = "train")
.... These models can be fitted either for the original data, or on the residuals of a model with only main effects.
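Fitting genotype-specific regressions on the residuals of a main-effects model can be sketched as below. This is an illustration under assumptions: it uses unpenalized `lm` fits where the package offers penalized ones (see `alpha`), it uses a single index that is constant within each environment, and all names and simulated values are hypothetical.

```r
set.seed(2)
nG <- 3
nE <- 8
dat <- expand.grid(G = factor(paste0("G", seq_len(nG))),
                   E = factor(paste0("E", seq_len(nE))))

# One hypothetical index value per environment.
x1env  <- c(-1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2)
dat$x1 <- x1env[as.integer(dat$E)]

# True genotype-specific sensitivities (they average to zero here).
slopes <- c(-1, 0, 1)
dat$Y  <- 5 + slopes[as.integer(dat$G)] * dat$x1 +
  rnorm(nrow(dat), sd = 0.1)

# Step 1: model with only environmental and genotypic main effects.
mainFit <- lm(Y ~ E + G, data = dat)
dat$res <- residuals(mainFit)

# Step 2: one regression of the residuals on the index per genotype.
sens <- sapply(split(dat, dat$G), function(d) {
  coef(lm(res ~ x1, data = d))[["x1"]]
})
```

The estimated slopes in `sens` are close to the simulated sensitivities. Working on residuals separates the interaction signal from the main effects before the per-genotype fits.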
perGeno( dat, Y, G, E, indices = NULL, indicesData = NULL, testEnv = NULL, weight = NULL, useRes = TRUE, outputFile = NULL, corType = c("pearson", "spearman"), partition = data.frame(), nfolds = 10, alpha = 1, scaling = c("train", "all", "no"), quadratic = FALSE, verbose = FALSE )
dat |
A |
Y |
The trait to be analyzed: either of type character, in which case
it should be one of the column names in |
G |
The column in |
E |
The column in |
indices |
The columns in |
indicesData |
An optional |
testEnv |
vector (character). Data from these environments are not used
for fitting the model. Accuracy is evaluated for training and test
environments separately. The default is |
weight |
Numeric vector of length |
useRes |
Indicates whether the genotype-specific regressions are to be
fitted on the residuals of a model with main effects. If |
outputFile |
The file name of the output files, without .csv extension
which is added by the function. If not |
corType |
type of correlation: Pearson (default) or Spearman rank sum. |
partition |
|
nfolds |
Default |
alpha |
Type of penalty, as in glmnet (1 = LASSO, 0 = ridge; in between = elastic net). Default is 1. |
scaling |
determines how the environmental variables are scaled. "train": all data (test and training environments) are scaled using the mean and standard deviation in the training environments. "all": using the mean and standard deviation of all environments. "no": no scaling. |
quadratic |
boolean; default |
verbose |
boolean; default |
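Cross-validation folds that each contain whole environments can be constructed as in the sketch below. The layout of the resulting data frame, including the column names `E` and `partition`, is an assumption made for illustration and is not taken from the argument descriptions above.

```r
set.seed(3)
trainEnv <- paste0("E", 1:12)   # hypothetical training environments
nfolds   <- 4

# Assign each environment to exactly one fold, with balanced fold sizes.
fold <- sample(rep(seq_len(nfolds), length.out = length(trainEnv)))

# Hypothetical layout: one row per environment with its fold number.
foldTable <- data.frame(E = trainEnv, partition = fold)

# Environments held out together in the first fold.
heldOut <- foldTable$E[foldTable$partition == 1]
```

Leaving out entire environments, rather than individual observations, makes the cross-validation error reflect prediction for environments that were not seen during fitting.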
A list with the following elements:
Vector with predictions for the training set (to do: Add the factors genotype and environment; make a data.frame)
Vector with predictions for the test set (to do: Add the factors genotype and environment; make a data.frame). To do: add estimated environmental main effects, not only predicted environmental main effects
the estimated overall (grand) mean
The estimated environmental main effects, and the predicted effects, obtained when the former are regressed on the averaged indices, using penalized regression.
The predicted environmental main effects for the test environments, obtained from penalized regression using the estimated main effects for the training environments and the averaged indices.
data.frame containing the estimated genotypic main effects (first column) and sensitivities (subsequent columns)
a data.frame with the accuracy (r) for each test environment
a data.frame with the accuracy (r) for each training environment
a data.frame with the accuracy (r) for each genotype, averaged over the training environments
a data.frame with the accuracy (r) for each genotype, averaged over the test environments
The root mean squared error on the training environments
The root mean squared error on the test environments
The name of the trait that was predicted, i.e. the column name in dat that was used
The genotype label that was used, i.e. the argument G that was used
The environment label that was used, i.e. the argument E that was used
The indices that were used, i.e. the argument indices that was used
The quadratic option that was used