--- title: "Stochastic population forecasting" author: Rob J Hyndman bibliography: refs.bib output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Stochastic population forecasting} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( eval = nzchar(Sys.getenv("VIGNETTES")), # Only compile locally collapse = TRUE, comment = "#>", fig.width = 8, fig.height = 5, out.width = "100%" ) # Okabe-Ito colours for discrete scales options( ggplot2.discrete.colour = c("#D55E00", "#0072B2", "#009E73", "#CC79A7", "#E69F00", "#56B4E9", "#F0E442"), ggplot2.discrete.fill = c("#D55E00", "#0072B2", "#009E73", "#CC79A7", "#E69F00", "#56B4E9", "#F0E442") ) ``` ```{r setup, message = FALSE, warning = FALSE} library(vital) library(fable) library(dplyr) library(ggplot2) set.seed(2025) ``` The vital package can be used for stochastic population forecasting with coherent components. This is based on the papers by @HB08 and @hby. Following @HB08, we use the following demographic growth-balance equations: $$ P_t(x) = P_{t-1}(x) + B_t(x) - D_t(x) + G_t(x) $$ where * $P_t(x)$ is the population at time $t$ and age $x$, * $B_t(x)$ is the number of births at time $t$ and age $x$, * $D_t(x)$ is the number of deaths at time $t$ and age $x$, and * $G_t(x)$ is the number of net migrants at time $t$ and age $x$. There are slightly different equations for handling the upper age group, and for baby migrants. See @HB08 for details. To simulate future births, deaths and net migrants, we develop three functional data models for fertility, mortality and migration. The models for mortality and migration use coherent components, so that the rates for males and females do not diverge over time. The following example uses Norwegian data up to 2022, and produces simulated populations for the ten future years. Different models are used for each component to demonstrate the flexibility of the package, but other models can be used as well. ## Mortality model We use a coherent functional data model [@hby] for the log mortality rates. To ensure coherence, we compute the geometric mean of the sex-specific mortality rates and the corresponding ratios, using the `make_pr()` function. The data are smoothed first, and 6 components are used by default in each FDM, although only the first two are plotted here. ```{r mortality, fig.cap="", fig.alt="First two components of the functional data model for mortality rates."} fit_mortality <- norway_mortality |> filter(Sex != "Total") |> smooth_mortality(Mortality) |> make_pr(.smooth) |> model(fdm = FDM(log(.smooth), coherent = TRUE)) autoplot(fit_mortality, 2) ``` ## Fertility model For fertility, we use a functional mean model with a square root transformation, applied to the last 13 years of data. The plotted model shows the fitted values on the square root scale. ```{r fertility, fig.cap="", fig.alt = "Fitted values of the functional mean model for fertility rates."} fit_fertility <- norway_fertility |> filter(Year > 2010) |> smooth_fertility(Fertility) |> model(fmean = FMEAN(sqrt(.smooth))) autoplot(fit_fertility) ``` ## Net migration model For net migration, we use a coherent functional data model. Because net migration values can be positive or negative, we can't take products and ratios. Instead, we need to compute the means and corresponding differences using the `mean_sd()` function. ```{r migration, fig.cap="", fig.alt="First two components of the functional data model for net migration."} netmig <- net_migration( norway_mortality |> filter(Sex != "Total"), norway_births ) |> make_sd(NetMigration) fit_migration <- netmig |> model(fdm = FDM(NetMigration, coherent = TRUE)) autoplot(fit_migration) ``` ## Population simulation The `generate_population()` function takes a starting population, and the three component models, and simulates future age-sex-specific population values. Here we produce ten replicates of the future population. ```{r simulation} pop <- norway_mortality |> filter(Sex != "Total", Year == max(Year)) future <- generate_population( starting_population = pop, mortality_model = fit_mortality, fertility_model = fit_fertility, migration_model = fit_migration, h = 10, n_reps = 500 ) ``` The first replicate is plotted below, along with the last few years of historical data. ```{r population_plot, fig.cap="", fig.alt="Simulated population for the first replicate."} future |> filter(.rep == "100") |> ggplot(aes(x = Age, y = Population, group = Year, color = Year)) + geom_line( data = norway_mortality |> filter(Year > 2010, Sex != "Total"), color = "grey", mapping = aes(group = Year) ) + geom_line() + scale_color_gradientn(colours = rainbow(10)[1:9]) + facet_grid(. ~ Sex) ``` The simulated populations can be used to compute any quantities that can be derived from populations numbers by sex and age. For example, the mean age of the population for the next 10 years ```{r mean_age} future |> group_by(Sex, .rep) |> summarise(mean_age = sum(Population * (Age + 0.5)) / sum(Population)) |> group_by(Sex) |> summarise(mean_age = mean(mean_age)) ``` We can also plot population pyramids with prediction intervals. For example, here is the population pyramid for 2032 with a 95% prediction interval. ```{r population_pyramid, fig.cap="", fig.alt="Population pyramid for 2032 with 95% prediction intervals."} pyramid_2032 <- future |> filter(Year == 2032) |> mutate(Population = if_else(Sex == "Female", -Population, Population)) |> group_by(Age, Sex) |> summarise( lo = quantile(Population, 0.025), med = quantile(Population, 0.5), hi = quantile(Population, 0.975) ) pyramid_2032 |> ggplot(aes(x = Age)) + geom_ribbon(aes(ymin = lo, ymax = hi, colour = NULL), fill = "#c14b14", alpha = 0.2 ) + geom_line(aes(y = med), color = "#c14b14") + facet_grid(. ~ Sex, scales = "free_x") + labs(y = "Population") + coord_flip() + guides(fill = "none", alpha = "none") ``` ## References