7. How can the model be evaluated?

To evaluate the performance of migration forecasting models and to adapt them to evaluation results, the following framework can be applied, addressing robustness, metrics and back-testing approaches. In the literature, the precise definitions of the related terms vary, but in this section, evaluation is taken to encompass the group of processes ensuring the quality of the forecasting models. It includes validation of model performance – here understood as testing the alignment of model results with observed data – as well as calibration. In this particular context, calibration looks at the error measures (predictive intervals) and how they correspond to the relative frequencies of observations of different magnitudes seen in the data series. Machine learning models, like statistical models, learn their parameters from historical data. While traditional statistical models are usually parsimonious in the number of parameters, machine learning models can encompass many. Both cases carry a risk of over-fitting, but the risk can be much higher for some classes of machine learning models, such as those based on deep learning techniques. In light of this risk, performance evaluation and back-testing should be applied to both classes of models.

How to validate model performance?
The forecasting models, once built and estimated, need to have their performance validated. A number of tools (Box 7.1) and methods can be used to do so. A standard approach is to look at past errors, which can be compared to some benchmark, for example a constant forecast or one obtained by running a simple extrapolation, such as exponential smoothing or a basic ARIMA model, which relates the current magnitude of the process being forecast to its past. The exercise typically involves running the models and the benchmarks on a shortened time series of data, for example setting aside the last five observations, and checking how well the forecasts were able to “predict” the data points that were set aside. This is known as validating the models ex post, or out of sample. Indeed, the data that are set aside are not part of the sample and do not inform the model, so they can be used for independent validation. This differs from ex ante, or in-sample, validation, which looks at how well the model fits the specific sample of data, and which can be measured by, for example, the coefficient of determination (R²), various information criteria (AIC, BIC), and so on. For a given model, the ex ante approach allows one to approximate the errors (how much the actual data differ from the predictions) based only on the data sample. However, ex ante validation does not constitute, per se, proof of the good predictive power of the model.
In turn, ex post validation relies on a range of error measures: the lower the error, the better.
Traditional Metrics: Hyndman and Athanasopoulos (2018[1]) give a basic introduction to Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and other traditional metrics used in forecasting. RMSE and MAE evaluate the quadratic and absolute distance, respectively, between the actual value of the target variable and its forecasts. RMSE or MAE are used for continuous migration flow forecasts. F1 Score, Precision, and Recall are particularly useful for “classification-based” migration forecasting. For example, a model that predicts “high”, “medium” and “low” migration flows rather than volumes of migration is a classification-based model. “Precision” measures how often the model’s forecasts for a class are correct, while “recall” checks how well the model captures all actual instances of that class. The F1 score balances both, ensuring accuracy and completeness. In “multiclass classification”, these metrics are calculated for each class and then averaged for overall performance.
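As an illustration, the following minimal sketch (assuming scikit-learn is available) computes precision, recall and the F1 score for a hypothetical “low”/“medium”/“high” migration classification; the labels are purely illustrative, and macro averaging gives each class equal weight.

```python
# A minimal sketch of classification metrics for a "low"/"medium"/"high"
# migration forecast, using scikit-learn; the labels below are illustrative.
from sklearn.metrics import precision_score, recall_score, f1_score

actual    = ["low", "high", "medium", "high", "low", "medium", "high", "low"]
predicted = ["low", "medium", "medium", "high", "low", "low", "high", "low"]

# Macro averaging computes each metric per class and then takes the mean,
# giving every class equal weight regardless of how often it occurs.
precision = precision_score(actual, predicted, average="macro")
recall    = recall_score(actual, predicted, average="macro")
f1        = f1_score(actual, predicted, average="macro")

print(f"Precision: {precision:.2f}  Recall: {recall:.2f}  F1: {f1:.2f}")
```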
Time Series-Specific Metrics: Hyndman and Koehler (2006[2]) introduce the Mean Absolute Scaled Error (MASE) and compare it with other metrics, such as the MAPE. The Mean Absolute Percentage Error (MAPE) is commonly used for assessing forecast accuracy relative to observed values; the MASE is particularly useful when comparing across models or scales. These statistics are similar to the MAE but express the error in relative or standardised terms.
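The sketch below illustrates how these point-forecast error measures could be computed with NumPy on a toy series; the figures are illustrative, and the MASE scaling follows the in-sample naive (last-value) forecast used by Hyndman and Koehler (2006[2]).

```python
# A minimal sketch of point-forecast error measures with NumPy.
# MASE scales the MAE by the in-sample MAE of a naive (last-value) forecast;
# all numbers are illustrative.
import numpy as np

train    = np.array([110., 120., 135., 150., 160., 180.])  # historical flows
actual   = np.array([190., 205., 215.])                     # held-out values
forecast = np.array([185., 195., 225.])                     # model forecasts

errors = actual - forecast
rmse = np.sqrt(np.mean(errors ** 2))
mae  = np.mean(np.abs(errors))
mape = np.mean(np.abs(errors / actual)) * 100               # in per cent

# Scaling factor: mean absolute error of the naive forecast on the training data
naive_mae = np.mean(np.abs(np.diff(train)))
mase = mae / naive_mae

print(f"RMSE={rmse:.1f}  MAE={mae:.1f}  MAPE={mape:.1f}%  MASE={mase:.2f}")
```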
Causal Validation: Runge et al. (2023[3]) discuss causal validation methods, particularly for dynamic and time‑series data in the context of causal forecasting. Comparing predicted impacts of interventions (e.g. policy changes) against known outcomes tests whether the causal assumptions hold. This approach is applicable only when one or more intervention periods are well separated, whereas in many situations interventions are staggered.
Ex post errors can provide very useful information on whether there is a systematic bias in the forecasts, by how much the models miss the subsequent real-life developments, and, crucially, whether the simple benchmarks could have done a better job. In fact, the comparative aspect of forecast errors is one of the foundations of validating a range of models and choosing the best one. Ideally, evaluating forecasts should involve both aspects: presenting the ex ante errors or associated measures to demonstrate that the model fits the data reasonably well, and the ex post errors to show its predictive capabilities.
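A minimal sketch of this ex post exercise is given below, assuming statsmodels is available: the last five observations of a toy series are set aside, an ARIMA model is fitted to the remainder, and its out-of-sample errors are compared to a constant (last observed value) benchmark.

```python
# A minimal sketch of ex post (out-of-sample) validation: the last five
# observations are set aside, an ARIMA model (statsmodels, assumed available)
# is fitted to the rest, and its errors are compared to a naive benchmark.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

series = np.array([ 95., 102., 110., 118., 121., 130., 138., 149.,
                   155., 162., 171., 180., 192., 199., 210.])
train, holdout = series[:-5], series[-5:]

model_forecast = ARIMA(train, order=(1, 1, 1)).fit().forecast(steps=5)
naive_forecast = np.repeat(train[-1], 5)   # constant (last observed value) benchmark

model_mae = np.mean(np.abs(holdout - model_forecast))
naive_mae = np.mean(np.abs(holdout - naive_forecast))

# The model only adds value if it beats the simple benchmark out of sample.
print(f"Model MAE: {model_mae:.1f}   Naive benchmark MAE: {naive_mae:.1f}")
```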
Box 7.1. An open-source toolkit to validate forecasting models: The SEAVEA project
Research teams from Brunel University London and University College London have, since 2021, developed an open-source toolkit dedicated to VVUQ (Verification, Validation and Uncertainty Quantification). The SEAVEA toolkit (Software Environment for Scalable & VVUQ-evaluated Exascale Applications) includes model verification (confirmation that the mathematical model and corresponding algorithm have been coded correctly), validation (of results compared to qualitative and quantitative measures which apply a validation metric) and uncertainty quantification (understanding the origins of, and assessing the magnitudes of, the errors which accompany computer simulations, whether epistemic or aleatoric).
Although originally developed for highly intensive exascale computer-based simulations, the SEAVEA Toolkit supports all fields of work that require event simulations to make predictions for decision making, such as fusion energy, climate science, epidemiology, medicine, aerospace, and migration. It therefore fits exascale computing models as well as models using much smaller computing resources and smaller numbers of data points, such as those usually run for migration forecasts. SEAVEA includes a set of interoperable and advanced components (e.g. EasyVVUQ) designed to help modellers make their simulations more reproducible, reliable and scientifically credible, ultimately minimising the expense and time needed to perform calculations. As all tools are open source and have their own GitHub repositories, SEAVEA allows anyone to propose contributions and modifications to the toolkit.
Checklist:
Have I assessed the model’s predictive accuracy using standard metrics?
Use indicators such as RMSE (Root Mean Squared Error) and MAE (Mean Absolute Error) to evaluate forecast performance.
Have I conducted an ex-post validation?
Run the model and relevant benchmarks on a shortened time series to assess how well it predicts out-of-sample data.
Have I conducted an ex-ante validation?
Use measures such as R² (coefficient of determination) and AIC (Akaike Information Criterion) to test how well the model fits the available data.
Have I interpreted both predictive power and model fit appropriately?
Both ex post and ex ante assessments are needed to demonstrate the model’s overall robustness and reliability.
How to calibrate the model?
For probabilistic forecasting models, another important aspect of evaluation is the calibration of predictive intervals. The aim here is to check whether the probabilistic measures produced by the model are reasonable, neither too narrow nor too wide. This is done by checking whether the predictive intervals produced by the model ex ante are broadly in line with those obtained ex post, based on the data that have been set aside. Typically, the 50‑per cent predictive intervals produced by the model should cover the future values roughly 50% of the time, 80‑per cent intervals roughly 80% of the time, and so on. One simple way of doing that is by calculating the empirical frequencies of instances for which the real values fall inside (or outside) various predictive intervals (see Gneiting, Balabdaoui and Raftery (2007[4]) and Czado, Gneiting and Held (2009[5]), who refer to this approach as marginal calibration). For instance, if we set aside ten observations to carry out an out-of-sample forecast assessment, for well-calibrated forecasts we would expect five out of ten of these observations to fall within the 50‑per cent predictive intervals, and eight out of ten within the 80‑per cent ones.
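The following minimal sketch illustrates such a marginal calibration check with NumPy: the empirical coverage of hypothetical 50‑per cent and 80‑per cent predictive intervals is computed over ten held-out observations (all figures are illustrative).

```python
# A minimal sketch of marginal calibration: empirical coverage of the 50% and
# 80% predictive intervals over ten held-out observations (illustrative numbers).
import numpy as np

observed = np.array([120., 135., 150., 141., 160., 172., 158., 180., 190., 175.])
# Hypothetical interval bounds produced by a probabilistic model, one value per horizon
lower50  = np.array([115., 128., 140., 138., 150., 160., 155., 170., 178., 168.])
upper50  = np.array([130., 142., 155., 150., 165., 175., 168., 185., 195., 182.])
lower80  = np.array([108., 120., 132., 130., 142., 152., 147., 162., 170., 160.])
upper80  = np.array([138., 150., 163., 158., 173., 183., 176., 193., 203., 190.])

cover50 = np.mean((observed >= lower50) & (observed <= upper50))
cover80 = np.mean((observed >= lower80) & (observed <= upper80))

# Well-calibrated forecasts should give empirical coverage close to the nominal level.
print(f"50% intervals cover {cover50:.0%} of observations (nominal 50%)")
print(f"80% intervals cover {cover80:.0%} of observations (nominal 80%)")
```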
Predictive intervals that are too wide compared to their nominal probabilities (for example, if 50‑per cent intervals cover 90% of observations) are too conservative and can lead to unnecessary hedging against some extreme possibilities, while intervals that are too narrow (for example, when 80‑per cent intervals only cover 25% of observations) are unrealistically optimistic and can lead to being underprepared. A related question is which intervals to choose for calibration. Given the paucity of migration data and the fact that the extremes of the distribution may be very difficult to estimate precisely, there is an argument for calibrating the 50‑ and 80‑per cent intervals (Bijak et al., 2019[6]) rather than, for example, 99‑per cent ones, although in practical applications the 95‑per cent intervals tend to be used in calibration exercises as well (see Bijak (2011[7]) or Azose and Raftery (2015[8])). Regardless of this choice, significantly miscalibrated forecasts require attention: they can indicate either that the chosen model is not appropriate for the data, which would necessitate specifying it again, or that the assumptions about the error term are too narrow or too wide. This would require revisiting the prior assumptions for the related model parameters, such as the variance of the error term, possibly within a broader framework of iterative expert elicitation, as discussed in Chapter 6.
Forecast calibration has an important place in the methodological literature on forecasting. Bröcker and Smith (2007[9]) explain the use of calibration curves and reliability assessments for probabilistic forecasting. Evaluating how well predicted probabilities align with observed outcomes requires the use of calibration curves or scores. Calibration curves show whether a model’s predicted probabilities align with actual outcomes, indicating whether it is over- or underconfident. Gneiting, Balabdaoui and Raftery (2007[4]) and Czado, Gneiting and Held (2009[5]) propose a set of different scoring rules for evaluating probabilistic forecasts with respect to both their sharpness and calibration. As an example, for predicting binary events, a simple scoring rule is the Brier score (Brier (1950[10]); see Tetlock and Gardner (2015[11])). The Brier score is the mean of the squared differences between the predicted probabilities and the actual binary outcomes (1 = the event happened, 0 = it did not): the lower the score, the better, and values below 0.25 (= 0.5², the score of an uninformative forecast that always assigns a probability of 0.5) show the advantage of a given model or approach. Gneiting and Raftery (2007[12]) also discuss the use of the Continuous Ranked Probability Score (CRPS) as a proper scoring rule to help analyse, for instance, how well the model predicts not just point estimates but entire distributions. The CRPS measures the accuracy of probabilistic forecasts by comparing the predicted cumulative distribution to the actual outcome, with lower scores indicating better predictions. In applied migration forecasting, assessing calibration is becoming a part of the standard evaluation toolkit (Bijak et al., 2019[6]; Welch and Raftery, 2022[13]).
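The sketch below illustrates both scores on toy data: the Brier score as the mean squared difference between predicted probabilities and binary outcomes, and the CRPS for an ensemble forecast computed via the standard identity CRPS = E|X − y| − ½ E|X − X′| (all figures are illustrative).

```python
# A minimal sketch of two scoring rules on illustrative numbers.
import numpy as np

# Brier score for a binary event (e.g. "flows exceed a given threshold")
prob_forecast = np.array([0.8, 0.3, 0.6, 0.9, 0.2])
outcome       = np.array([1,   0,   1,   1,   0  ])
brier = np.mean((prob_forecast - outcome) ** 2)

# CRPS for one observation and an ensemble of sampled forecast values,
# using CRPS = E|X - y| - 0.5 * E|X - X'|
ensemble = np.array([140., 155., 150., 162., 158., 147.])
y = 165.
crps = np.mean(np.abs(ensemble - y)) - 0.5 * np.mean(
    np.abs(ensemble[:, None] - ensemble[None, :]))

print(f"Brier score: {brier:.3f} (below 0.25 beats an uninformative 0.5 forecast)")
print(f"CRPS: {crps:.1f} (lower is better)")
```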
Checklist:
Have I included predictive intervals as part of the model evaluation?
It is recommended to calibrate both 50% and 80% predictive intervals to assess the model’s reliability.
Do the observed outcomes fall within the expected ranges while conducting out-of-sample forecast assessments?
For example, out of ten observations, approximately five should fall within the 50% interval and eight within the 80% interval.
Have I adjusted the model if the intervals are consistently too narrow or too wide?
Poor calibration may indicate that the model underestimates or overestimates uncertainty and may require refinement.
How to adapt the model based on evaluation results or as new data come in?
Error Diagnostics: Chatfield (2000[14]) discusses residual diagnostics and error decomposition for time‑series models. Examining residuals (i.e. the differences between predicted and actual values) is key to identifying systematic biases (e.g. underestimation during economic shocks or overfitting to seasonal migration patterns). This technique is inherited from standard linear regression analysis, but in the time-series context it may reveal additional patterns that an updated version of the model can take into account. Error decomposition techniques are particularly valuable here as they help to isolate issues such as trend misalignment or spurious seasonality.
Model Refinement: Zou and Hastie (2005[15]) and Carammia et al. (2022[16]) apply regularised machine learning methods and feature-importance extraction. Model refinement may require introducing new features (e.g. policy indices, environmental indicators) or reweighting features based on importance analysis (e.g. SHAP values). Models may also be regularised more aggressively (e.g. with the elastic net method) or simplified by reducing overcomplex architectures to curb overfitting. This refinement can be part of the model building in adaptive models, or a separate ex post step.
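As an illustration of regularised refinement, the minimal sketch below fits an elastic net (assuming scikit-learn is available) to synthetic data; the features and penalty settings are purely illustrative placeholders rather than a recommended specification.

```python
# A minimal sketch of regularised model refinement with an elastic net
# (scikit-learn, assumed available); features and data are purely illustrative.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))            # e.g. lagged flows, policy and economic indices
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=60)

# The l1_ratio mixes lasso (sparse feature selection) and ridge (shrinkage) penalties;
# larger alpha values shrink coefficients more aggressively to curb overfitting.
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

for i, coef in enumerate(model.coef_):
    print(f"feature {i}: coefficient {coef:+.2f}")
```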
Dynamic Adaptation: Bontempi et al. (2013[17]) introduce online learning techniques for evolving datasets. Online learning techniques are used to update models continuously as new data arrive. Other adaptation techniques incorporate scenario-specific tuning based on historical analogues.
While model refinement is used after a model has been fitted to the full historical dataset, i.e. when the assumption is that the model does not change over time, online learning and scenario-specific tuning are applied dynamically in adaptive models during the fitting. While Bontempi et al. (2013[17]) propose updating the model at each new observation, Carammia et al. (2022[16]) apply the idea to moving windows, assuming that migration processes change over time but have some persistence over short periods.
Checklist:
Has the model failed validation or shown signs of poor performance?
Examine residuals to detect patterns or structure that the model may have missed. Machine learning models can also be improved by introducing new features or adjusting the weights of existing ones.
Have I considered techniques for continuous model improvement?
Online learning methods allow models to be updated in real time as new data become available, even before formal re‑evaluation.
Have I documented changes to the model following refinement?
Keeping track of updates ensures transparency and supports reproducibility in the forecasting process.
How to assess robustness?
Robustness Testing: Barredo Arrieta et al. (2020[18]) review several robustness testing techniques, including adversarial testing, for AI models. These techniques include testing the sensitivity of predictions to noise, missing data or out-of-distribution inputs. Performing adversarial tests requires introducing controlled perturbations to inputs.
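A minimal sketch of such a sensitivity test is given below: a simple placeholder model (assuming scikit-learn is available) is applied to an input perturbed with small amounts of noise, and the spread of the resulting forecasts indicates how sensitive the predictions are to controlled perturbations.

```python
# A minimal sketch of a perturbation-based robustness check: the same fitted
# model is applied to inputs with added noise, and the spread of the resulting
# forecasts indicates how sensitive predictions are to small data changes.
# The linear model and data are illustrative placeholders.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 5))
y = X @ np.array([1.0, 0.5, -0.8, 0.0, 0.3]) + rng.normal(scale=0.2, size=80)
model = LinearRegression().fit(X, y)

x_new = rng.normal(size=(1, 5))
baseline = model.predict(x_new)[0]

# Add controlled perturbations (e.g. 5% noise) to the new input many times
perturbed = x_new + rng.normal(scale=0.05, size=(500, 5))
spread = model.predict(perturbed)

print(f"Baseline forecast: {baseline:.2f}")
print(f"Forecast range under perturbation: [{spread.min():.2f}, {spread.max():.2f}]")
```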
Stress Testing: Kilian and Lütkepohl (2017[19]) discuss stress testing in economic and migration forecasting through extreme scenario simulations. Simulating extreme scenarios (e.g. sudden political crises or environmental disasters) is important to evaluate model stability.
Cross-Temporal Validation: Tashman (2000[20]) reviews cross-temporal and rolling origin validation methods. Performing back-testing across multiple historical periods helps to verify consistency over time, especially during disruptive events. This approach is a by-product of the workflow described in Carammia et al. (2022[16]).
Scenario Analysis: Predictions under hypothetical scenarios must be evaluated to assess the model’s adaptability to structural changes. This involves testing the model’s performance under various “what-if” conditions to understand its robustness and reliability. By simulating different hypothetical scenarios, such as economic shifts, policy changes, or unexpected events, it is possible to evaluate how well the model adapts to structural changes. This helps identify potential weaknesses, refine assumptions and improve decision making under uncertainty. Literature on this topic is still limited, as scenario analysis is very ad hoc and also requires the ability to generate scenarios through expert elicitation.
In all the above cases, the performance metrics summarised in the previous section are handy tools to assess the robustness of the forecasts.
Checklist:
Am I using the same metrics for robustness assessment as for performance validation?
Metrics such as RMSE, MAE, R², and AIC can be applied to evaluate both model accuracy and robustness.
Have I tested the model under different conditions or scenarios?
Robustness checks may involve varying input data, timeframes, or assumptions to ensure consistent performance.
Does the model perform reliably across different subgroups or data segments?
Consistent results across contexts increase confidence in the model’s generalisability.
Have I documented how the model responds to changes in assumptions or input data?
This helps determine the extent to which the model’s results are sensitive to external factors.
How to back-test ML/Stat models?
The traditional and alternative approaches to model performance validation discussed above are also, in practice, back-testing strategies, as they can be implemented using historical data only. Model performance validation and back-testing indeed use the same metrics. More precisely, model performance evaluation is an assessment of how well a model predicts outcomes based on historical data only. Back-testing, on the other hand, analyses how the model would have performed in a real-world past scenario and can also be performed on fictitious historical data. Compared to evaluation, back-testing is often used in finance or risk modelling, but much less in migration studies. It applies the model to historical data without retraining, mimicking real-time decision making and checking whether it would have led to effective outcomes.
Classic back-testing for forecast models
Rolling Forecast Origin: Bergmeir et al. (2016[21]) discuss rolling origin validation for time‑series models. It takes two forms: splitting data into rolling windows to evaluate forecasts iteratively over different periods, or testing models’ ability to generalise across time and adjust for seasonality or structural breaks. It combines the cross-temporal validation of Tashman (2000[20]) and the adaptive method described in Carammia et al. (2022[16]).
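The minimal sketch below illustrates a rolling-origin back-test on a toy series: the forecast origin moves forward one observation at a time, the forecaster (here a naive last-value placeholder, which could be replaced by any model refitted on the expanding window) produces a one-step-ahead forecast, and the errors are averaged across origins.

```python
# A minimal sketch of a rolling-origin back-test: the forecast origin moves
# forward one step at a time, the forecaster (a naive last-value placeholder)
# is applied to the expanding window, and one-step-ahead errors are collected.
import numpy as np

series = np.array([100., 104., 111., 118., 117., 125., 133., 140., 151., 149.,
                   158., 165., 172., 184., 190.])
min_train = 8
errors = []

for origin in range(min_train, len(series)):
    train = series[:origin]
    forecast = train[-1]            # replace with any model fitted on `train`
    errors.append(abs(series[origin] - forecast))

print(f"One-step-ahead MAE across {len(errors)} rolling origins: {np.mean(errors):.1f}")
```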
Model diagnostics in ML/Stat models
Residual Analysis: Durbin and Koopman (2012[22]) provide a general introduction to residual diagnostics for statistical time‑series models. These diagnostics evaluate whether residuals are white noise (uncorrelated, normally distributed, with constant variance). They can also detect overfitting by analysing residual patterns over the training and validation datasets. This can be seen as an extension of the error diagnostics discussed above (Chatfield, 2000[14]).
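The minimal sketch below illustrates two basic residual checks, assuming statsmodels is available: the mean residual (systematic bias) and the Ljung-Box test for leftover autocorrelation; the residuals are simulated placeholders.

```python
# A minimal sketch of residual diagnostics: residuals should have mean close to
# zero (no systematic bias) and show no remaining autocorrelation. The Ljung-Box
# test from statsmodels (assumed available) checks the latter; residuals here
# are simulated placeholders.
import numpy as np
from statsmodels.stats.diagnostic import acorr_ljungbox

rng = np.random.default_rng(2)
residuals = rng.normal(scale=5.0, size=60)   # replace with actual minus fitted values

print(f"Mean residual: {residuals.mean():.2f} (should be close to zero)")
lb = acorr_ljungbox(residuals, lags=[6], return_df=True)
print(f"Ljung-Box p-value at lag 6: {lb['lb_pvalue'].iloc[0]:.2f} "
      "(small values signal leftover autocorrelation)")
```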
Counterfactual Validation: Pearl (2009[23]) provides foundational methods for counterfactual validation in causal modelling. Causal models are used to evaluate counterfactual scenarios (e.g. simulating no-policy-change baselines). Counterfactual validation is not suitable for models that apply many predictors.
Checklist:
Is back-testing applied in migration forecasting studies?
While underused, back-testing can provide valuable insights into model reliability and should be more widely adopted.
Have I used back-testing to evaluate model performance over historical periods?
Back-testing applies the model to past data to assess how well it would have predicted known outcomes.
How can expert knowledge and scenarios be back-tested?
Expert knowledge is not usually back-tested, nor is it generally expected to be, unlike statistical or machine learning models. Expert knowledge often involves subjective judgment, qualitative insights, or hypothetical conditions that may not have occurred historically and may never occur in the future. Although there is no definitive literature on this point, expert knowledge can be stress-tested or validated using historical analogies, simulations, or comparative analyses. More precisely, a validity framework can be discussed in the following terms:
i. Historical alignment: if a past event closely resembles an expert-defined scenario, it would be possible to check whether the expert’s reasoning would have led to accurate predictions based on historical outcomes;
ii. Counterfactual analysis: using past data, it would be possible to test “what-if” scenarios by modifying certain variables to see if expert-driven insights would have correctly anticipated alternative outcomes;
iii. Simulation-based back-testing: if historical data are not available for a scenario, Monte Carlo simulations or agent-based models can be used to generate possible outcomes and test expert assumptions under varying conditions (see the sketch after this list);
iv. Benchmarking against model predictions: comparing expert-driven insights with statistical models applied to past data to highlight inconsistencies and/or refine assumptions.
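As a minimal sketch of point iii above, the simulation below draws migration flows under a hypothetical expert-defined crisis scenario by Monte Carlo and reports how often the flows exceed a chosen threshold; all scenario parameters are illustrative placeholders rather than elicited values.

```python
# A minimal sketch of simulation-based testing of an expert assumption: flows
# under a hypothetical scenario are drawn by Monte Carlo, and the share of runs
# consistent with the expert's claim (e.g. "flows exceed 200 000") is reported.
# The scenario parameters are illustrative placeholders, not elicited values.
import numpy as np

rng = np.random.default_rng(3)
n_runs = 10_000

# Scenario: baseline flows plus a shock whose size and probability stand in for
# (hypothetical) expert judgement about a crisis scenario
baseline = rng.normal(loc=150_000, scale=20_000, size=n_runs)
shock = rng.binomial(1, p=0.4, size=n_runs) * rng.normal(loc=80_000, scale=30_000, size=n_runs)
simulated_flows = baseline + shock

share = np.mean(simulated_flows > 200_000)
print(f"Share of simulated runs exceeding 200 000: {share:.0%}")
```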
A notable exception in the literature is Imbens and Rubin (2015[24]), who provide foundational methods for causal simulation in intervention evaluation. Validating against historical scenarios curated by experts ensures that model behaviour aligns with expected outcomes.
Synthetic Scenario Testing: Fraccascia et al. (2018[25]) explore scenario testing under synthetic conditions in complex system modelling. Similar ideas can be applied to the context of migration forecasting by constructing, for instance, synthetic scenarios reflecting plausible future conditions (e.g. economic downturns, climate shocks) and assessing model responses.
Simulated Interventions: Model robustness can be tested by introducing artificial interventions (e.g. introducing hypothetical policy changes) and comparing outcomes against expert expectations. See again Imbens and Rubin (2015[24]).
Expert-Elicited Feedback: Tetlock and Gardner (2015[11]) is a classic reference highlighting how expert knowledge can enhance model evaluation through structured feedback and scenario development. Running models with and without expert-suggested features or constraints allows one to evaluate the impact on accuracy and explainability.
Checklist:
Have I considered testing the influence of expert knowledge on the model?
While expert input is rarely back-tested, it is important to evaluate its impact where possible, for example by running the model with and without expert-suggested features or constraints to assess how expert knowledge affects accuracy and predictive power.
Box 7.2. How the United States evaluates forecasted numbers
The United States Office of Homeland Security Statistics (OHSS) Migration Analysis Center (MAC) conducts various types of evaluations to forecast unauthorised migration at the Southwest US-Mexico border.
First, historical evaluations are performed to compare forecasts to actual outcomes on a rolling basis by lookahead month. These evaluations use statistics to assess both accuracy (Median Absolute Percent Error) and bias (Median Percent Error), as well as overall performance for up to a six‑month lookahead period. Generally, the one‑month lookahead is the most accurate compared to the three‑month or six‑month lookaheads. However, in some cases, the median absolute per cent error may be lower for the six‑month lookahead if there has been substantial encounter volatility, which creates additional challenges in forecasting accurately. Diagnostics are used to identify sub-groups with the greatest potential for improvement and to test model adjustments before implementing changes. For example, the OHSS MAC identified Cuba, Haiti, Nicaragua, and Venezuela (CHNV) as countries where models required improvement. In January 2023, OHSS MAC removed the trend component from the models for these four countries, which resulted in improved forecast accuracy. While one‑off fluctuations – such as Venezuelan migration surges in early 2023 – may still be missed, the models adjust to new realities after 2‑3 months of a shift in migration patterns.
Additionally, mid-month evaluations are conducted to compare the one‑month lookahead forecast to the encounter actuals for that month, extrapolated from the month-to-date daily average. Lastly, model comparison evaluations assess the historical performance of multiple model specifications. Confidence intervals are calculated using the formula for forecast intervals from the forecast package in R. However, the assumption of growing uncertainty over time, typical of a true forecast interval, is not applied. Since these interval bounds are primarily used for projections or planning lines, maximising the precision of the intervals has not been a priority.
References
[8] Azose, J. and A. Raftery (2015), “Bayesian Probabilistic Projection of International Migration”, Demography, Vol. 52/5, pp. 1627-1650, https://doi.org/10.1007/s13524-015-0415-0.
[18] Barredo Arrieta, A. et al. (2020), “Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI”, Information Fusion, Vol. 58, pp. 82-115, https://doi.org/10.1016/j.inffus.2019.12.012.
[21] Bergmeir, C., R. Hyndman and J. Benítez (2016), “Bagging exponential smoothing methods using STL decomposition and Box–Cox transformation”, International Journal of Forecasting, Vol. 32/2, pp. 303-312, https://doi.org/10.1016/j.ijforecast.2015.07.002.
[7] Bijak, J. (2011), Forecasting International Migration in Europe: A Bayesian View, Springer Netherlands, Dordrecht, https://doi.org/10.1007/978-90-481-8897-0.
[6] Bijak, J. et al. (2019), “Assessing time series models for forecasting international migration: Lessons from the United Kingdom”, Journal of Forecasting, Vol. 38/5, pp. 470-487, https://doi.org/10.1002/for.2576.
[17] Bontempi, G., S. Ben Taieb and Y. Le Borgne (2013), “Machine Learning Strategies for Time Series Forecasting”, in Lecture Notes in Business Information Processing, Business Intelligence, Springer Berlin Heidelberg, Berlin, Heidelberg, https://doi.org/10.1007/978-3-642-36318-4_3.
[10] Brier, G. (1950), “Verification of Forecasts Expressed in Terms of Probability”, Monthly Weather Review, Vol. 78/1, pp. 1-3.
[9] Bröcker, J. and L. Smith (2007), “Increasing the Reliability of Reliability Diagrams”, Weather and Forecasting, Vol. 22/3, pp. 651-661, https://doi.org/10.1175/waf993.1.
[16] Carammia, M., S. Iacus and T. Wilkin (2022), “Forecasting asylum-related migration flows with machine learning and data at scale”, Scientific Reports, Vol. 12/1, https://doi.org/10.1038/s41598-022-05241-8.
[14] Chatfield, C. (2000), Time-Series Forecasting, Chapman and Hall/CRC, https://doi.org/10.1201/9781420036206.
[5] Czado, C., T. Gneiting and L. Held (2009), “Predictive Model Assessment for Count Data”, Biometrics, Vol. 65/4, pp. 1254-1261, https://doi.org/10.1111/j.1541-0420.2009.01191.x.
[25] De Domenico, M. (ed.) (2018), “Resilience of Complex Systems: State of the Art and Directions for Future Research”, Complexity, Vol. 2018/1, https://doi.org/10.1155/2018/3421529.
[22] Durbin, J. and S. Koopman (2012), Time Series Analysis by State Space Methods, Oxford University Press, https://doi.org/10.1093/acprof:oso/9780199641178.001.0001.
[4] Gneiting, T., F. Balabdaoui and A. Raftery (2007), “Probabilistic Forecasts, Calibration and Sharpness”, Journal of the Royal Statistical Society Series B: Statistical Methodology, Vol. 69/2, pp. 243-268, https://doi.org/10.1111/j.1467-9868.2007.00587.x.
[12] Gneiting, T. and A. Raftery (2007), “Strictly Proper Scoring Rules, Prediction, and Estimation”, Journal of the American Statistical Association, Vol. 102/477, pp. 359-378, https://doi.org/10.1198/016214506000001437.
[1] Hyndman, R. and G. Athanasopoulos (2018), Forecasting: Principles and Practice, OTexts.
[2] Hyndman, R. and A. Koehler (2006), “Another look at measures of forecast accuracy”, International Journal of Forecasting, Vol. 22/4, pp. 679-688, https://doi.org/10.1016/j.ijforecast.2006.03.001.
[24] Imbens, G. and D. Rubin (2015), Causal Inference for Statistics, Social, and Biomedical Sciences, Cambridge University Press, https://doi.org/10.1017/cbo9781139025751.
[19] Kilian, L. and H. Lütkepohl (2017), Structural Vector Autoregressive Analysis, Cambridge University Press, https://doi.org/10.1017/9781108164818.
[23] Pearl, J. (2009), Causality: Models, Reasoning and Inference, Cambridge University Press, USA.
[3] Runge, J. et al. (2023), “Causal inference for time series”, Nature Reviews Earth & Environment, Vol. 4/7, pp. 487-505, https://doi.org/10.1038/s43017-023-00431-y.
[20] Tashman, L. (2000), “Out-of-sample tests of forecasting accuracy: an analysis and review”, International Journal of Forecasting, Vol. 16/4, pp. 437-450, https://doi.org/10.1016/s0169-2070(00)00065-0.
[11] Tetlock, P. and D. Gardner (2015), Superforecasting: The Art and Science of Prediction, Crown, New York.
[13] Welch, N. and A. Raftery (2022), “Probabilistic forecasts of international bilateral migration flows”, Proceedings of the National Academy of Sciences, Vol. 119/35, https://doi.org/10.1073/pnas.2203822119.
[15] Zou, H. and T. Hastie (2005), “Regularization and Variable Selection Via the Elastic Net”, Journal of the Royal Statistical Society Series B: Statistical Methodology, Vol. 67/2, pp. 301-320, https://doi.org/10.1111/j.1467-9868.2005.00503.x.