After selecting an appropriate forecasting model (Chapter 4, Figure 4.1), identifying the right data becomes more straightforward. In some cases, the model itself is data-driven – such as for machine learning approaches – or its functional form is determined by data availability. However, beyond considerations such as migration category and time horizon, other critical dimensions must be taken into account when selecting suitable data (Figure 5.1).
Migration Anticipation and Preparedness
5. What data are necessary to conduct the forecasting exercise?
Figure 5.1. Decision tree to identify the best data according to migration forecasting models
How to assess the existing sources of data?
Reliable data are the foundation of migration forecasting. However, data on migration often suffer from gaps, variability in quality and availability, and sometimes discontinuity. Among the most serious constraints are timeliness and frequency. Most data collections, whatever the migration category, are published several weeks or months after the reporting period, which makes them of little use for short-term forecasting. Similarly, many data sources are available only at the yearly or monthly level, which again may not match the granularity of certain forecasting tasks. More generally, as increasingly sophisticated statistical models become available, along with the necessary software and computational capacity, data can become the main constraint on forecasting exercises. This subsection provides a framework to assess existing data sources, ensuring they meet the quality and availability requirements necessary for robust forecasting.
The data used for forecasting, of whatever type and provenance, should be subject to a quality assessment, not only to gauge the different issues that can potentially impact the forecasts, but also to assess the associated measurement uncertainty that can be propagated into the predictions. To enable comparisons, and improve replicability, such an assessment can follow a standardised protocol, adapted where needed to suit specific needs and forecasting tasks. Elements of such a protocol would include creating standardised meta information about the data. Assembling and documenting the inventory not only helps provide crucial information on measurement uncertainty but also helps ensure smooth continuity of the forecasting processes, for example following staffing changes, as well as enabling quality assurance.
Assessing data quality also makes it possible to map gaps in data series more precisely. These gaps should not be considered insurmountable hurdles, but rather opportunities for improvement. Indeed, gaps can be addressed by improving data collection, using alternative accurate data sources, or applying statistical techniques to fill them (such as imputation). If data sharing is part of the data collection process, it can also be improved through more regular contact with data providers and through data sharing agreements, including precise discussions of the advantages of better data sharing for both providers and users. Going forward, the assessment of data gaps can be used to argue for improvements to data infrastructure. To support these actions, financial resources must be carefully targeted and organised, starting with the most troublesome data gaps preventing the forecasting model from being fully effective.
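To illustrate the simplest of these gap-filling techniques, the sketch below imputes two missing months in a hypothetical monthly series using linear interpolation with pandas. The figures are illustrative; longer gaps or structural breaks would call for model-based methods or auxiliary data sources rather than this mechanical fix.

```python
import pandas as pd

# Hypothetical monthly series of asylum applications with two missing months.
idx = pd.date_range("2023-01-01", periods=8, freq="MS")
apps = pd.Series([1200, 1350, None, 1500, None, 1620, 1580, 1700], index=idx)

# Linear interpolation is a simple imputation technique suitable for short,
# isolated gaps in an otherwise smooth series.
filled = apps.interpolate(method="linear")

print(filled)
```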
Although quality requirements depend on the objectives of the forecasting task (mainly the type of migration and the forecasting horizon), the approach to quality assurance can be similar, whatever the migration category is. The first step consists of selecting the main criteria, which can be summarised in an assessment matrix. Important criteria that will likely be relevant to most forecasting tasks are definitions, frequency, coverage, accuracy, timeliness, granularity and continuity of available data (for a similar approach, see Bijak, Forster and Hilton (2017[1])).
Definitions: What is the object of measurement, and how clearly are the indicators defined? Data definitions refer to the descriptions or specifications that explain the meaning, format, and structure of data elements in a dataset.
Frequency: How often are data points collected or recorded over a given period of time? Data frequency determines the granularity and temporal resolution of a dataset.
Coverage: To what extent does the dataset capture relevant information across specific dimensions – such as time, geography, population, or subject areas? Coverage determines how comprehensively a dataset represents the phenomenon it aims to describe.
Accuracy: To what degree does the dataset correctly represent the real-world phenomena or conditions it aims to describe? Data almost always incorporate some degree of uncertainty, but can be more or less transparent about the biases they may have.
Timeliness: To what extent are data available and up to date in relation to the use case? Timeliness depends on how quickly data are collected, processed, and made accessible after an event or observation. Timely data can be essential for short-term forecasts, but not necessarily for longer-term projections.
Granularity: What level of detail is required? For example, labour migration forecasting may require data disaggregated by industry, occupation and, for some countries, by region (see Box 8.2 for an example). Forecasting asylum applications may require not only sociodemographic information but also the last country of residence (transit) to ensure accurate estimates.
Continuity: Will we have access to the data in future as well? Sometimes external factors – such as changes in data collection technology, software platforms, institutional priorities or funding – can affect the ongoing availability of the data. Furthermore, will data characteristics (with respect to all previous criteria) remain stable over time?
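The criteria above can be summarised in the assessment matrix mentioned earlier. The sketch below shows one minimal way to do this in pandas; the two data sources and their ordinal scores are purely illustrative, not actual assessments.

```python
import pandas as pd

# Hypothetical assessment matrix: each data source is scored (1 = poor,
# 3 = good) against the quality criteria discussed above.
criteria = ["definitions", "frequency", "coverage", "accuracy",
            "timeliness", "granularity", "continuity"]
matrix = pd.DataFrame(
    [[3, 2, 3, 3, 1, 2, 3],   # illustrative: residence permit register
     [2, 3, 2, 2, 3, 1, 1]],  # illustrative: web-search index
    index=["permit_register", "web_search_index"],
    columns=criteria,
)

# An unweighted mean gives a first-pass ranking; in practice the weights
# should reflect the forecasting task (e.g. timeliness matters more for
# nowcasting than for long-term projections).
matrix["overall"] = matrix[criteria].mean(axis=1)
print(matrix["overall"])
```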
More detailed metadata may also include the data type (whether they relate to migration directly or to its drivers, i.e. contextual variables) and collection methods, in addition to a multidimensional quality assessment according to a pre‑defined set of criteria (Nurse, Hinsch and Bijak, 2023[2]). Data currently used by OECD countries for forecasting, described in Table 5.1, can be assessed systematically based on these criteria.
Probably the most important trade‑off is between timeliness (which always goes along with frequency) and accuracy. Accurate indicators are subject to strict protocols in terms of collection, validation and standardisation, which require a significant amount of time and effort. In turn, highly timely data may be subject to limited quality assessment procedures. Granularity of data may also be inversely related to both accuracy and timeliness, as more detailed data can be harder to obtain quickly and reliably.
Table 5.1. What data sources are currently used in OECD countries and by international organisations and agencies to forecast migration flows?

| Institution | Migration type | Quantitative variables | Qualitative variables |
|---|---|---|---|
| Frontex (CIRAM) | Border crossings | Reports on illegal border crossings by entry point; country reports on illegal stay detections, facilitator detections, irregular migrant apprehensions, refusals of entry, asylum applications, fraudulent document user detections, smuggled goods, return decisions, effective returns and passenger flows; data from VIS (Visa Information System), SIS (Schengen Information System) and Eurodac; airport passenger reports on passenger inflows, number of staff at borders and passenger profiles (country of origin, reason for entering the country, trip details, transportation means); police reports on cross-border crimes (from EUROSUR and EU Member States (MS) reports: wanted persons, criminal records, intelligence, document thefts) | Intelligence reports on the situation in third countries (push factors, economic crises, large or small incidents, difficulty of routes, health threats), the situation at borders (such as information on smugglers' goals, motives, modus operandi and capabilities; meteorological conditions; EU MS staff skills, equipment, operational practices at the border, interoperability) and the situation in EU MS (policy changes, pull factors, procedures at the border, effectiveness of countermeasures) |
| Germany | Migration data | Event data, unstructured textual data | |
| USA MAC | | Government of Mexico encounters; US unemployment rate; remittances (Mexico, Guatemala, Colombia, Nicaragua); ratio of removals to Title 8 encounters; fatalities from violence against civilians in the origin country; fatalities from other violence in the origin country; disasters | Judgmental adjustments from subject matter experts (SMEs) |
| Belgium | Asylum applications | Asylum applications | Expert opinion from countries of origin, other EU MS and analytical experts |
| EUAA | Asylum applications | Asylum applications | Expert opinion from other EU MS |
| France | Asylum applications | Asylum applications, Dublin statistics | Expert opinion from the Ministry of the Interior and the Ministry of Foreign Affairs gathered in quarterly meetings, with inputs from EUAA reports, Frontex reports on Irregular Border Crossings (IBCs), French border police reports on border crossings, Ministry of Foreign Affairs country reports, French Refugee Office reports |
| Germany | Asylum applications | Asylum applications | |
| Ireland | Asylum applications | Asylum applications | |
| Netherlands | Asylum applications | Asylum applications | Expert opinion (other EU MS) |
| Norway | Asylum applications | Asylum applications, border crossings in EU countries, different indicators in countries of origin and smuggling routes | |
| Poland | Asylum applications | Asylum applications | Expert opinion |
| Switzerland | Asylum applications | | Situation in the countries of origin, situation on the routes, economic situation in the transit countries, measures taken by European destination countries in the area of (asylum) migration, Switzerland's measures in the area of (asylum) migration, implementation of the Dublin Convention and the EU Pact on Migration and Asylum by Switzerland |
| USA CIS | Family migration | Applications for family-based cases by family type | Judgmental adjustments on recent procedure or policy changes |
| Australia | Labour migration | | Bilateral consultations with e.g. the Department of the Treasury, other federal administrations, state and territory administrations, researchers |
| Japan | Labour migration | Target GDP, capital stock, baseline labour force, per capita GDP growth rate by country, total population by country, number of foreign workers coming to Japan (gross and net flows) | Scenario based on further automation |
| Korea | Labour migration | Labour supply by sex, age, education, sector and occupation; productivity growth; technological change | Scenarios based on the substitutability of the labour force across sectors and skill levels |
| Poland | Labour migration | Work permits | Qualitative research, using in particular migration intentions in the main countries/regions of origin |
| Türkiye | Labour migration | | Expert opinion from public institutions, international organisations, academia and social partners |
| Canada | All categories of regulated flows | Demand forecasting uses a mix of IRCC operational data, which is available daily and can be grouped into weekly or monthly time series, and external data from organisations such as Statistics Canada, Transport Canada, the International Air Transport Association (IATA), the Conference Board of Canada, and Oxford Economics. For modelling, monthly time series are usually used. These external datasets offer a wide range of socio‑economic indicators accessed through institutional subscriptions. Depending on the source, the information may be released annually or monthly and is updated on a monthly or quarterly basis. The indicators include economic measures that shape travel trends through changes in financial conditions and consumer behaviour, as well as social and geopolitical measures, since conflicts, disruptions, and other unexpected events can strongly influence global travel patterns. Taken together, these inputs provide a better overall picture and help improve forecast accuracy. | Demand forecasting incorporates a wide range of qualitative inputs to complement quantitative data. This includes structured collaboration with subject-matter experts, environmental scanning to identify emerging trends, and the review of outbound travel reports. These insights are drawn from multiple reputable sources – such as Statistics Canada, Transport Canada, the International Air Transport Association (IATA), the Conference Board of Canada, and Oxford Economics – which provide industry intelligence and socio‑economic outlooks that influence travel behaviour. Together, these qualitative elements help build a more comprehensive and forward-looking view of future demand. |
| France | | Residence permits by reason | |
| Norway | | Residence permits by reason | |
| Finland* | | Quantitative operating environment data and immigration driving factors | Qualitative operating environment data and immigration driving factors; expert opinion from the Ministry of the Interior gathered in a foresight team |
| Sweden | | Asylum applications, residence permits by reason | Scenarios prepared by experts |
| United Kingdom | | Long-term international migration statistics by reason (visas, British nationals), long-run stay rates | Scenarios on changes in migration policy and the situation in countries of origin |

* Family migration not included.
Checklist:
What sources of data do I have to mobilise from other ministries or from private service providers?
Is access to the data guaranteed over time?
Anticipating changes in data collection technology, software platforms, institutional priorities, database structure or funding may be necessary.
What is the level of quality and granularity of the data? Can it be improved?
The level of quality and granularity impacts the model choice. If it prevents the model from being fully effective, improving data collection and infrastructure may be necessary.
Are data available under similar timeframes?
The data frequency and differences in timeframe and data availability (calendar vs. fiscal year, final vs. provisional) determine the forecast frequency.
Where to find and how to incorporate qualitative data?
Qualitative data have long been understood as non-numerical information describing qualities, characteristics, or attributes of a given phenomenon. As such, they capture the subjective and often contextual aspects of an issue, providing insights into the why and how of behaviours, attitudes and processes. Text analysis models, however, make it possible to quantify traditionally non-numerical data. Moreover, the explosion of online information and the growing availability of computational capacity have made available large volumes of (often real-time) traditionally qualitative data. This is the case, for example, with digital traces left by various types of behaviour, including human movements. As a result, “qualitative data” are increasingly used to describe data that originate from non-numerical sources but may be (and often are) turned into numerical quantities.
Qualitative data can provide critical insights into migration drivers, complementing quantitative datasets. Key sources include digital sources, expert knowledge, and information such as reports and open questions in surveys. Among digital sources potentially relevant to migration forecasting are event data, satellite imagery, web-search data, data on air passengers and social media data. OECD countries have been slow to use these for forecasting purposes (Table 5.1). However, each of these types of data may provide insights for forecasting all kinds of migration categories, especially for more complex forced migration phenomena. They can be used as a proxy for motivations or intentions to migrate, actual mobility, displacements, attitudes, financial transactions, and so forth (for recent reviews, see Cesare et al. (2018[3]), Sirbu et al. (2020[4]), Iacus et al. (2022[5])).
Data on online searches have been used either in isolation (Böhme, Gröger and Stöhr, 2020[6]) to predict total international migration or combined with other data (Carammia, Iacus and Wilkin, 2022[7]) to predict asylum applications. Such data are made freely available by Google Trends (https://trends.google.com/) and can be explored via user-friendly dashboards and downloaded in tabular format. The main parameters that can be set include keywords (or topics, pre‑aggregated sets of keywords that have the advantage of being language‑insensitive), location (country or sub-national levels), and time window. Other digital data relevant to migration modelling concern events around the world. Two established sources of such data are the Global Database of Events, Language, and Tone (GDELT) and the Armed Conflict Location & Event Data Project (ACLED). GDELT (https://www.gdeltproject.org/) is a repository of many types of geolocated and content-coded events reported in the world’s broadcast, print and web media, in more than 100 languages. ACLED also collects and analyses data on political violence and protest events worldwide (see Box 5.1 for more details). Both sources are particularly useful for forecasting forced migration such as asylum and irregular border crossings.
Box 5.1. Examples of digital sources collecting qualitative data
Google Trends data
Google Trends data do not provide information on absolute levels of searches, but on relative levels (ranging between 0‑100). When a single indicator is selected, the peak of the series represents the point in the selected time window with the maximum volume of searches for that topic at that point in time (and space). The rest of the points are relative to that peak. When instead more than one series is selected, they are scaled against each other. This means that for most modelling purposes, single indicators must be downloaded separately so that they can then be analysed individually. Running the analysis on different time windows results in different data, since, if new peaks in the volume of searches happen, the series must be re‑scaled. Several statistical packages exist that support downloading (via the API), processing and analysing the data. Among R packages, “gtrendsR” is the most established and is available on CRAN (https://cran.r-project.org/web/packages/gtrendsR/index.html). Newer packages include “trendecon” (https://trendecon.github.io/trendecon/index.html) or “gtrendsAPI” (Correia, 2024[8]).
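The rescaling behaviour described above can be made concrete with a short sketch. The raw search volumes below are hypothetical (Google never exposes them); the point is that querying a series alone scales it to its own peak, whereas querying two series together scales both against the single joint peak.

```python
# Hypothetical raw search volumes for two topics over four periods.
raw_a = [40, 80, 200, 120]
raw_b = [10, 25, 50, 30]

def rescale(series, peak):
    """Rescale a series to 0-100 relative to a given peak, as Google Trends does."""
    return [round(100 * v / peak) for v in series]

# Queried alone, each series is scaled to its own peak and reaches 100...
alone_a = rescale(raw_a, max(raw_a))
alone_b = rescale(raw_b, max(raw_b))

# ...queried together, both are scaled to the joint peak (200 here),
# so topic B never gets close to 100. This is why single indicators are
# usually downloaded separately for modelling purposes.
joint_peak = max(raw_a + raw_b)
together_b = rescale(raw_b, joint_peak)

print(alone_b, together_b)
```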
GDELT
GDELT allows for real-time tracking of events, sentiment analysis and geospatial mapping of crises or trends. GDELT’s methodology emphasises coverage, automation, and scalability. The sources of data comprise various global news media, including traditional media outlets and blogs.
Data ingestion in GDELT is continuous, with updates every 15 minutes. Data collection is fully automated and based on scraping content from the monitored media sources, with non-English content translated automatically. Events are identified and coded with text-processing algorithms, with relevant codes describing the type of event, its location, the actors involved, and even the tone based on sentiment analysis. The largely automated approach to data collection and coding results in high coverage, frequency, and timeliness. This may be to the detriment of accuracy, although data can be inspected ex post by users, as each observation provides a link to the original news item (links are, however, sometimes broken). Another potential source of bias is underreporting in media-poorer regions (and over-reporting in media-rich ones), in addition to ordinary biases in coverage and tone within media sources. GDELT provides access to its API, and R packages are available for querying and analysing the data. GDELT also permits direct analysis via Google BigQuery. The EUAA has already aggregated GDELT data relevant to asylum-related migration into a composite indicator, referred to as the Push Factor Index (PFI).
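A simplified sketch of how coded event records can feed a forecasting model is shown below: events are filtered by type and counted per origin country and month, in the spirit of composite indicators such as the EUAA's Push Factor Index. The field names and records are illustrative, not the actual GDELT schema.

```python
from collections import Counter

# Illustrative event records: each carries a period, an (anonymised) country
# code, a coded event type and a sentiment-based tone score.
events = [
    {"month": "2024-01", "country": "XX", "event_type": "conflict", "tone": -7.2},
    {"month": "2024-01", "country": "XX", "event_type": "protest",  "tone": -3.1},
    {"month": "2024-01", "country": "YY", "event_type": "conflict", "tone": -5.0},
    {"month": "2024-02", "country": "XX", "event_type": "conflict", "tone": -6.4},
]

# Count conflict events per (month, country): a crude push-factor proxy
# that could enter a forecasting model as a predictor.
conflict_counts = Counter(
    (e["month"], e["country"]) for e in events if e["event_type"] == "conflict"
)
print(conflict_counts)
```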
ACLED
Its methodology emphasises comprehensive coverage, high granularity, and systematic validation to provide reliable, real-time data. The sources of data include media reports, partner networks, academic and research publications, social media, and government and international reports. The data collection process includes more rigorous protocols for event identification, coding, validation and quality control, with the involvement of human researchers. Data, in turn, are updated on a weekly basis. Although not comparable to GDELT in this respect, this is still impressive timeliness considering the validation efforts, and may be sufficient for most forecasting tasks related to forced migration, except of course for early warning and nowcasting.
ACLED provides access to its API, and R packages are available for querying and analysing the data.
1. See for example GDELTtools (https://cran.r-project.org/web/packages/GDELTtools/index.html).
2. See for example acled.api (https://cran.r-project.org/web/packages/acled.api/index.html) and acledR (https://github.com/dtacled/acledR).
Other important digital sources of qualitative data are social media platforms, including Twitter/X, Facebook, Instagram, and LinkedIn (Iacus et al., 2022[5]). Applications can include sentiment analysis to measure migration intentions or attitudes towards migration policies or events; monitoring or nowcasting stocks and flows (Zagheni et al., 2014[9]); or the use of advertising data, which have been shown to anticipate migration movements (Minora et al. (2022[10]), Zagheni, Weber and Gummadi (2017[11])).
Challenges and limitations of social media data and digital sources include data privacy issues, bias and representativeness, and noise (Iacus et al., 2022[5]). In particular, data from web scraping or APIs may not be fully anonymised, which might lead to breaches of personal data privacy. Definitions of population and migration in digital sources usually do not comply with official statistical definitions. Moreover, it remains difficult to fully assess the representativeness of these sources, given that these tools are not used by the entire population of interest (selectivity bias), and that their use may change over time due to technological changes or evolving audiences. More broadly, the use of these sources may also vary considerably across countries, which does not always ensure comparability between destination and origin countries.
Another important challenge is the regulatory and policy instability of social media and digital data (Iacus et al., 2022[5]). Data sharing between private companies and public and research institutions varies considerably and cannot be taken for granted. Single platforms can change their data access policy at any time. A good example of this risk is X, which was widely popular among researchers for providing free access to its database via API, but suddenly moved to paid subscription plans in May 2023, a few weeks after a change in ownership. Finally, digital sources may face technical issues, such as power outages affecting their infrastructure, which can create breaks in time series, such as the one faced by GDELT in July 2025. Although the use of these innovative sources helps improve migration forecasts, all these caveats regarding their data quality need to be taken into account when choosing to include them in forecasting models.
Alongside innovative data sources, more traditional qualitative data can still play an important role in migration analysis and forecasting. Experts can be a crucial source of information. Besides providing insights into migration processes which can help to design or fine‑tune models, expert knowledge and intuition can moderate the forecasting models or feed directly into them. Qualitative expertise on scenario plausibility, expert attribution of probabilities to forecasting scenarios, Delphi surveys and other expert elicitation methods are covered in Chapter 6.
Checklist:
What sort of qualitative data are available and/or accessible?
Digital sources such as Google Trends, GDELT or ACLED are increasingly used in migration forecasting as qualitative data, although their use has some data quality caveats to be taken into account.
How to quantify policy indicators for statistical models?
Migration policies of various types may be relevant predictors of migration. Quantifying policy indicators enables analysts to evaluate their impact and integrate this knowledge into statistical models. This subsection presents some approaches to identify, measure, and model policy indicators. It discusses their advantages and limitations and proposes ways of integrating them.
As an analytical problem, developing policy indicators is a challenge similar to other qualitative sources, as non-numerical information must be quantified. However, compared to other qualitative sources, public policies more easily lend themselves to systematic classification. Indeed, several projects have generated quantitative indicators of migration policy. Here we briefly review the most established ones (for a comprehensive review, see Scipioni and Urso (2017[12])). There is, however, a limitation common to all datasets developed within the framework of academic research projects: they are not maintained by governmental institutions and typically rely on external financial support. At the end of their lifecycle, they either secure structural support or risk being discontinued – or, at best, updated at irregular intervals.
1. IMPIC (Immigration Policies in Comparison)
IMPIC (Helbling et al., 2017[13]) quantifies immigration policy restrictiveness across OECD countries, focussing on different migration categories (http://www.impic-project.org/). It covers 33 OECD countries from 1980 to 2010 and contains expert-coded assessments of policy restrictiveness. IMPIC categorises policies into four areas (labour migration, family reunification, asylum policies, and co‑ethnic migration), using a 0‑1 scale where 0 is least restrictive and 1 is most restrictive. Its strengths include the comprehensive coverage across time and countries, a differentiation among policy areas allowing for detailed analysis, and the use of a structured, transparent coding system with expert validation. The key limitation is the limited temporal coverage (1980-2010), which is an important issue in the migration context, as migration policies may change rapidly. Temporal coverage is a recurrent limitation of coding projects developed within academia, which lack continuous external financial support to ensure ongoing updates.
2. MIPEX (Migrant Integration Policy Index)
MIPEX, which evaluates migrant integration policies across multiple domains, is run by an NGO. It covers 56 countries (mostly Europe, but also Canada, the United States, Australia, and others), assessing policies in eight areas: labour market mobility, family reunion, education, political participation, permanent residence, access to nationality, anti-discrimination and health (http://www.mipex.eu/). It uses a 100‑point scale, where 100 indicates the best integration conditions. Key variables include the rights granted to migrants in each area, the legal framework and its implementation, as well as some specific policy measures (such as language training or voting rights). MIPEX datasets also include a synthetic indicator measuring the overall policy approach to integration. MIPEX’s strengths include its large coverage, which makes it the most comprehensive integration index available, and the longitudinal comparisons available for many countries. Limitations include the focus on integration policies only, with no measures of restrictiveness of immigration policy beyond those that have an impact on integration. MIPEX may therefore be particularly poorly adapted to quantifying asylum policies or irregular migration enforcement law. Another limitation is a subjective scoring system based on expert judgement.
3. DEMIG Policy Database (Determinants of International Migration)
DEMIG tracks migration policy changes over time, differentiating between policies that become more restrictive or open (http://www.migrationinstitute.org/data/demig-data/demig-policy-1). The version originally developed at Oxford’s International Migration Institute (de Haas, Natter and Vezzoli, 2015[14]) covered 45 countries for the period 1945-2013 (and even further back for some countries). Recently, the dataset has been updated and extended within the framework of the QuantMig project (https://quantmig.eu/data_and_estimates/policy_database/), which released a version covering 31 European countries between 1990-2020 (Schreier et al. (2023[15]); for an application with an analysis of the “European migration policy mix”, Czaika et al. (2023[16])).
While for most policy indicators the level of analysis (single observation) is a country at a point in time (typically years), in the DEMIG dataset each observation is an occurrence of policy change.1 Each policy change is coded as restricting (e.g. introducing visa restrictions, tightening deportation policies) or liberalising (e.g. expanding labour migration quotas). Key variables are the policy area (e.g. labour migration, asylum), policy change type (expansion vs. restriction) and implementation mechanism (laws, executive orders, agreements). Clearly, coverage (the number of countries and the historical depth) is an important strength of DEMIG, as is the quality of its definitions of policy categories. The main limitations are the lack of measures of absolute levels of restrictiveness/openness and the lack of a complete matrix of country/year scores. Moreover, a future update of the dataset is not guaranteed, as with other academic projects dependent on external financial support.
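One practical way to work around the missing country/year matrix is to aggregate the policy-change events yourself. The sketch below, using hypothetical DEMIG-style records, sums restrictive (+1) and liberalising (-1) changes by country and year and then cumulates them into a relative (not absolute) restrictiveness path.

```python
import pandas as pd

# Hypothetical DEMIG-style records: one row per policy change,
# coded +1 (restrictive) or -1 (liberalising).
changes = pd.DataFrame({
    "country": ["AA", "AA", "AA", "BB"],
    "year":    [2001, 2001, 2003, 2002],
    "code":    [1, 1, -1, -1],
})

# Net change per country-year, pivoted into a country x year matrix.
net = (changes.groupby(["country", "year"])["code"].sum()
              .unstack(fill_value=0))

# Cumulating over years gives each country's relative restrictiveness path
# (levels are only meaningful relative to the starting year).
index = net.cumsum(axis=1)
print(index)
```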
4. Dataset of World Refugee and Asylum Policies (DWRAP)
The DWRAP, developed by Blair, Grossman and Weinstein (2021[17]), is the first global resource that compiles de jure asylum and refugee policies (https://datanalytics.worldbank.org/dwrap/).
This dataset codes national laws into a scored index relevant to forcibly displaced populations across 193 countries, from 1951 (the year of the UN Refugee Convention) to 2022. It encompasses 54 indicators on five core dimensions: Access to asylum rights (the ease of entrance and security of status, such as non-admission policy, rights exclusion, stay, penalty for irregular entry, accompanying family rights, family reunification, appeal procedure, subsequent applications); access to services (provision of public services and welfare, such as education, vocational training, language courses, affirmative action, care, health cost, sick foreigner procedure, subsidies, welfare); access to livelihoods (the ability to work and own property, such as access to jobs, self-employment, regulated professions, tax policy, real estate access, financial goods seizure, intellectual property, rent rights); movement (free‑mobility, encampment policies, access to documentation and their costs); and participation in society (citizenship and political rights, such as naturalisation procedure, duration of residence to access rights, unaccompanied minor procedure, voter turnout, right to association). All indicators are compiled into five core dimension scores and then into a “comprehensive” score. Unlike IMPIC or DEMIG, the DWRAP project is intended to be transferred to JDC (the Joint Data Center on Forced Displacement from the World Bank and UNHCR), enabling potential updates on a regular basis.
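The two-step aggregation described above (indicators into dimension scores, then into a comprehensive score) can be sketched as follows. The indicator values and the simple-mean aggregation are illustrative assumptions, not DWRAP's actual indicators or weighting scheme.

```python
# Illustrative indicator scores grouped by DWRAP's five core dimensions.
indicators = {
    "access":        [0.8, 0.6, 0.7],
    "services":      [0.5, 0.4],
    "livelihoods":   [0.9, 0.7, 0.8, 0.6],
    "movement":      [0.3, 0.5],
    "participation": [0.4, 0.6],
}

# Step 1: average the indicators within each dimension.
dimension_scores = {d: sum(v) / len(v) for d, v in indicators.items()}

# Step 2: average the dimension scores into a single comprehensive score.
comprehensive = sum(dimension_scores.values()) / len(dimension_scores)
print(dimension_scores, round(comprehensive, 3))
```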
5. UN-DESA Immigration Indicators within the World Population Policies Database
The United Nations Department of Economic and Social Affairs (UN-DESA) provides migration policy indicators through its World Population Policies Database (WPPD) (https://www.un.org/development/desa/pd/content/world-population-policies). These indicators focus on governments’ views and policies on international migration, covering aspects such as immigration levels, emigration, integration, and specific migration categories (e.g. skilled workers, refugees). This dataset has comprehensive spatial and temporal coverage, covering 195 countries and territories since 1976. However, it is updated only approximately every five years.
UN-DESA data focus on government perceptions and policy stances on migration, meaning they do not directly measure migration policy outputs. Key aspects include immigration, emigration, naturalisation, integration, border control, and refugees. The methodology is based on government-reported data collected via UN surveys and reports. Measures of governments’ policy objectives include immigration levels (increase, maintain, decrease), emigration policies (facilitating or restricting outflows), skilled migration (attracting high-skilled workers), refugee and asylum policies, and integration policies (citizenship, residency requirements). The dataset uses categorical rather than continuous quantitative measures, classifying policy types into three categories: no intervention (neutral stance), restrictive measures (e.g. tightening immigration rules) and liberal measures (e.g. encouraging immigration). In this sense, the approach resembles DEMIG. Limitations include the lack of direct policy measures (the data reflect self-reported government positions, which may introduce bias), categorical (not numerical) classifications, and limited granularity compared to datasets such as IMPIC, MIPEX or DEMIG.
6. The OECD Indicators of Talent Attractiveness (ITA)
The OECD ITA is the first comprehensive tool designed to evaluate the strengths and weaknesses of OECD countries in attracting and retaining different types of talented migrants. It assesses countries across seven core dimensions (Quality of Opportunity, Income and Tax, Future Prospects, Family Environment, Skills Environment, Inclusiveness, Quality of Life) and Visa and Admission Policy. Initially developed in 2019, the tool also provides post-simulation results by introducing the most favourable migration policies for each country. The 2023 edition of the ITA (http://www.oecd.org/en/data/tools/talent-attractiveness-2023.html) expands on the previous version by including four categories of talented migrants: highly educated workers, foreign entrepreneurs, university students, and start-up founders. Additionally, the new edition includes an expanded set of dimensions (e.g. health) for assessing a country’s overall attractiveness. ITA is planned to be updated every three to five years.
In addition to the six indicators discussed above, Carammia and Iacus (2025[18]) are working on a method to extract the underlying shared variation from all these datasets, resulting in synthetic policy indicators. The approach is based on an algorithm developed for extracting continuous measures from heterogeneous survey data (Stimson, 2018[19]), something that standard factor analysis or principal component analysis cannot do because of the sparse nature of the data. The result is a set of continuous policy indicators, which can be estimated at different frequencies. The algorithm can ingest data at different frequencies and can work with matrices of varying density in terms of temporal coverage. In this way, the resulting indicator can fill temporal gaps among the various indicators. Moreover, because the resulting indicators describe the shared variation among the underlying data sources, they should also carry a smaller measurement error.
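To illustrate the general idea of extracting a shared latent signal from sparse, heterogeneous indicator series, the sketch below standardises each indicator on its observed years and averages whatever is available per year. This is a heavily simplified illustration with invented toy values, not the dyad ratios algorithm of Stimson (2018[19]) or the Carammia and Iacus method.

```python
import numpy as np

# Toy panel: rows = years, columns = three hypothetical policy indicators
# observed at different frequencies (np.nan = not measured that year).
X = np.array([
    [0.2,    np.nan, 10.0],
    [0.4,    1.5,    np.nan],
    [np.nan, 1.8,    14.0],
    [0.9,    np.nan, 18.0],
    [1.1,    2.6,    np.nan],
])

# Standardise each indicator on its observed years only, so that series
# measured on different scales become comparable.
mu = np.nanmean(X, axis=0)
sd = np.nanstd(X, axis=0)
Z = (X - mu) / sd

# The latent indicator for each year is the average of whichever
# standardised indicators happen to be observed that year, which
# fills temporal gaps as long as at least one source covers the year.
latent = np.nanmean(Z, axis=1)
print(latent)
```

The actual algorithm additionally rescales each source against the latent series and iterates; this one-pass version only conveys how sparse sources can be pooled into a continuous measure.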
To summarise, the available policy indicators differ considerably. Datasets vary in their objective, with some specialised in particular policy areas (e.g. MIPEX on integration policy); in their spatial and temporal coverage, covering different countries and time periods; and in their update frequency. A general limitation is timeliness: some datasets are no longer regularly updated or have been discontinued (as in the case of IMPIC), while the future updates of others are uncertain (DEMIG).
Timeliness is the most serious limitation on the usefulness of these indicators as inputs to forecasting models. Policy measures, acting as pull factors, can rapidly affect migration flows, so recent and frequently updated indicators are needed as input variables for forecasting models and exercises. As they stand, most of these indicators are better suited to analysing the past relationship between policy changes and migration flows. Supplementing them with recent, high-frequency data such as social network activity or web search engine queries could help address the timeliness issue.
Checklist:
Is it possible to use some existing database on policy indicators to improve forecasting models? Is that dataset accurate and up to date, and does it focus on the right issues?
Should a specific parametrisation of the migration policies in my country be developed?
To be useful, such a parametrisation needs to ensure that policy impacts are assessed neutrally.
How to maintain policy indicators when most have unreliable funding and are infrequently updated to incorporate policy changes?
As policy indicators datasets typically rely on external financial support, regular updates require renewed funding and knowledge transfers.
Which software best fits forecasting, causal effect estimation and projection?
Forecasting migration flows requires robust tools capable of handling diverse data types, forecasting horizons, and methodological approaches. Selecting the right software depends on the specific use case, the migration category covered, the data requirements, and the level of expertise among practitioners. Below is an overview of practical software packages in R and Python suited for time series analysis, forecasting, and causal effect estimation, tailored to the context of migration forecasting. R and Python are among the most widely used platforms for forecasting due to their extensive libraries, flexibility, and active user communities, and they are well suited to both traditional statistical models and modern machine learning approaches. We consider three large families of problems that can be addressed in R or Python: i) time series analysis and forecasting; ii) causal effect estimation for policy evaluation; and iii) long-term projections. This list is not exhaustive, but it contains important building blocks. Most R packages and Python libraries come with online resources: R packages usually include a so-called vignette, or even journal papers, containing many examples and reproducible code, while Python libraries often come with a notebook that can be executed online step by step. Note that both R and Python have low-level libraries and object types for time series data, which are used across all the libraries and packages mentioned in this section.
1. Software for time series analysis in the context of migration forecasting
The R package forecast (Hyndman and Khandakar (2008[20]); Hyndman et al. (2024[21])) is an ideal starting point for traditional time series models such as ARIMA; it handles exponential smoothing, seasonal decomposition, and automated model selection and parameter tuning, making it suitable for practitioners aiming for efficient workflows. ARIMAX is not included in the forecast package but is covered in the TSA package. On the Python side, statsmodels (Seabold and Perktold, 2010[22]) supports ARIMA, SARIMA, ARIMAX and other classical econometric models, and includes diagnostic tools to validate assumptions and assess model performance. As the standard approach to time series extrapolation is to apply ARIMA models, typically within the frequentist statistical paradigm for long enough series (as a rule of thumb, at least 20 observations), most formal forecasting models for all migration categories, as well as for net migration, rely on time series analysis (Bijak et al., 2019[23]). These models, however, perform either relatively well or relatively poorly depending on the type of migration and data availability. For example, non-stable flows such as asylum can be forecast using models that assume non-stationarity (e.g. a random walk model), while more stable labour migration flows show more orderly features than a non-stationary model would predict.
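The random walk with drift mentioned above for non-stable flows can be sketched in a few lines of Python without any specialised package. The monthly asylum application counts below are invented for illustration; the forecast and its rough 95% interval come directly from the first differences of the series.

```python
import numpy as np

# Toy monthly asylum application counts (hypothetical figures).
y = np.array([900, 950, 1100, 1080, 1200, 1350, 1400, 1380, 1500, 1620], float)

# Random walk with drift: y_t = y_{t-1} + c + e_t.
# The drift c is the mean of the first differences; their standard
# deviation gives a rough measure of forecast uncertainty.
diffs = np.diff(y)
drift = diffs.mean()
sigma = diffs.std(ddof=1)

h = np.arange(1, 4)                      # 3-step-ahead horizon
point = y[-1] + drift * h                # point forecasts
half_width = 1.96 * sigma * np.sqrt(h)   # approx. 95% interval half-widths

for step, f, w in zip(h, point, half_width):
    print(f"t+{step}: {f:.0f} ± {w:.0f}")
```

Note how the interval widens with the horizon (proportionally to the square root of the number of steps ahead), which is characteristic of non-stationary models; the forecast package and statsmodels produce the same structure with proper model selection and diagnostics.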
Short data series may require additional expert input regarding the future migration or the features of the processes. A direct R implementation of Bayesian Structural Time Series (BSTS) models is available in the bsts package (Scott, 2024[24]).
For Vector Auto-regressive (VAR) and Structural VAR models, the entry R package is vars (Pfaff, 2008[25]). VAR models can be useful, for example, for modelling interdependencies in migration drivers such as macroeconomic indicators (GDP, wages and salaries, unemployment rates, and employment indicators) and systemic shocks (policy shifts, the lowering or raising of migration and trade barriers, or political crises). The methodology is flexible enough to allow the inclusion of other drivers of migration in the models as long as the relevant data series are available. When it comes to Bayesian Panel VAR forecasting, the MATLAB-based BEAR toolbox (Bayesian Estimation, Analysis and Regression) is a powerful tool for academics, central bankers and policymakers (Dieppe, van Roye and Legrand, 2016[26]). For example, Barker and Bijak (2025[27]) attempt to forecast immigration, emigration and net migration rates over both long- and short-term horizons through macroeconomic models using mixed-frequency data.
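The core of a VAR(1) is a single matrix of coefficients linking each variable to the lags of all variables, which can be estimated by ordinary least squares. The sketch below simulates a toy bivariate system (think of migration inflows and an unemployment indicator, with hypothetical dynamics) and recovers the coefficient matrix; the vars package and statsmodels add lag selection, inference and structural identification on top of this.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a toy bivariate system (hypothetical dynamics):
# variable 0 ~ migration inflows, variable 1 ~ an unemployment indicator.
T = 200
A_true = np.array([[0.6, -0.2],
                   [0.1,  0.7]])
Y = np.zeros((T, 2))
for t in range(1, T):
    Y[t] = A_true @ Y[t - 1] + rng.normal(0, 0.1, 2)

# VAR(1): Y_t = A Y_{t-1} + e_t, estimated equation-by-equation by OLS.
X, Z = Y[:-1], Y[1:]
B, *_ = np.linalg.lstsq(X, Z, rcond=None)  # solves X @ B = Z
A_hat = B.T

# One-step-ahead forecast from the last observation.
forecast = A_hat @ Y[-1]
print(A_hat.round(2))
print(forecast)
```

The off-diagonal entries of the estimated matrix are what make VAR attractive for migration work: they quantify how a shock to one driver feeds into the other series at the next period.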
2. Software for causal effects estimation in the context of migration forecasting
The R package CausalImpact (Brodersen et al., 2015[28]) uses Bayesian Structural Time Series (BSTS) to estimate the causal effects of policy changes or interventions. It is ideal for evaluating the impact of migration policies such as visa reforms or border controls. A causalimpact library also exists for Python.
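The core logic of CausalImpact — fit a model on the pre-intervention period, project it forward as the counterfactual, and read the effect off the post-period gap — can be illustrated with a deliberately simple stand-in: a linear pre-trend instead of a BSTS model, on invented data with a hypothetical visa reform at month 24.

```python
import numpy as np

# Toy monthly series with a hypothetical reform at index 24 that
# lowers subsequent inflows by 200 (simulated, noise-free effect).
t = np.arange(36)
y = 1000 + 10 * t + np.where(t >= 24, -200.0, 0.0)

pre, post = t < 24, t >= 24

# Fit a trend on the pre-intervention period only ...
slope, intercept = np.polyfit(t[pre], y[pre], 1)
# ... extrapolate it as the counterfactual "no reform" path ...
counterfactual = intercept + slope * t[post]
# ... and estimate the effect as the average post-period gap.
effect = (y[post] - counterfactual).mean()
print(round(effect))
```

CausalImpact replaces the linear trend with a Bayesian state-space model informed by control series, which yields credible intervals for the effect rather than a single point estimate.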
The package Synth (Abadie, Diamond and Hainmueller, 2011[29]) in R implements the Synthetic Control Method (see Chapter 9) for policy evaluation. It allows practitioners to construct counterfactual scenarios to estimate causal impacts in non-experimental situations, as is typical in the migration policy context. An equivalent Python library is pysyncon (Fordham, 2022[30]). These packages implement the basic version of the synthetic control method. An approach that incorporates machine learning into the modelling is the Augmented Synthetic Control Method (Ben-Michael, Feller and Rothstein, 2021[31]). The corresponding software implementations are augsynth (Ben-Michael, 2025[32]) for R and again pysyncon in Python.
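The central step of the synthetic control method is finding convex weights on donor units so that the weighted combination tracks the treated unit before the intervention. The toy sketch below, with two hypothetical donor countries and a grid search over a single weight, conveys that step; Synth and pysyncon solve a constrained quadratic programme over many donors and predictors instead.

```python
import numpy as np

# Pre-intervention outcomes: one treated country and two donor countries
# (hypothetical data). The treated series is close to a roughly
# 70/30 mix of the two donors.
treated = np.array([10.0, 12.0, 11.0, 13.0])
donor1  = np.array([ 9.0, 11.5, 10.5, 12.5])
donor2  = np.array([12.0, 13.5, 12.0, 14.5])

# Search over convex combinations w*donor1 + (1-w)*donor2 for the
# weight with the best pre-period fit (mean squared error).
grid = np.linspace(0, 1, 101)
losses = [np.mean((treated - (w * donor1 + (1 - w) * donor2)) ** 2)
          for w in grid]
w_star = grid[int(np.argmin(losses))]
synthetic = w_star * donor1 + (1 - w_star) * donor2
print(w_star, np.round(synthetic, 2))
```

After the intervention, the gap between the treated unit and this synthetic series is the estimated policy effect, exactly as in the simplified causal-impact logic above but with a donor pool instead of a fitted trend.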
3. Software for long-term demographic and migration projections
Projections require tools that can simulate complex systems, capture individual behaviours, and aggregate them into macro-level insights. Methods like agent-based modelling (ABM), micro-simulation, and macro-simulation provide powerful frameworks for migration projections. These models can be informed by both previously mentioned forecasting models and experts’ evaluations in different ways. In the class of ABM, among many software options, NetLogo (Wilensky, 1999[33]) can be used to create population projection models by simulating the behaviour of individual agents within a population. This allows the user to explore how factors such as birth rates, death rates, migration, and age structure can affect future population sizes over time, essentially creating a “virtual population” to study demographic trends under different scenarios. A well-known counterpart to NetLogo in Python for ABM is Mesa (Kazil, Masad and Crooks, 2020[34]).
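The essence of an ABM is that simple individual-level rules aggregate into macro-level flows. The plain-Python sketch below (not NetLogo or Mesa code) gives each agent in region A a migration probability driven by a hypothetical wage gap with region B; all parameters and populations are invented for illustration.

```python
import random

random.seed(42)

# Each agent migrates from region A to B with a probability that
# rises with the wage gap between the regions (hypothetical rule).
def migration_prob(wage_gap, base=0.02, sensitivity=0.08):
    return min(1.0, base + sensitivity * max(0.0, wage_gap))

pop_a, pop_b = 1000, 200
for year in range(10):
    wage_gap = 1.5 - 0.001 * pop_b        # gap narrows as B fills up
    p = migration_prob(wage_gap)
    movers = sum(1 for _ in range(pop_a) if random.random() < p)
    pop_a -= movers
    pop_b += movers
    print(f"year {year}: A={pop_a}, B={pop_b}, movers={movers}")
```

Even this minimal model exhibits the feedback that makes ABM attractive for migration: the flow endogenously slows as the destination fills up and the wage gap narrows, a dynamic that is awkward to encode in aggregate time series models.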
For micro-simulation projections, MicSim (Zinn, 2024[35]) is an R package suitable for analysing the impact of policies on individual-level migration decisions over time. Another well-known project is Modgen (Model Generator) (Bélanger and Sabourin, 2017[36]), which enables the modelling of population dynamics and migration flows with detailed individual attributes. An open-source counterpart is OpenM++ (OpenM++, 2024[37]), which has both R and Python bindings.
One open-source macro-simulation software is DAPPS (Demographic Analysis and Population Projection System), developed by the U.S. Census Bureau, for which an R package is being developed.2 DAPPS is a comprehensive tool for demographic analysis and population projections. It utilises the cohort-component method and offers a modern graphical user interface for efficient workflow and enhanced data visualisation. For R packages used for UN long-term population projections (including migration), see Box 4.2.
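The cohort-component method used by DAPPS advances a population by age group, applying survival, fertility and net migration each step. The schematic sketch below uses three broad age groups and entirely hypothetical rates; real implementations work with single-year (or five-year) cohorts, sex-specific rates and age-specific fertility.

```python
import numpy as np

# Schematic cohort-component projection with three broad age groups
# (0-19, 20-64, 65+); all rates and figures below are hypothetical.
pop = np.array([2_000_000.0, 3_000_000.0, 1_000_000.0])

fertility = 0.03                        # births per person aged 20-64, per year
ageing = np.array([0.05, 0.02])         # yearly share moving up one age group
survival = np.array([0.999, 0.998, 0.97])
net_migration = np.array([5_000.0, 20_000.0, 2_000.0])

def project_one_year(pop):
    births = fertility * pop[1]
    up0 = ageing[0] * pop[0]            # ageing out of 0-19
    up1 = ageing[1] * pop[1]            # ageing out of 20-64
    new = np.empty(3)
    new[0] = survival[0] * (pop[0] - up0) + births
    new[1] = survival[1] * (pop[1] - up1 + up0)
    new[2] = survival[2] * (pop[2] + up1)
    return new + net_migration          # add net migration by age group

for _ in range(5):
    pop = project_one_year(pop)
print(np.round(pop).astype(int))
```

In this framework, migration assumptions enter as one additive component per cohort, which is precisely where the outputs of the forecasting models discussed earlier can be plugged in.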
4. Software for machine learning in the context of migration forecasting
On the machine learning side, the prophet package (Taylor and Letham, 2021[38]), developed by Facebook, handles irregularities (such as strong seasonality, changepoints and event data) as well as missing observations. This approach has also been proposed for long-term migration trend projections. The model can accept external regression variables. The prophet library also exists for Python.
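prophet's underlying idea is a decomposable model — trend plus seasonality fitted jointly. The numpy sketch below fits a linear trend plus one annual Fourier pair by least squares on a synthetic monthly series and extends it to forecast; prophet itself adds piecewise trends with automatic changepoints, more Fourier terms, holiday effects and Bayesian uncertainty, none of which is reproduced here.

```python
import numpy as np

# Synthetic monthly series: linear growth plus an annual cycle.
n = 48
t = np.arange(n)
y = 100 + 2.0 * t + 15 * np.sin(2 * np.pi * t / 12)

# Design matrix: intercept, linear trend, and one annual Fourier pair.
X = np.column_stack([
    np.ones(n), t,
    np.sin(2 * np.pi * t / 12), np.cos(2 * np.pi * t / 12),
])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Forecast the next 6 months by extending the design matrix.
t_new = np.arange(n, n + 6)
X_new = np.column_stack([
    np.ones(6), t_new,
    np.sin(2 * np.pi * t_new / 12), np.cos(2 * np.pi * t_new / 12),
])
forecast = X_new @ beta
print(np.round(forecast, 1))
```

External regressors, mentioned above, simply become extra columns in the design matrix in this formulation.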
The package glmnet (Friedman, Hastie and Tibshirani, 2010[39]), which implements Elastic Net regularisation, forms the backbone of the DynENet, or Dynamic Elastic Net, model (Carammia, Iacus and Wilkin, 2022[7]), which allows for scalable modelling with high-dimensional data, ideal for irregular migration flows such as asylum applications. The Python library scikit-learn (Pedregosa et al., 2011[40]) can be used for a custom implementation of the DynENet model.
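To convey what elastic-net regularisation contributes in this setting — automatic selection among many candidate drivers — here is a scikit-learn sketch on simulated data in which only 3 of 30 candidate predictors truly matter. This is only the variable-selection ingredient of DynENet, not the full dynamic model, and all data and tuning values are invented.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(1)

# 60 monthly observations, 30 candidate drivers (e.g. lagged flows,
# economic series, search-trend indices); only 3 actually matter.
X = rng.normal(size=(60, 30))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.8 * X[:, 2] + rng.normal(0, 0.1, 60)

# Elastic-net regularisation shrinks irrelevant coefficients to zero,
# keeping the model scalable in high dimensions.
model = ElasticNet(alpha=0.05, l1_ratio=0.9)
model.fit(X, y)

selected = np.flatnonzero(np.abs(model.coef_) > 0.1)
print(selected)
```

The sparsity this produces is what makes the approach tractable when the candidate driver set (administrative series, media and search data) is large relative to the length of the migration series.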
A popular Python library implementing modern deep learning techniques is darts (Herzen et al., 2022[41]). The library offers a unified interface for traditional and deep learning-based forecasting models, including support for Recurrent Neural Networks, LSTM (Long Short-Term Memory) networks, and ensemble models. The package TSLSTM (Paul and Yeasin, 2022[42]) implements LSTM in R. More recently, Golenvaux et al. (2020[43]) forecast yearly migration inflows to OECD member countries using an LSTM approach combined with Google Trends data.
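Whatever library is used, recurrent models such as LSTMs are trained on a series reframed as (input window → next value) pairs. The numpy sketch below shows just that preprocessing step, on a stand-in series, without the network itself; darts and TSLSTM perform this windowing internally.

```python
import numpy as np

# Reframe a series as supervised (window -> next value) pairs, the
# shape of training data that recurrent models such as LSTMs expect.
def make_windows(series, window):
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y

series = np.arange(10, dtype=float)  # stand-in for a migration series
X, y = make_windows(series, window=3)
print(X.shape, y.shape)  # (7, 3) (7,)
print(X[0], y[0])        # [0. 1. 2.] 3.0
```

The window length is a key tuning choice: it bounds how much history (e.g. how many months of Google Trends signals) the network can exploit for each prediction.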
Checklist:
Can the most suitable software for the chosen model be acquired and installed?
Forecasters usually use R or Python to develop their models. Both are free of charge.
Can my old model be migrated to more efficient and flexible software?
Model software migration needs to avoid losing continuity of existing forecasts.
References
[29] Abadie, A., A. Diamond and J. Hainmueller (2011), “Synth: An R Package for Synthetic Control Methods in Comparative Case Studies”, Journal of Statistical Software, Vol. 42/13, https://doi.org/10.18637/jss.v042.i13.
[27] Barker, E. and J. Bijak (2025), “Mixed-frequency VAR: a new approach to forecasting migration in Europe using macroeconomic data”, Data & Policy, Vol. 7, https://doi.org/10.1017/dap.2024.82.
[36] Bélanger, A. and P. Sabourin (2017), Microsimulation and Population Dynamics, Springer International Publishing, Cham, https://doi.org/10.1007/978-3-319-44663-9.
[32] Ben-Michael, E. (2025), augsynth: The Augmented Synthetic Control Method, R package version 0.2.0.
[31] Ben-Michael, E., A. Feller and J. Rothstein (2021), “The Augmented Synthetic Control Method”, Journal of the American Statistical Association, Vol. 116/536, pp. 1789-1803, https://doi.org/10.1080/01621459.2021.1929245.
[23] Bijak, J. et al. (2019), “Assessing time series models for forecasting international migration: Lessons from the United Kingdom”, Journal of Forecasting, Vol. 38/5, pp. 470-487, https://doi.org/10.1002/for.2576.
[1] Bijak, J., J. Forster and J. Hilton (2017), Quantitative assessment of asylum-related migration: A survey of methodology.
[17] Blair, C., G. Grossman and J. Weinstein (2021), “Liberal Displacement Policies Attract Forced Migrants in the Global South”, American Political Science Review, Vol. 116/1, pp. 351-358, https://doi.org/10.1017/s0003055421000848.
[6] Böhme, M., A. Gröger and T. Stöhr (2020), “Searching for a better life: Predicting international migration with online search keywords”, Journal of Development Economics, Vol. 142, p. 102347, https://doi.org/10.1016/j.jdeveco.2019.04.002.
[28] Brodersen, K. et al. (2015), “Inferring causal impact using Bayesian structural time-series models”, The Annals of Applied Statistics, Vol. 9/1, https://doi.org/10.1214/14-aoas788.
[18] Carammia, M. and S. Iacus (2025), “Migration mood and policy responsiveness: a structural analysis of public opinion, policy, and migration flows in Italy (1990–2020)”, Journal of European Public Policy, Vol. 33/1, pp. 74-104, https://doi.org/10.1080/13501763.2025.2584564.
[7] Carammia, M., S. Iacus and T. Wilkin (2022), “Forecasting asylum-related migration flows with machine learning and data at scale”, Scientific Reports, Vol. 12/1, https://doi.org/10.1038/s41598-022-05241-8.
[3] Cesare, N. et al. (2018), “Promises and Pitfalls of Using Digital Traces for Demographic Research”, Demography, Vol. 55/5, pp. 1979-1999, https://doi.org/10.1007/s13524-018-0715-2.
[8] Correia, R. (2024), “gtrendsAPI: An R wrapper for the Google Trends API”, Software Impacts, Vol. 20, p. 100634, https://doi.org/10.1016/j.simpa.2024.100634.
[16] Czaika, M., H. Bohnet and F. Zardo (2023), “Categorical and spatial interlinkages within the European migration policy mix”, European Union Politics, Vol. 25/1, pp. 173-196, https://doi.org/10.1177/14651165231209941.
[14] de Haas, H., K. Natter and S. Vezzoli (2015), “Conceptualizing and measuring migration policy change”, Comparative Migration Studies, Vol. 3/1, https://doi.org/10.1186/s40878-015-0016-5.
[26] Dieppe, A., B. van Roye and R. Legrand (2016), “The BEAR toolbox”, European Central Bank Working Paper Series, Vol. 1934.
[30] Fordham, S. (2022), pysyncon: a Python package for the Synthetic Control Method.
[39] Friedman, J., T. Hastie and R. Tibshirani (2010), “Regularization Paths for Generalized Linear Models via Coordinate Descent”, Journal of Statistical Software, Vol. 33/1, https://doi.org/10.18637/jss.v033.i01.
[43] Golenvaux, N. et al. (2020), “An LSTM approach to Forecast Migration using Google Trends”, ArXiv, https://arxiv.org/abs/2005.09902.
[13] Helbling, M. et al. (2017), “Measuring Immigration Policies: The IMPIC Database”, European Political Science, Vol. 16/1, pp. 79-98, https://doi.org/10.1057/eps.2016.4.
[41] Herzen, J. et al. (2022), “Darts: User-Friendly Modern Machine Learning for Time Series”, Journal of Machine Learning Research, Vol. 23/124, pp. 1-6.
[21] Hyndman, R. et al. (2024), forecast: Forecasting functions for time series and linear models. R package version 8.23.0.9000.
[20] Hyndman, R. and Y. Khandakar (2008), “Automatic Time Series Forecasting: The forecast Package for R”, Journal of Statistical Software, Vol. 27/3, https://doi.org/10.18637/jss.v027.i03.
[5] Iacus, S. et al. (2022), Data innovation in demography, migration and human mobility.
[34] Kazil, J., D. Masad and A. Crooks (2020), “Utilizing Python for Agent-Based Modeling: The Mesa Framework”, in Lecture Notes in Computer Science, Social, Cultural, and Behavioral Modeling, Springer International Publishing, Cham, https://doi.org/10.1007/978-3-030-61255-9_30.
[10] Minora, U. et al. (2022), “The potential of Facebook advertising data for understanding flows of people from Ukraine to the European Union”, EPJ Data Science, Vol. 11/1, https://doi.org/10.1140/epjds/s13688-022-00370-6.
[2] Nurse, S., M. Hinsch and J. Bijak (2023), “Mapping secondary data gaps for social simulation modelling: A case study of Syrian asylum migration to Europe”, Open Research Europe, Vol. 3, p. 216, https://doi.org/10.12688/openreseurope.15583.1.
[37] OpenM++ (2024), Open-source microsimulation modeling platform.
[42] Paul, D. and D. Yeasin (2022), TSLSTM: Long Short Term Memory (LSTM) Model for Time Series Forecasting. R package version 0.1.0.
[40] Pedregosa, F. et al. (2011), “Scikit-learn: Machine Learning in Python”, Journal of Machine Learning Research, Vol. 12, pp. 2825-2830.
[25] Pfaff, B. (2008), “VAR, SVAR and SVEC Models: Implementation Within R Package vars”, Journal of Statistical Software, Vol. 27/4, https://doi.org/10.18637/jss.v027.i04.
[15] Schreier, S., L. Skrabal and M. Czaika (2023), DEMIG-QuantMig Migration Policy Database.
[12] Scipioni, M. and G. Urso (2017), Migration Policy Indexes.
[24] Scott, S. (2024), bsts: Bayesian Structural Time Series. R package version 0.9.10.
[22] Seabold, S. and J. Perktold (2010), “Statsmodels: Econometric and Statistical Modeling with Python”, Proceedings of the Python in Science Conference, Proceedings of the 9th Python in Science Conference, pp. 92-96, https://doi.org/10.25080/majora-92bf1922-011.
[4] Sîrbu, A. et al. (2020), “Human migration: the big data perspective”, International Journal of Data Science and Analytics, Vol. 11/4, pp. 341-360, https://doi.org/10.1007/s41060-020-00213-5.
[19] Stimson, J. (2018), “The Dyad Ratios Algorithm for Estimating Latent Public Opinion”, Bulletin of Sociological Methodology/Bulletin de Méthodologie Sociologique, Vol. 137-138/1, pp. 201-218, https://doi.org/10.1177/0759106318761614.
[38] Taylor, S. and B. Letham (2021), prophet: Automatic Forecasting Procedure. R package version 1.0, https://CRAN.R-project.org/package=prophet.
[33] Wilensky, U. (1999), NetLogo.
[9] Zagheni, E. et al. (2014), “Inferring international and internal migration patterns from Twitter data”, Proceedings of the 23rd International Conference on World Wide Web, pp. 439-444, https://doi.org/10.1145/2567948.2576930.
[11] Zagheni, E., I. Weber and K. Gummadi (2017), “Leveraging Facebook’s Advertising Platform to Monitor Stocks of Migrants”, Population and Development Review, Vol. 43/4, pp. 721-734, https://doi.org/10.1111/padr.12102.
[35] Zinn, S. (2024), MicSim: Performing Continuous-Time Microsimulation. R package version 2.0.1.
Notes
1. Note that, as a result, the dataset is not a balanced and complete matrix of country/year points, with information on migration policy for each country/year. For a given country/year there may be multiple rows if multiple policy changes happened, followed by no data for the next year if no policy change happened. This means that the data need to be manipulated (possibly by computing cumulative measures of change against a baseline) to obtain complete matrices.