After receiving the anonymised data from the countries, the OECD carried out a series of data validation and quality control checks. Most countries complied with the codebook provided by the OECD, although there were minor discrepancies, e.g. in variable names. In these cases, the OECD renamed variables to comply with the international codebook. After this minor editing, the countries’ datasets matched the codebook, except for those of Italy, the Netherlands, and the Slovak Republic.
For Italy and the Netherlands, data were received in a coarser form, and some data cleaning and manipulation were required to bring the data in line with international standards. First, some variables were renamed to conform to the naming conventions set out in the codebook. Then, some new variables were created. Standard validation was then carried out, as for the other three countries, to ensure that all variables took plausible values. After this process, the datasets resembled those of the other countries and corresponded exactly to the standards of the codebook.
During the early analysis phase, it became apparent that the key variable on skill gaps (Q1) for the Slovak Republic had not been coded in accordance with the codebook. This issue, also described above, was identified because the response patterns deviated from those of other countries. Upon request, an updated dataset was provided, and standard validation procedures were carried out. After validation, the dataset aligned with those of the other countries and adhered fully to the codebook standards.
Also during this phase, it became apparent that some variables required further recoding before analysis could be undertaken. For instance, for the Netherlands, the key variable on skill gaps (Q1) and the variable on how often a firm innovates (QA2) both required reversing the scale. Some additional minor recoding was also applied to variables for Italy and Portugal, after which all variable labels aligned with the original codebook sent to countries.
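As an illustration of the kind of scale reversal described above, the following is a minimal sketch only. It assumes a five-point ordinal scale coded 1 to 5 and missing values stored as None; neither the scale range nor the function name reverse_scale comes from the codebook, and both are hypothetical.

```python
from typing import List, Optional


def reverse_scale(value: Optional[int], low: int = 1, high: int = 5) -> Optional[int]:
    """Reverse an ordinal scale so that `low` maps to `high` and vice versa.

    The 1-5 default range is an assumption made for illustration only.
    """
    if value is None:
        # Missing values stay missing.
        return None
    if not low <= value <= high:
        # Out-of-range codes are flagged rather than silently recoded.
        raise ValueError(f"value {value} is outside the scale {low}-{high}")
    return low + high - value


# Hypothetical raw responses to a reverse-coded variable.
q1_raw: List[Optional[int]] = [1, 2, 5, None, 4]
q1_recoded = [reverse_scale(v) for v in q1_raw]
```

Raising an error on out-of-range codes, rather than passing them through, keeps the recoding step from masking values that the validation checks are meant to catch.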
Results of the international validation
After ensuring all five datasets were in line with the codebook, the OECD conducted several checks to ensure the quality and integrity of the data. For each of the country datasets, the OECD:
Validated responses to each question and checked which set of questions countries chose to ask;
Ensured there were no duplicate responses from the same employer;
Checked total sample size, and for each variable, checked the number of valid, invalid and missing values;
Checked logical consistency between variables.
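Two of the checks listed above, duplicate detection and the count of valid, invalid and missing values per variable, could be sketched as follows. This is an illustrative sketch and not the OECD's actual procedure: the field name enterprise_id, the allowed value sets in CODEBOOK and the sample rows are all hypothetical.

```python
from collections import Counter

# Hypothetical codebook: the set of allowed codes for each variable.
CODEBOOK = {"Q1": {1, 2, 3, 4, 5}, "QA2": {1, 2, 3, 4}}


def find_duplicate_ids(rows, id_field="enterprise_id"):
    """Return the sorted identifiers that appear on more than one row."""
    counts = Counter(row[id_field] for row in rows)
    return sorted(i for i, n in counts.items() if n > 1)


def summarise_variable(rows, name, allowed):
    """Count valid, invalid and missing values for one variable."""
    summary = {"valid": 0, "invalid": 0, "missing": 0}
    for row in rows:
        value = row.get(name)
        if value is None:
            summary["missing"] += 1
        elif value in allowed:
            summary["valid"] += 1
        else:
            summary["invalid"] += 1
    return summary


# Hypothetical sample data with one duplicate identifier and one invalid code.
rows = [
    {"enterprise_id": 101, "Q1": 3, "QA2": 2},
    {"enterprise_id": 102, "Q1": 9, "QA2": None},
    {"enterprise_id": 102, "Q1": None, "QA2": 4},
]

duplicates = find_duplicate_ids(rows)
report = {
    name: summarise_variable(rows, name, allowed)
    for name, allowed in CODEBOOK.items()
}
```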
The above checks confirmed that countries asked questions as set out in Table A A.1 and that each enterprise was uniquely identified. All countries except Italy provided a variable containing a random identification number to identify each enterprise in the dataset. To rectify this for Italy, the OECD confirmed with the relevant Italian authorities that each row was indeed a different enterprise and subsequently assigned its own random identifier to each enterprise.
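One way of assigning unique random identifiers to rows that are already known to be distinct enterprises is sketched below. The source does not describe the method actually used; the function name assign_random_ids, the six-digit identifier range and the optional seed are assumptions made for illustration.

```python
import random


def assign_random_ids(n_rows, seed=None, id_range=(100_000, 1_000_000)):
    """Draw `n_rows` distinct random identifiers from `id_range`.

    Sampling without replacement guarantees that no identifier repeats.
    The six-digit default range is a hypothetical choice.
    """
    rng = random.Random(seed)
    return rng.sample(range(*id_range), n_rows)


# Hypothetical usage: five rows, each receiving its own identifier.
ids = assign_random_ids(5, seed=42)
```

Sampling without replacement avoids the need for a separate duplicate check on the new identifiers.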
As mentioned above, some countries have chosen not to report certain variables in order to preserve anonymity – the full list of missing variables for each country can be found in Table A A.3 above. Most countries have some missing geographical information. Some countries chose to aggregate some variables in order to reduce the risk of identification. For example, Portugal reported the economic activity of the enterprise at the NACE 1‑digit level rather than at the most detailed 4‑digit level. Italy, Portugal and Hungary also chose to report the size of the enterprise as a categorical variable rather than a numerical one for confidentiality reasons. These cases of non-reporting and aggregation are not expected to cause significant problems for current and future analysis, as the categories used are standard categories also used in the PIAAC household study and as it is unlikely that research will be carried out at detailed geographical or sectoral levels.
Finally, the logical consistency of variables was checked to further validate the integrity of the data. Table A A.4 details examples of the most important logical consistency checks. Most countries passed these checks. In three instances, data took on implausible or unexpected values, and the check is marked as failed. Future analysis of the data may require some minor recoding of variables beforehand. Researchers also confirmed that variables took values that were logically consistent and as defined in the codebook – that is, individual variables took plausible values and values across variables were consistent.