Annex A. PISA Rescoring project: Methods and data sample

The PISA rescoring project was coordinated between September 2024 and April 2025 by the OECD PISA Secretariat and involved 15 research teams scoring data from 16 national-language datasets (see Acknowledgements at the front of this report). Each research team was responsible for rescoring a sample of genuine student responses to the PISA 2022 Creative Thinking test from their respective national-language context. The project aimed to explore the rich data contained in raw student responses to the test across countries and economies, as well as to advance the practical application of scoring methods to open-ended creative tasks.

Items and data included in the study

Selected items
The following seven items were administered as part of the PISA 2022 Creative Thinking assessment and rescored in the PISA rescoring project:
Unit T200 – Science Fair Poster (Item 1) [Visual expression; Generate creative ideas]
Unit T300 – Illustration Titles (Item 2) [Written expression; Generate diverse ideas]
Unit T370 – 2983 (Item 1) [Written expression; Generate creative ideas]
Unit T400 – Save the Bees (Item 2) [Social problem solving; Generate creative ideas]
Unit T610 – Food Waste (Item 1) [Social problem solving; Generate diverse ideas]
Unit T690 – Save the River (Item 1) [Scientific problem solving; Generate diverse ideas]
Unit T700 – The Exhibit (Item 1) [Scientific problem solving; Generate diverse ideas]
These seven items represent diverse contexts (written expression, visual expression, social problem solving and scientific problem solving) and cognitive processes (generate diverse ideas, generate creative ideas), as described in the PISA 2022 Creative Thinking assessment framework (OECD, 2023[17]) and as presented in this report.
Data
A random sub-sample of up to 1000 students in each national-language sample was selected for this study. Due to the rotated test design used in PISA, each national-language dataset included around 300-400 unique responses per item selected for the study (or fewer, where the total national-language sample comprised fewer than 1000 students).
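To make the sampling step concrete, the sketch below draws a simple random sub-sample of up to 1000 students from one national-language dataset. It is illustrative only: the function name, the fixed seed and the list-of-IDs representation are assumptions, not the project's actual sampling code.

```python
import random

def draw_subsample(student_ids, max_n=1000, seed=0):
    """Draw a simple random sub-sample of up to max_n students
    from one national-language dataset (illustrative sketch)."""
    if len(student_ids) <= max_n:
        # The whole sample is kept when fewer than max_n students are available
        return list(student_ids)
    return random.Random(seed).sample(list(student_ids), max_n)
```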
Scoring methods used in the study
Each research team applied two scoring methods to the randomly sampled responses for all seven items included in the study. These were:
1. Holistic Judgement Method
2. Criterion-Based Method
Holistic Judgement Method
Objective of the scoring method
The Holistic Judgement Method (HJM) required judges to attribute a score between 1 (least creative) and 7 (most creative) to student responses. Judges attributed scores according to their holistic judgement of the creative quality of the response, considering elements such as its appropriateness, novelty, value and usefulness together when defining a final “holistic” score. For generate diverse ideas items, the whole response was given a single score (i.e. all ideas considered together). A score of 0 was attributed to responses that were aberrant, missing or clearly inappropriate.
The main objective of this scoring method was to establish a relative ordering of the creative quality of responses and to identify the most creative responses within each national-language context. Judges were instructed to approximate a normal distribution in their scores, meaning that scores expressed within-country comparisons of the creative quality of responses rather than absolute judgements.
A simple general rubric was provided to judges across countries to facilitate common interpretations of the HJM scoring model and to provide guidelines for the number of responses in score categories (i.e. to approximate a normal distribution).
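As an illustration of such a guideline, the sketch below computes how many responses would fall in each HJM score category (1-7) if scores followed a normal curve centred on the middle of the scale. The mean, standard deviation and half-point binning here are assumptions for illustration; the actual rubric's guidelines are not reproduced in this annex.

```python
import math

def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def target_counts(n_responses, mu=4.0, sigma=1.5):
    """Rough per-category targets for HJM scores 1-7 under a normal
    curve centred on the scale midpoint. mu, sigma and the half-point
    binning are illustrative assumptions, not the project's rubric."""
    targets = {}
    for score in range(1, 8):
        # Probability mass falling between score - 0.5 and score + 0.5
        p = normal_cdf(score + 0.5, mu, sigma) - normal_cdf(score - 0.5, mu, sigma)
        targets[score] = round(n_responses * p)
    return targets  # counts may not sum exactly to n_responses (tail mass)
```

Under these assumed parameters, target_counts(300) yields roughly 11, 33, 63, 78, 63, 33 and 11 responses for scores 1 through 7.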
Coding process
In each national-language item sample, a minimum of 100 responses were double coded (or, if the total number of responses was fewer than 100, the whole sample).
As a first step, judges were instructed to familiarise themselves with all responses in their national-language item sample to facilitate the relative ranking of responses and the attribution of score codes. Two judges then scored the same 100 responses and reviewed the inter-rater reliability metrics (see below), engaging in calibration where necessary. Once a sufficient level of inter-rater reliability was achieved between the two judges, one judge proceeded to score the entire national-language sample of responses for that item (a sketch of this workflow follows).
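The double-coding workflow can be summarised as a simple loop. This is a sketch only: the judges are represented as scoring callables, and the reliability check and calibration step are supplied by the caller as placeholders for the project's actual procedures.

```python
def double_code_then_single_code(responses, judge_a, judge_b, checks_pass, calibrate):
    """Sketch of the coding process for one national-language item sample:
    two judges double-code up to 100 responses, calibrating until the
    reliability checks pass, after which one judge scores the full sample.
    judge_a/judge_b are scoring callables; checks_pass and calibrate
    stand in for the project's actual reliability and calibration steps."""
    batch = responses[:100]  # the whole sample if fewer than 100 responses
    while True:  # loop over calibration rounds until reliability is sufficient
        scores_a = [judge_a(r) for r in batch]
        scores_b = [judge_b(r) for r in batch]
        if checks_pass(scores_a, scores_b):
            break
        calibrate()  # e.g. discuss discrepant responses, revisit the rubric
    return [judge_a(r) for r in responses]
```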
Inter-rater reliability metrics
For all HJM-scored items, the following three inter-rater reliability metrics and thresholds were used (a computation sketch follows the list):
Correlation coefficient (r): r > 0.75
Mean score difference: < 1 score point
Percentage of responses with an absolute score difference exceeding 1 score point: < 20%
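A minimal sketch of how these three checks could be computed for two judges' scores on the same double-coded responses is given below. The exact formulas used by the research teams are not specified in this annex, so the interpretation of "mean score difference" as the absolute difference between judge means, and of the third metric as a percentage of responses, are assumptions.

```python
import math

def hjm_irr_metrics(scores_a, scores_b):
    """The three HJM reliability checks for two judges' scores on the
    same double-coded responses (assumed interpretations; see lead-in)."""
    n = len(scores_a)
    mean_a = sum(scores_a) / n
    mean_b = sum(scores_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(scores_a, scores_b))
    var_a = sum((a - mean_a) ** 2 for a in scores_a)
    var_b = sum((b - mean_b) ** 2 for b in scores_b)
    return {
        "r": cov / math.sqrt(var_a * var_b),  # pass if > 0.75
        "mean_diff": abs(mean_a - mean_b),    # pass if < 1
        "pct_gt_1": 100 * sum(abs(a - b) > 1  # pass if < 20
                              for a, b in zip(scores_a, scores_b)) / n,
    }
```

A pair of judges would proceed to single coding only once all three checks pass.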
Criterion-Based Method
Objective of the scoring method
The Criterion-Based Method (CBM) required judges to attribute separate scores to students’ ideas based on the following criteria:
Appropriateness (0-2) – refers to the extent to which an idea respects the task instructions and constraints, is in the required format, and is relevant to the task content.
Originality (0-3) – refers to the extent to which a response presents a relatively uncommon, unusual, different, imaginative or innovative idea.
Value (0-3) – refers to the extent to which a response is useful and impactful for its stated purpose.
For generate diverse ideas items, each idea within the student response was scored and given separate scores for appropriateness, originality and value. The whole response was then also attributed a score for flexibility (ranging from 0-1, or 0-2, depending on the number of ideas requested in a response), which referred to the extent to which the ideas were different from each other.
The main objective of this scoring method was to attribute separate scores for the multiple elements considered in a holistic evaluation and to identify cross-cultural differences in the relative weight attributed to different evaluation criteria when awarding holistic scores.
An item-specific coding guide was provided to research teams to facilitate a common understanding of each of the evaluation criteria as contextualised for each item.
Coding process
In each national-language item sample, a minimum of 100 responses were double coded (or, if the total number of responses was fewer than 100, the whole sample).
Judges completed the CBM scoring after the HJM scoring. First, the two judges scored the first 100 responses for appropriateness and then for the remaining criteria; if an idea received an appropriateness score of 0, it was not scored further on the remaining criteria.
After scoring the 100 responses according to each criterion, the two judges reviewed the inter-rater reliability metrics (see below), engaging in calibration where necessary. Once a sufficient level of inter-rater reliability was achieved between the two judges across the different criteria, one judge proceeded to score the entire national-language sample of responses for that item.
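The sketch below illustrates the per-idea scoring structure and the gating rule described above: an idea scored 0 on appropriateness receives no further criterion scores. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class IdeaScore:
    """Criterion scores for one idea, with ranges as in the CBM rubric."""
    appropriateness: int                # 0-2
    originality: Optional[int] = None   # 0-3; None when gated out
    value: Optional[int] = None         # 0-3; None when gated out

def score_idea(appropriateness: int, originality: int, value: int) -> IdeaScore:
    """Apply the gating rule: an idea with appropriateness 0
    is not scored on the remaining criteria."""
    if appropriateness == 0:
        return IdeaScore(appropriateness=0)
    return IdeaScore(appropriateness, originality, value)
```

For generate diverse ideas items, a response-level flexibility score (0-1 or 0-2) would sit alongside the list of per-idea records.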
Inter-rater reliability metrics
For all CBM-scored items, the following three inter-rater reliability metrics and thresholds were used (see the sketch after this list):
Correlation coefficient (r): r > 0.75
Mean score difference: < 1 score point
Percentage of ideas with an absolute score difference exceeding 1 score point: < 20% (for originality and value); percentage of ideas with an absolute score difference equal to or exceeding 1 score point: < 20% (for appropriateness and flexibility)
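The only computational difference from the HJM checks is the threshold rule for the third metric: strict (> 1) for originality and value, but non-strict (≥ 1) for appropriateness and flexibility, presumably reflecting their narrower score ranges. A minimal sketch, again assuming the metric is a percentage of double-coded ideas:

```python
def pct_abs_diff(scores_a, scores_b, strict=True):
    """Percentage of double-coded ideas whose absolute score difference
    exceeds 1 point (strict=True, for originality and value) or is at
    least 1 point (strict=False, for appropriateness and flexibility).
    The pass condition in both cases is < 20% (assumed interpretation)."""
    diffs = [abs(a - b) for a, b in zip(scores_a, scores_b)]
    hits = sum(d > 1 for d in diffs) if strict else sum(d >= 1 for d in diffs)
    return 100 * hits / len(diffs)
```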
Judges
Criteria to be selected as a judge
All judges participating in the study and responsible for scoring responses to the PISA 2022 Creative Thinking items met the following criteria:
Engaged in the broader field of creativity research;
Fluent in the local culture (i.e. a native of the country, or having lived in the national-language context for a significant period leading up to and including the year of the PISA 2022 administration).
Lead judges
Each research team was led by at least one experienced national lead researcher who acted as “lead judge” for the study in their national-language context. The lead researcher participated in anchor response training exercises coordinated by the OECD and was responsible for training the other judges in their rescoring team, as well as for implementing recalibration exercises when needed.