Annex A. PISA Rescoring project: Methods and data sample

The PISA rescoring project was coordinated between September 2024 and April 2025 by the OECD PISA Secretariat and involved 15 research teams scoring data from 16 national-language datasets (see Acknowledgements at the front of this report). Each research team was responsible for rescoring a sample of genuine student responses to the PISA 2022 Creative Thinking test from their respective national-language context. The project aimed to explore the rich data contained in raw student responses to the test across countries and economies, as well as to advance the practical application of scoring methods to open-ended creative tasks.

Items and data included in the study

Selected items
The following seven items were administered as part of the PISA 2022 Creative Thinking assessment and rescored in the PISA rescoring project:
Unit T200 – Science Fair Poster (Item 1) [Visual expression; Generate creative ideas]
Unit T300 – Illustration Titles (Item 2) [Written expression; Generate diverse ideas]
Unit T370 – 2983 (Item 1) [Written expression; Generate creative ideas]
Unit T400 – Save the Bees (Item 2) [Social problem solving; Generate creative ideas]
Unit T610 – Food Waste (Item 1) [Social problem solving; Generate diverse ideas]
Unit T690 – Save the River (Item 1) [Scientific problem solving; Generate diverse ideas]
Unit T700 – The Exhibit (Item 1) [Scientific problem solving; Generate diverse ideas]
These seven items represent diverse contexts (written expression, visual expression, social problem solving and scientific problem solving) and cognitive processes (generate diverse ideas, generate creative ideas), as described in the PISA 2022 Creative Thinking assessment framework (OECD, 2023[17]) and as presented in this report.
Data
A random sub-sample of up to 1000 students in each national-language sample was selected for this study. Due to the rotated test design used in PISA, each national-language dataset included around 300-400 unique responses per item selected for the study (or fewer, where the total national-language sample comprised fewer than 1000 students).
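To make the sampling step concrete, the sketch below draws a simple random sub-sample of up to 1000 students from one national-language dataset. It is illustrative only: the function name, the fixed seed and the list-of-IDs representation are assumptions, not the project's actual sampling code.

```python
import random

def draw_subsample(student_ids, max_n=1000, seed=0):
    """Draw a simple random sub-sample of up to max_n students
    from one national-language dataset (illustrative sketch)."""
    if len(student_ids) <= max_n:
        # The whole sample is kept when fewer than max_n students are available
        return list(student_ids)
    return random.Random(seed).sample(list(student_ids), max_n)
```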
Scoring methods used in the study
Each research team applied two scoring methods to the randomly sampled responses for all seven items included in the study. These were:
1. Holistic Judgement Method
2. Criterion-Based Method
Holistic Judgement Method
Objective of the scoring method
The Holistic Judgement Method (HJM) required judges to attribute a score between 1 (least creative) and 7 (most creative) to student responses. Judges attributed scores according to their holistic judgement of the creative quality of the response, considering elements such as its appropriateness, novelty, value and usefulness together when defining a final “holistic” score. For generate diverse ideas items, the whole response was given a single score (i.e. all ideas considered together). A score of 0 was attributed to responses that were aberrant, missing or clearly inappropriate.
The main objective of this scoring method was to establish a relative ordering of the creative quality of responses and to identify the most creative responses within each national-language context. Judges were instructed to approximate a normal distribution in their scores, meaning that scores expressed within-country comparisons of the creative quality of responses rather than absolute judgements.
A simple general rubric was provided to judges across countries to facilitate common interpretations of the HJM scoring model and to provide guidelines for the number of responses in score categories (i.e. to approximate a normal distribution).
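As an illustration of such a guideline, the sketch below computes how many responses would fall in each HJM score category (1-7) if scores followed a normal curve centred on the middle of the scale. The mean, standard deviation and half-point binning here are assumptions for illustration; the actual rubric's guidelines are not reproduced in this annex.

```python
import math

def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def target_counts(n_responses, mu=4.0, sigma=1.5):
    """Rough per-category targets for HJM scores 1-7 under a normal
    curve centred on the scale midpoint. mu, sigma and the half-point
    binning are illustrative assumptions, not the project's rubric."""
    targets = {}
    for score in range(1, 8):
        # Probability mass falling between score - 0.5 and score + 0.5
        p = normal_cdf(score + 0.5, mu, sigma) - normal_cdf(score - 0.5, mu, sigma)
        targets[score] = round(n_responses * p)
    return targets  # counts may not sum exactly to n_responses (tail mass)
```

Under these assumed parameters, target_counts(300) yields roughly 11, 33, 63, 78, 63, 33 and 11 responses for scores 1 through 7.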
Coding process
In each national-language item sample, a minimum of 100 responses were double coded (or, if the total number of responses was fewer than 100, the whole sample).
As a first step, judges were instructed to familiarise themselves with all responses in their national-language item sample to facilitate the relative ranking of responses and the attribution of score codes. Two judges then scored the same 100 responses and reviewed the inter-rater reliability metrics (see below), engaging in calibration where necessary. Once a sufficient level of inter-rater reliability was achieved between the two judges, one judge proceeded to score the entire national-language sample of responses for that item (a sketch of this workflow follows).
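The double-coding workflow can be summarised as a simple loop. This is a sketch only: the judges are represented as scoring callables, and the reliability check and calibration step are supplied by the caller as placeholders for the project's actual procedures.

```python
def double_code_then_single_code(responses, judge_a, judge_b, checks_pass, calibrate):
    """Sketch of the coding process for one national-language item sample:
    two judges double-code up to 100 responses, calibrating until the
    reliability checks pass, after which one judge scores the full sample.
    judge_a/judge_b are scoring callables; checks_pass and calibrate
    stand in for the project's actual reliability and calibration steps."""
    batch = responses[:100]  # the whole sample if fewer than 100 responses
    while True:  # loop over calibration rounds until reliability is sufficient
        scores_a = [judge_a(r) for r in batch]
        scores_b = [judge_b(r) for r in batch]
        if checks_pass(scores_a, scores_b):
            break
        calibrate()  # e.g. discuss discrepant responses, revisit the rubric
    return [judge_a(r) for r in responses]
```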
Inter-rater reliability metrics
For all HJM-scored items, the following three inter-rater reliability metrics and thresholds were used (a computation sketch follows the list):
Correlation coefficient (r): r > 0.75
Mean score difference: < 1 score point
Percentage of responses with an absolute score difference exceeding 1 score point: < 20%
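A minimal sketch of how these three checks could be computed for two judges' scores on the same double-coded responses is given below. The exact formulas used by the research teams are not specified in this annex, so the interpretation of "mean score difference" as the absolute difference between judge means, and of the third metric as a percentage of responses, are assumptions.

```python
import math

def hjm_irr_metrics(scores_a, scores_b):
    """The three HJM reliability checks for two judges' scores on the
    same double-coded responses (assumed interpretations; see lead-in)."""
    n = len(scores_a)
    mean_a = sum(scores_a) / n
    mean_b = sum(scores_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(scores_a, scores_b))
    var_a = sum((a - mean_a) ** 2 for a in scores_a)
    var_b = sum((b - mean_b) ** 2 for b in scores_b)
    return {
        "r": cov / math.sqrt(var_a * var_b),  # pass if > 0.75
        "mean_diff": abs(mean_a - mean_b),    # pass if < 1
        "pct_gt_1": 100 * sum(abs(a - b) > 1  # pass if < 20
                              for a, b in zip(scores_a, scores_b)) / n,
    }
```

A pair of judges would proceed to single coding only once all three checks pass.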
Criterion-Based Method
Objective of the scoring method
The Criterion-Based Method (CBM) required judges to attribute separate scores to students’ ideas based on the following criteria:
Appropriateness (0-2) – refers to the extent to which an idea respects the task instructions and constraints, is in the required format, and is relevant to the task content.
Originality (0-3) – refers to the extent to which a response presents a relatively uncommon, unusual, different, imaginative or innovative idea.
Value (0-3) – refers to the extent to which a response is useful and impactful for its stated purpose.
For generate diverse ideas items, each idea within the student response was scored and given separate scores for appropriateness, originality and value. The whole response was then also attributed a score for flexibility (ranging from 0-1, or 0-2, depending on the number of ideas requested in a response), which referred to the extent to which the ideas were different from each other.
The main objective of this scoring method was to attribute separate scores for the multiple elements considered in a holistic evaluation and to identify cross-cultural differences in the relative weight attributed to different evaluation criteria when awarding holistic scores.
An item-specific coding guide was provided to research teams to facilitate a common understanding of each of the evaluation criteria as contextualised for each item.
Coding process
In each national-language item sample, a minimum of 100 responses were double coded (or, if the total number of responses was fewer than 100, the whole sample).
Judges completed the CBM scoring after the HJM scoring. First, the two judges scored the first 100 responses for appropriateness and then for the remaining criteria; if an idea received an appropriateness score of 0, it was not scored further on the remaining criteria.
After scoring the 100 responses according to each criterion, the two judges reviewed the inter-rater reliability metrics (see below), engaging in calibration where necessary. Once a sufficient level of inter-rater reliability was achieved between the two judges across the different criteria, one judge proceeded to score the entire national-language sample of responses for that item.
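The sketch below illustrates the per-idea scoring structure and the gating rule described above: an idea scored 0 on appropriateness receives no further criterion scores. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class IdeaScore:
    """Criterion scores for one idea, with ranges as in the CBM rubric."""
    appropriateness: int                # 0-2
    originality: Optional[int] = None   # 0-3; None when gated out
    value: Optional[int] = None         # 0-3; None when gated out

def score_idea(appropriateness: int, originality: int, value: int) -> IdeaScore:
    """Apply the gating rule: an idea with appropriateness 0
    is not scored on the remaining criteria."""
    if appropriateness == 0:
        return IdeaScore(appropriateness=0)
    return IdeaScore(appropriateness, originality, value)
```

For generate diverse ideas items, a response-level flexibility score (0-1 or 0-2) would sit alongside the list of per-idea records.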
Inter-rater reliability metrics
For all CBM-scored items, the following three inter-rater reliability metrics and thresholds were used (see the sketch after this list):
Correlation coefficient (r): r > 0.75
Mean score difference: < 1 score point
Percentage of ideas with an absolute score difference exceeding 1 score point: < 20% (for originality and value); percentage of ideas with an absolute score difference equal to or exceeding 1 score point: < 20% (for appropriateness and flexibility)
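The only computational difference from the HJM checks is the threshold rule for the third metric: strict (> 1) for originality and value, but non-strict (≥ 1) for appropriateness and flexibility, presumably reflecting their narrower score ranges. A minimal sketch, again assuming the metric is a percentage of double-coded ideas:

```python
def pct_abs_diff(scores_a, scores_b, strict=True):
    """Percentage of double-coded ideas whose absolute score difference
    exceeds 1 point (strict=True, for originality and value) or is at
    least 1 point (strict=False, for appropriateness and flexibility).
    The pass condition in both cases is < 20% (assumed interpretation)."""
    diffs = [abs(a - b) for a, b in zip(scores_a, scores_b)]
    hits = sum(d > 1 for d in diffs) if strict else sum(d >= 1 for d in diffs)
    return 100 * hits / len(diffs)
```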
Judges
Criteria to be selected as a judge
All judges participating in the study and responsible for scoring responses to the PISA 2022 Creative Thinking items met the following criteria:
Engaged in the broader field of creativity research;
Fluent in the local culture (i.e. a native of the country, or having lived in the national-language context for a significant period leading up to and including the year of the PISA 2022 administration).
Lead judges
Each research team was led by at least one experienced national lead researcher who acted as “lead judge” for the study in their national-language context. The lead researcher participated in anchor response training exercises coordinated by the OECD and was responsible for training the other judges in their rescoring team, as well as for implementing recalibration exercises when needed.