The AI and Future of Skills (AIFS) project at the OECD’s Centre for Educational Research and Innovation (CERI) presents a framework to systematically measure artificial intelligence (AI) and robotic capabilities and compare them to human skills. This chapter presents the OECD’s AI Capability Indicators, which currently provide descriptions of AI capability levels and comparisons with human skills across nine domains: Language; Social interaction; Problem solving; Creativity; Metacognition and critical thinking; Knowledge, learning and memory; Vision; Manipulation; and Robotic intelligence. The OECD is publishing these indicators in beta form to reflect its understanding that continued engagement with AI researchers and psychologists will be needed to develop firmer consensus and ensure responsiveness to rapid developments in the AI field.
Introducing the OECD AI Capability Indicators

3. OECD AI Capability Indicators
Language scale
Yvette Graham, Arthur Graesser and Swen Ribeiro
Language is an essential human ability that provides the foundation for many cognitive tasks. The extensive work in artificial intelligence (AI) related to language enables computers to understand, interpret and generate human language. This is reflected, for example, in the large language models (LLMs) that have recently become prominent. Because language is important for so many human activities, the limits of language performance can be difficult to define. Following the scope of applications addressed by AI researchers, the Language scale takes a broad approach to defining AI capability in language. It thus incorporates several critical aspects of the diverse range of tasks involving language.
What is important to measure?
The authors identified six critical dimensions to evaluate an AI system’s overall language capability. The first of these relates to the meaning encoded in the words, grammar, semantics, discourse and style of the language itself. The second and third dimensions relate to other key characteristics of language use: its modality (verbal or text, understanding or generation) and the number of languages covered. The remaining three dimensions concern the range of potential language-related tasks of a language system: its ability to access knowledge, to reason about its knowledge and to learn. Each dimension has a progression as the level of the overall scale increases – from rudimentary to sophisticated AI language capabilities.
Available evidence
Thousands of tests assess AI’s performance in language. The review sampled half of the roughly 40 types of tasks typically used to structure the field, including question-answering, translation and dialogue systems. For each area, the review considered one or more major benchmarks or “shared tasks” that are jointly developed by researchers to measure performance, and to track and stimulate progress. Current performance on these tasks often aligns with one of the levels of the scale. However, the specific level may change when shared tasks are made more difficult as AI performance improves. The scale indicates typical types of tasks that roughly align with each of the levels. The technical report provides notable examples of specific tests for each of the sampled types.
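To make the scoring of such shared tasks concrete, the sketch below applies simple exact-match accuracy, a metric commonly used in question-answering benchmarks. The questions, answers and function names are invented for illustration; real shared tasks use more elaborate metrics and normalisation rules.

```python
# Minimal sketch of exact-match scoring for a question-answering
# benchmark; the predictions and reference answers are invented.

def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(text.lower().split())

def exact_match_accuracy(predictions, references) -> float:
    """Fraction of predictions that exactly match the reference answer."""
    matches = sum(
        normalise(p) == normalise(r) for p, r in zip(predictions, references)
    )
    return matches / len(references)

predictions = ["Paris", "1968", "blue whale"]
references = ["Paris", "1969", "Blue Whale"]

print(round(exact_match_accuracy(predictions, references), 2))  # → 0.67
```

A leaderboard for a shared task would aggregate such scores across many systems and test sets, which is what allows the review to place current performance at a level of the scale.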
Current AI level
Today’s most advanced LLMs, such as the one underlying ChatGPT, are roughly at level 3. LLMs excel at accessing world knowledge but, as pre-trained, non-adaptive models, struggle with reasoning, learning, subtle language nuances and the physical modalities of communication. Unlike generative AI systems that aspire to be general-purpose systems for multiple tasks, non-generative AI systems rank lower. The latter need to be trained on a specific corpus of materials and optimised with machine-learning techniques for specific tasks. For example, Apple’s Siri assistant is a lower-level (level 2) AI system. It has significant weaknesses on the knowledge and reasoning dimensions, as well as weaker language and multilingual capabilities than ChatGPT.
Remaining challenges
Current challenges that constrain AI’s language performance include the difficulty of incorporating structured knowledge and the lack of advanced reasoning capabilities. These limit AI’s ability to assess truth, integrate logic or perform domain-specific inference. Linguistic and cultural biases in benchmarks hinder equitable representation, particularly for underrepresented languages. Current systems also lack scalable, continuously evolving learning architectures.
Table 3.1. AI language scale
| Performance level | Level description |
|---|---|
| 5 | Demonstrates nuanced language abilities, capturing style, tone and humour combined with real-time world knowledge and critical thinking in real-life environments. It can process or learn any language on the fly from small datasets. It evolves continuously through lifelong learning, adapting dynamically without the need for consolidating learning cycles. Typical tasks include automatic video description generation (i.e. video captioning) and structured reasoning tasks, which rely on critical thinking, real-time knowledge and the ability to process real-world multimodal inputs. |
| 4 | Appropriately interprets context in communication, leveraging web-scale world knowledge for complex subject analysis. It handles all modalities and supports a highly diverse set of languages, including a set of low-resource languages. Continuous learning allows major version releases without significant architectural changes. Typical tasks include dialogue, which depends on contextual understanding, web-scale knowledge and processing diverse language inputs. |
| 3 | Reliably interprets and generates correct meanings with multi-corpus knowledge, demonstrating some forms of problem solving, logic and social reasoning. It processes most modalities effectively and can support a variety of languages, even with a modest volume of training data. Iterative learning involves fine-tuning and post‑processing to improve capabilities. Typical tasks include essay scoring and text classification, reflecting multi-corpus knowledge and advanced semantic and syntactic capabilities. |
| 2 | Produces grammatically correct language, supported by single-corpus knowledge and basic problem solving and analytics. It processes two different modalities in the most well-resourced languages. Model updates may involve major architectural changes, with retraining required for improvements. Typical tasks include syntactic parsing. |
| 1 | Relies on keyword matching or highlighting for language interpretation and generation, with no world knowledge or reasoning capabilities. It processes text input only and is monolingual. Learning is limited to human-written rules, with no ability to adapt or evolve beyond initial programming. A typical task at this level would be keyword-based web search. |
Social interaction scale
Brian Scassellati, António M. Fernandes, Ana Teresa Antunes, Rebecca Ramnauth, Nicholas C. Georgiou, Miguel Faria, Haohua Dong, Regina de Brito Duarte, Joana Brito, Henrique Correia da Fonseca, Ana Vilaça Carrasco, Inês Lobo, Rui Prada, Ana Paiva
Social intelligence refers to one’s ability to perceive, interpret and appropriately respond to social cues in dynamic interpersonal contexts. Measuring AI’s social intelligence presents distinct challenges because humans tend to believe AI systems are being socially responsive even when they are not. The Social interaction scale involves an integrated set of multiple capabilities, recognising that full human social interaction involves extended, embodied interaction over time with other distinct, embodied beings. As a result, defining the full range of complexity of social interaction involves integrating aspects of language, problem solving and physical embodiment that appear in many of the other AI Capability Indicators.
What is important to measure?
To capture the full complexity of human social interaction, the Social interaction scale comprises three dimensions that describe the difficulty of the social context: embodiment, social memory and identity. It is possible to have social interactions without a body, restricted to a short moment in time, and without a distinct identity. However, a full human level of social interaction involves extended, embodied interaction over time with other embodied individuals. These three dimensions describing the social context provide a conceptual foundation for four social skill dimensions: social communication, affective skills, social perception and social problem solving.
Available evidence
Relatively few benchmarks address the full complexity of social interaction. The review of available measures therefore focused on examples of AI systems that illustrate current work at the different levels for each of the seven dimensions. It analyses several well-known AI systems with respect to all the dimensions of the scale, illustrating how the scale can be used to describe the social performance of different AI systems.
Current AI level
ChatGPT 4o sits at level 2 on the OECD’s Social interaction scale. While it has strong social memory skills, it is not embodied, has no sense of identity and has limited social perception skills.
Sony’s AIBO social robot is also a level 2 social AI system. However, its strengths and weaknesses are distinct from LLM-type social agents. It is embodied, and has basic social perception and identity, but its skills in social problem solving are more limited than those of ChatGPT.
Remaining challenges
AI lacks theory of mind, making it unable to infer social intentions. Weak social perception and reasoning cause it to misinterpret cues and execute poorly timed interruptions. Its social memory is limited, leading to disjointed conversations, while poor adaptability to norms prevents it from learning unwritten social rules. In uncertain situations, rigid decision making replaces flexible judgement, making AI struggle with ambiguous social dilemmas. Deficient emotion self-regulation leaves it offering generic reassurances instead of adjusting to the emotional weight of a situation.
Table 3.2. AI social interaction scale
| Performance level | Level description |
|---|---|
| 5 | The AI seamlessly integrates into any social environment, naturally embodying roles and adjusting in real time. It has unlimited, adaptive social memory and a fully aligned, context-aware identity. Communication is profound and nuanced, with deep emotional understanding. Social perception enables precise inference of group behaviour and intent. Social problem solving reaches mastery, allowing the AI to anticipate challenges and adapt solutions instantly for even the most complex social scenarios. AI at this level excels at complex tasks like describing scenes from another’s perspective, learning new social norms or gauging distant social openness. It leverages unlimited adaptability, deep emotional comprehension and flawless contextual alignment. |
| 4 | The AI achieves highly natural social behaviour, adapting gestures to different scenarios and managing structured social memory. It maintains a clear role in groups, handles ambiguity and nuanced communication, and understands emotional intensity and its behavioural effects. Social perception allows for motive comprehension and group role recognition. Social problem solving becomes highly versatile, using social knowledge to resolve ambiguities and anticipate outcomes, which enables fluid navigation of complex social environments. AI at this level manages nuanced tasks, such as attracting a waiter’s attention, determining student disengagement or deciding when to interrupt a group. It uses advanced capabilities like adapting gestures, understanding emotional intensity and interpreting motives. |
| 3 | The AI interprets body language, mimics group interactions and updates responses based on past experiences. It maintains a consistent yet evolving personality and can engage in basic emotional exchanges. The AI’s social perception allows it to infer social intent and interpret behavioural cues. Social problem solving becomes more sophisticated, allowing it to evaluate and implement multiple solutions to complex social challenges, reflecting deeper awareness and adaptability in diverse contexts. AI can handle tasks like co‑ordinating turn-taking at intersections or managing simple group dynamics, relying on its ability to interpret body language, infer intent and respond dynamically to moderately complex social scenarios. |
| 2 | The AI begins to adapt socially, combining simple movements to express emotions and learning from interactions for future encounters. It develops limited social memory, recalls events and adapts slightly based on experience. Communication improves with basic signal recognition, while it detects emotions through tone and context. Social perception includes simple individual distinctions, and social problem solving evolves to apply past experiences to recurring challenges, enabling basic flexibility. AI at this level can manage basic tasks like recognising individuals and applying past experiences to recurring problems, but it struggles with complex co‑ordination tasks like navigating group interactions or assessing nuanced emotional states. |
| 1 | The AI performs simple, rigid social behaviours, relying on basic movements and emotional cues. It has fixed, unchanging memory and static identity, using pre-set, scripted responses for communication. Social perception is minimal, allowing it to detect presence through basic input. Social problem solving is limited to simple, predefined tasks, making the AI capable only of constrained and basic social interactions. AI at this level can detect the presence of others and solve simple static tasks. It cannot engage in tasks like attracting a waiter’s attention or co‑ordinating turn-taking due to limited adaptability and contextual understanding. |
Problem-solving scale
Kenneth Forbus and Patrick Kyllonen
Problem solving involves integrating qualitative, quantitative and logical information through multi-step reasoning, including analysis, prediction, explanation and counterfactual thinking. Comparing AI and human problem solving is challenging because tacit knowledge and the interpretation of everyday, unstructured contexts play a crucial role in human expertise. However, such challenges are often omitted from human and AI tests of problem solving.
What is important to measure?
Four key dimensions characterise the difficulty of AI problem solving. The first two involve the type of solution required and the range of alternatives considered, which were important in distinguishing the difficulty of problem-solving tasks in the early stages of AI development. However, most of the remaining challenges in problem solving relate to the last two dimensions: the complexity of professional or expert knowledge, and the complexity of model formulation and interpretation. In particular, the hardest remaining challenges share requirements for common-sense and social reasoning to identify problems in everyday situations and to transform them into a structured form that allows progress towards a solution.
Available evidence
Several relevant tests exist on both the AI and the human side. For each level of the scale, the authors identified five to ten AI benchmarks; human assessments that could be adapted to become AI benchmarks; and example AI systems, where they exist.
Current AI level
Level 2 symbolic AI systems, such as STRIPS/PDDL planners, satisfiability (SAT) solvers and model checkers, demonstrate superhuman capacity in well-defined domains like logistics planning and model checking. LLMs can be used on problems expressed in natural language, a level 3 capability, but are brittle and closer to level 1 in the kinds of problems they can handle. Similarly, socially interactive agents can solve problems requiring basic social reasoning, which makes them level 3 in terms of communication skills but level 1 in terms of the kinds of problems handled.
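As an illustration of the well-defined, fully specified problems that such symbolic systems handle, the sketch below implements a toy STRIPS-style planner that searches over world states. The two-city logistics domain and all names are invented, and real planners use far richer representations and heuristics than this brute-force search.

```python
# Minimal sketch of STRIPS-style planning: the world is a set of facts,
# and each action has preconditions, facts it adds and facts it removes.
# The invented toy domain moves a package from city A to city B.
from collections import deque

ACTIONS = [
    # (name, preconditions, add effects, delete effects)
    ("load",   {"truck_at_A", "pkg_at_A"},      {"pkg_in_truck"}, {"pkg_at_A"}),
    ("drive",  {"truck_at_A"},                  {"truck_at_B"},   {"truck_at_A"}),
    ("unload", {"truck_at_B", "pkg_in_truck"},  {"pkg_at_B"},     {"pkg_in_truck"}),
]

def plan(initial: frozenset, goal: set):
    """Breadth-first search over states; returns a shortest plan or None."""
    frontier = deque([(initial, [])])
    seen = {initial}
    while frontier:
        state, steps = frontier.popleft()
        if goal <= state:          # all goal facts hold
            return steps
        for name, pre, add, rem in ACTIONS:
            if pre <= state:       # action is applicable
                nxt = frozenset((state - rem) | add)
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, steps + [name]))
    return None                    # goal unreachable

print(plan(frozenset({"truck_at_A", "pkg_at_A"}), {"pkg_at_B"}))
# → ['load', 'drive', 'unload']
```

Because the problem is already stated in domain abstractions, exhaustive search suffices; the hard, still-open part of problem solving is producing such a formalisation from an everyday, unstructured situation.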
Remaining challenges
Challenges include automating qualitative reasoning, addressing gaps in commonsense and tacit knowledge, and overcoming AI systems’ inflexibility in adapting to novel or open-ended scenarios. Social intelligence remains underdeveloped, with AI struggling to reason about relationships, ethics and nuanced psychological interactions. AI has made progress in mathematical reasoning. However, physical commonsense reasoning about objects through space and time remains a challenge, and tests of these capabilities continue to reveal gaps in generalisation and robustness.
Table 3.3. AI problem-solving scale
| Performance level | Level description |
|---|---|
| 5 | AI systems at this aspirational level would solve complex, multidisciplinary problems across domains like science, law, education and medicine, integrating tacit, social and technical knowledge. They would form long-term relationships, deeply understanding emotions and perspectives during live interactions. These systems would navigate ethical challenges, excel in conversational and persuasive tasks, resolve conflicts, detect nuanced issues like bullying and communicate professional knowledge effectively in accessible ways. Achieving this capability remains beyond current technological limits. An AI system at this level can identify and solve unstructured, real-world problems that involve social complexity; require solution approaches from multiple domains; and interact with other problems. |
| 4 | AI systems at this level are expected to solve everyday commonsense and some professional problems in fields like medicine, law and journalism. They engage users by building rapport, leveraging social, psychological and physical knowledge. These systems learn from past experiences, improving future performance and adaptability. They represent a step towards broader unstructured problem solving, offering capabilities that combine effective interaction, domain-specific reasoning and continuous self-improvement. AI systems at this level can interpret interactions in a complex social environment, identify problems that need to be solved and develop an approach for solving those problems. |
| 3 | AI systems at this level can handle problems described in everyday language, translating informal descriptions into structured models. They can incorporate social cognition and theory of mind reasoning, simulating human mental states and predicting intentions. They analyse interactions involving animate and non-animate dynamics, excelling in tasks like identifying emotions or intentions in conversations and making ethical decisions. These systems showcase advanced contextual understanding, allowing them to perform nuanced tasks such as moral reasoning, emotional identification and social interaction analysis. AI systems at this level can solve problems in areas like mathematics, the natural sciences, medicine or engineering, where the problem is described in everyday terms. These problems are like the questions on standardised human tests in these areas that specifically involve word problems. Other AI systems at this level can solve problems related to social and ethical reasoning where the problems are directly described. |
| 2 | AI systems at this level integrate qualitative reasoning, such as spatial or temporal relationships, with quantitative analysis to address complex challenges. These systems can envision multiple qualitative states and transitions, predicting how systems might evolve or change over time, enabling them to solve more dynamic and nuanced problems than those at level 1. AI systems at this level can solve problems in areas like mathematics, the natural sciences, medicine or engineering, where the problem is described using conventional domain abstractions. |
| 1 | AI systems at this level operate in structured domains, using precise, domain-specific terms like logical constraints, mathematical equations or simulations to solve problems. They analyse data for discrepancies, missing values or inconsistencies, and perform tasks such as planning and scheduling. In medicine, they diagnose straightforward issues based on structured data like interview responses or test results, staying within predefined parameters and narrow applications. AI systems at this level can solve structured problems in areas like mathematics, the natural sciences, medicine or engineering, where the problem is specified. These problems are like the questions on typical standardised human tests in these areas. |
Creativity scale
Giorgio Franceschelli and Mirco Musolesi
Creativity is recognised as an important human ability, often linked to both problem solving and artistic expression. Creativity is often assumed to be exclusively human and outside the limits of AI, but it is important to understand AI’s creative capabilities empirically. Given that human creativity has been defined in a hundred or more different ways, it is already difficult to measure uncontroversially in humans. In addition, AI systems typically lack the autonomy that is a key aspect of notable creativity in humans. Insights from well-known human creativity frameworks by Boden (2003[1]) and Rhodes (1961[2]) have been used to construct a Creativity scale for AI. However, machine creativity may ultimately require dimensions different from those used to describe human creativity.
What is important to measure?
The proposed scale evaluates AI creativity at the lower levels by the value, novelty, transformativity and surprise of system outputs. At the higher levels, attention shifts to the AI system’s intentionality, self-assessment and adaptability.
Available evidence
No comprehensive benchmarks exist for evaluating AI creativity. Recently, several domain-specific metrics and benchmarks have been proposed. However, they primarily focus on effectiveness and diversity, addressing only the lower levels of the scale. The initial work in developing the scale has identified a set of examples of current AI systems that illustrate the type of creativity they can produce.
Current AI level
Current AI systems can create products that are valuable (level 1) to human users. These outputs can also be novel (level 2) and surprising (level 3), qualities that are apparent in recent foundation models and diffusion models. Indeed, novelty and surprise are also found in decision-making systems such as AlphaZero,[3] which produce unexpectedly efficient strategies for a wide variety of problems.
Remaining challenges
Given their probabilistic architecture and training data (which are a collection of pre-existing human artefacts), most generative AI systems (such as LLMs) struggle to produce surprising outputs. Given their reliance on human-generated text, LLMs also seem unable to produce outputs that transform (i.e. advance) human thought. Current AI systems are also unable to replicate higher-order human capabilities like intentionality, self-assessment and adaptability to shifting environments.
Implications
Until recently, creativity was thought to be an exclusively human ability, and systematic evaluations of AI creativity are still lacking. Therefore, the authors recommend that policy makers support efforts to develop frameworks and benchmarks in this domain.
Policy makers should also promote human oversight of creative AI systems and address intellectual property disputes, particularly with regard to outputs that draw on styles or products initially developed by human artists or other AI systems.
Table 3.4. AI creativity scale
| Performance level | Level description |
|---|---|
| 5 | AI achieves intentionality, authenticity and full agency, creating transformative outputs on par with those of world‑class human creators. It autonomously determines what and when to produce, driven by its intrinsic goals, and possesses the ability to critique, reimagine and situate itself within a cultural context. Outputs transcend existing combinations, introducing entirely new aesthetics or paradigms, appreciated by humans or even other AI systems. Examples of tasks might include designing a new fashion style that dominates the fashion market; writing an internationally bestselling autobiography acclaimed by critics; or designing an innovative technology that disrupts existing markets and sets new industry standards. |
| 4 | AI incorporates process-oriented creativity, adapting its outputs to evolving domains. Through iterative and blind exploratory search, it refines results to ensure quality and appropriateness for the context. Demonstrating domain‑relevant and creativity-relevant skills, it mirrors the creativity of the general population, balancing innovation with contextual relevance. Examples of tasks might include writing a speech for a special occasion (a wedding speech, for example, could select and link key events of the newlyweds’ lives in a humorous, personal yet appropriate way); composing a letter for a newspaper reflecting on the mood of a nation after a sad event; or writing journal entries that thoughtfully recount the day’s events. |
| 3 | AI generates outputs that are valuable, novel and surprising, deviating significantly from training data and expectations. It generalises skills to new tasks, integrates ideas across domains and produces solutions that challenge traditional boundaries. In this way, it fully satisfies creativity's three pillars: value, novelty and surprise. Examples of tasks might include winning videogames by devising unexpected strategies; participating in a political debate and successfully arguing a point; or composing an installation that integrates visual art, music and interactive elements to convey a complex narrative. |
| 2 | AI moves beyond imitation to create valuable, novel solutions. These outputs differ from those directly derived from training or programming. The system explores possibilities within task constraints, meeting foundational criteria for creativity: value and novelty. This aligns with inventions that are useful and non-obvious. Examples of tasks might include painting a portrait of a contemporary head of state in the style of Dutch masters; writing a short story that blends genres, such as science fiction and historical novels; or developing videogames with levels where the players explore automatically generated cities that follow topological rules, ensuring that each level is novel. |
| 1 | AI replicates human outputs or actions to solve non-trivial tasks effectively. Its results are valuable, i.e. typical and relevant, resembling human work but without true creative properties. This foundational stage reflects mimicry as a stepping stone towards creativity, akin to cover bands or copyists. Examples of tasks might include generating a variation of a culinary recipe by sensibly substituting an ingredient given a cookbook; drawing an object with modifications to a set of examples; or creating a simple piece of music that follows a specific meter and style. |
Metacognition and critical thinking scale
José Hernández-Orallo and Kexin-Jiang Chen
Metacognition refers to a system’s capability to evaluate its own reasoning, calibrate confidence and identify relevant information in complex tasks. Measuring this capability presents unique challenges. For both humans and AI systems, it is hard to distinguish between genuine metacognitive processes and heuristics. Existing evaluation frameworks often conflate task complexity with metacognitive demand, limiting their effectiveness. The authors use the research on metacognition and critical thinking in humans to develop a corresponding scale for AI.
What is important to measure?
The proposed model comprises three core dimensions: the need for critical thinking processes to assess the strategy and monitor progress when performing a cognitive task; the system’s accuracy in assessing how likely it is to know a specific fact or solve a particular problem; and the system’s ability to identify what information is given and what is needed to solve a particular problem. These dimensions form the foundation for evaluating AI’s capability to self-monitor, adjust reasoning based on uncertainty and distinguish between essential and extraneous information. The model aims to capture both explicit reasoning strategies and implicit self-assessment mechanisms, addressing a key limitation in traditional AI benchmarks.
Available evidence
The scale was developed using a quantitative approach that describes task demand levels without referring to AI or humans. It was prototyped with three benchmarks from BIG-bench (Srivastava et al., 2022[3]) that address the dimensions used in the model. The Evaluating Information Essentiality benchmark (Papers with Code, n.d.[4]) measures how well AI identifies the information required to answer a question. The Known Unknowns benchmark evaluates AI’s ability to estimate whether a specific fact is likely to be knowable. Finally, the VitaminC Fact Verification benchmark (Schuster, Fisch and Barzilay, 2021[5]) assesses AI’s ability to reason about conflicting evidence. The approach estimated the metacognitive and critical thinking demands for each question in the benchmarks and compared current LLM performance to the estimated level. Additionally, generic benchmarks from the Holistic Evaluation of Language Models (Liang et al., 2022[6]) were used to contrast metacognitive performance with general task difficulty. This determined the sensitivity of the benchmark questions to metacognition and critical thinking.
Current AI level
State-of-the-art models such as GPT-3.5 and GPT-4 generally perform at levels 2-3 on the Metacognition and critical thinking scale. While they demonstrate basic confidence calibration and critical thinking, they struggle with more sophisticated metacognition and critical reasoning required for levels 4 and 5. Agentic systems typically perform below level 3, indicating significant limitations in AI’s ability to self-monitor and adaptively regulate its own reasoning.
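One aspect of metacognition mentioned above, confidence calibration, can be quantified with expected calibration error (ECE), a standard metric that compares a model’s stated confidence with its actual accuracy. The sketch below uses invented confidence values purely for illustration.

```python
# Minimal sketch of expected calibration error (ECE): predictions are
# grouped into equal-width confidence bins, and each bin's average
# confidence is compared with its accuracy. All data here are invented.

def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average of |accuracy - mean confidence| over bins."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences) if lo < c <= hi]
        if not idx:
            continue  # empty bin contributes nothing
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(acc - conf)
    return ece

# A model that claims 90% confidence but is right only half the time is
# overconfident, and the gap shows up directly in the ECE.
confs = [0.9, 0.9, 0.9, 0.9]
right = [1, 0, 1, 0]
print(round(expected_calibration_error(confs, right), 3))  # → 0.4
```

A perfectly calibrated system would score an ECE of zero; the over- and under-confidence noted in the "Remaining challenges" below is exactly what such a metric exposes.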
Remaining challenges
AI faces several obstacles in advancing metacognitive and critical thinking abilities. One major challenge is calibrating confidence in unfamiliar domains, leading to over- or under-confidence in responses. Poor benchmark refinement prevents accurate assessment of metacognitive skills, while the overlapping nature of cognitive processes makes it difficult to isolate metacognition from other reasoning functions.
Table 3.5. AI metacognition and critical thinking scale
| Performance level | Level description |
|---|---|
| 5 | The task involves sophisticated metacognition and critical thinking, managing complex trade-offs between goals, resources and required skills. Long-term tasks may intersect with others, requiring decisions about delegation, self‑improvement or task abandonment. Accurate self-assessment and the ability to adapt methodologies are critical for successfully navigating challenges at this level. Example: An assistant must find a file on the computer whose name refers to an eclipse or something similar and send it to Jason by e‑mail. The assistant needs to determine what to do if it cannot find the file, what level of similarity is appropriate and whether it can access the e‑mail system and send the message. |
| 4 | The task requires high-level metacognition and critical thinking, including active regulation of thought processes. Subjects face complex and ambiguous problems in unfamiliar domains, requiring careful evaluation of knowledge and confidence calibration. Relevant information may be incomplete or unclear, necessitating substantial metacognitive effort to assess and apply effectively. Example: An assistant must complete some paperwork and determine whether it has all the required attachments or needs to request them from others. |
| 3 | The task demands significant metacognition and critical thinking, involving the analysis and synthesis of both familiar and unfamiliar concepts. Subjects must critically evaluate their knowledge, make educated judgements, and integrate complex or nuanced information. Identifying relevant details involves navigating subtle connections and implications, requiring deeper cognitive flexibility and strategic problem solving. Example: A robot reaches a door that has a kind of handle it has never seen before, and it must look for information about how to use it or try different options to understand how it works. |
| 2 | The task requires moderate metacognition and critical thinking, including monitoring understanding and adjusting approaches. The subject matter is partially familiar but contains ambiguities that demand measured confidence and informed guesses. Relevant information is incomplete, requiring metacognitive effort to discern and apply key details effectively. Example: An assistant must do the weekly shopping for a customer, given a shopping list, a preferred list of supermarkets and a limited budget. The assistant will identify and resolve trade-offs (quality vs. price) and react critically to offers or unavailable products (replacing them with similar ones), drawing on its knowledge of the customer’s preferences, and will ask the customer only in case of doubt. |
| 1 | The tasks involve minimal metacognition and critical thinking, focusing on basic interpretation or recognition of information. The subject matter is familiar, straightforward or highly specialised, allowing for confident responses or quick recognition of limitations. Relevant information is simple to identify, with most details provided and requiring only minor filtering or basic logical connections. Example: A robot must cook a Vichyssoise for some lactose-intolerant guests and needs to tell the user how long it will take. The robot needs to determine whether it can adapt the recipe with a lactose-free cream substitute, obtain the ingredients and do all the cooking using tools in the kitchen. |
Knowledge, learning and memory
Christian Lebiere
Knowledge, learning and memory encompass critical processes within cognitive systems, applicable to both human and artificial intelligence. The core concepts are interrelated: knowledge represents structured information, learning involves its acquisition, and memory ensures storage and retrieval. These processes are foundational to human cognition and underpin many other abilities. Simulating the full range of human abilities in this domain has been a critical goal of AI development for decades. The scale is based on models of knowledge, learning and memory in humans that describe the key aspects of human ability.
What is important to measure?
At the most basic level, it is important to identify whether an AI system is capable of the kinds of knowledge, learning and memory seen in humans. Cognitive science distinguishes explicit declarative knowledge, which can be easily articulated and communicated, from implicit procedural knowledge, which forms the basis for different skills. Humans acquire information from a range of sources, including direct experience, observation of others, and instruction from books or videos. This learning can be passive or guided actively in pursuit of some goal. The generalisation of experience can take place through processes that are more unconscious and statistical in character or that reflect more symbolic and logical analysis. Humans have a variety of memory systems, and their memories change in strength and availability over time. These various aspects of human knowledge, learning and memory have analogues in AI systems.
Available evidence
The performance of different AI systems is currently related to the scale by analysing their design to understand what knowledge, learning and memory functions they make possible: what kinds of information can be stored, retrieved and learnt. The authors also describe a set of quantitative measures that could be developed to complement the qualitative descriptions. They look at how efficiently memories can be stored and retrieved; how successfully a system can identify and retrieve memories that are potentially relevant in a specific context; what kinds of knowledge a system can learn and how accurately it can generalise them; how well a system can carry out active learning to support its goals; and the breadth of tasks that a system can use its knowledge to carry out.
Current AI level
Current AI predominantly operates within level 3, constrained by statically trained models, statistical generalisation and dependence on extensive datasets. LLMs and related forms of generative AI typify this level. Limited efforts have been made in constrained domains to develop agents that can acquire their own knowledge (level 4), and to integrate diverse forms of knowledge, learning and memory into general architectures (level 5).
Remaining challenges
Key challenges in knowledge, learning and memory include balancing different types of knowledge, such as “how-to” skills (like riding a bike) versus factual knowledge (like remembering dates) and integrating knowledge that operates automatically with processes that reason systematically. Another hurdle is creating systems that learn quickly and effectively, as well as ensuring these systems can adapt what they have learnt to entirely new, unfamiliar scenarios. Currently, it is difficult for AI systems to combine different memory types – like immediate recall, long-term storage, personal experiences and general facts – so they work together seamlessly.
Table 3.6. AI knowledge, learning and memory scale
| Performance level | Level description |
|---|---|
| 5 | At this level, systems integrate diverse knowledge types, learning methods and memory systems for robust real‑time adaptation and reasoning. They achieve human-like cognitive flexibility and efficiency, while addressing limitations like hallucinations. Future advancements may surpass human cognition by overcoming biases and limitations. AI at this level can perform tasks requiring open-ended cognitive flexibility, such as performing scientific research, making public policy decisions and arguing legal cases. |
| 4 | At this level, AI systems learn incrementally through interaction with the world and other agents. They incorporate metacognitive awareness to focus on knowledge gaps and balance exploration with exploitation. Expanding this paradigm to open-ended, dynamic domains remains a challenge. AI at this level can perform tasks that involve operating in unknown, uncertain or changing environments, such as performing household tasks, supporting the elderly or operating in an open-floor industrial setting. |
| 3 | At this level, systems learn the semantics of information using distributed representations to extract meaning and generalise to novel situations. Advanced algorithms process massive datasets for context-sensitive understanding. While more adaptable than earlier levels, these systems require extensive resources and lack real-time learning capabilities. AI at this level can perform tasks that involve generating content, such as writing stories, creating illustrations, summarising information and computer programming. |
| 2 | This level shifts to searching loosely organised information without rigid structuring. Statistical inference connects search terms with relevant results, enabling flexibility in handling natural language and other unstructured formats. However, it struggles to generalise effectively when faced with incomplete or missing data. AI at this level can perform tasks that involve information search, such as online shopping, news gathering, travel planning and researching product reviews. |
| 1 | This foundational level involves storing and retrieving structured information through precise computational methods. Knowledge is represented in formal formats like tables and rules, with logical queries enabling accurate retrieval. While efficient for structured data, this approach struggles with implicit or ill-defined knowledge and requires significant engineering effort. AI at this level can perform tasks that involve precise record keeping, such as financial accounting, computing statistics or managing schedules. |
Vision scale
Robert B. Fisher, Anthony G. Cohn and Christopher Lochhead
Vision is a key component of human perception and provides critical input to most cognitive and physical tasks. Human vision can interpret visual scenes in their full complexity, with a wide range of visual conditions and environments. It can be used to understand a wide range of objects and scenes, both familiar and unfamiliar. The Vision scale reflects the extensive work in computer vision that has addressed hundreds of specific vision tasks in successful applications. At the same time, it highlights how the generality and flexibility of current AI vision systems fall short of full human visual ability. Computer vision encompasses a broad range of tasks, from object recognition to dynamic scene understanding and autonomous navigation.
What is important to measure?
To characterise the performance of specific computer vision applications, it is important to describe the breadth and variability of the objects or scenes they can interpret, along with their robustness to variation in the visual environment. Secondary dimensions included in the scale are the diversity of tasks performed and whether an AI system can learn through feedback. The authors identified a set of 32 different component visual capabilities that underlie the performance of different computer vision applications. These include capabilities related to detection, localisation, property description, motion analysis, geometric analysis, pattern recognition and visual learning.
Available evidence
The authors collected two types of evidence. First, 120 applications were sampled from a database of more than 600 computer vision applications to analyse their performance according to the scale. The sample was selected to focus on applications with relatively robust performance for their selected task. Second, judgements about the performance level of the underlying set of 32 component visual capabilities were collected from three sources: the authors’ review of the literature; a survey of the computer science community; and responses from ChatGPT 4o.
Current AI level
The sample of 120 applications showed half of the applications performing at level 2 but with a substantial number at levels 1 and 3. There were only three applications at level 4 and none at level 5. Similarly, evaluation of the 32 component capabilities found a third performing at level 2, a substantial number performing at levels 1 and 3, and a small number performing at level 4 or below level 1. These two sources provide converging evidence that level 3 is the highest level on the scale at which AI systems show robust performance.
Remaining challenges
The key challenges in computer vision progress include the difficulty of handling diverse, shifting real-world environments and the limited ability of current systems to reason and adapt in real time. For top-level performance, vision systems will need to evolve and learn continuously rather than relying solely on static models that typify the current state of the art.
Table 3.7. AI vision scale
| Performance level | Level description |
|---|---|
| 5 | At this peak level, systems perform tasks with the same level of performance as human vision. These systems can handle all variations that a human might encounter, including changes in lighting, perspective, shape, appearance, position and scene, both expected and new. They improve performance based on self-feedback and demonstrate the full spectrum of human visual capabilities, such as finding objects, delineating boundaries, identifying objects at both general and specific levels, estimating their positions for manipulation and understanding object interactions. These systems can learn new properties, objects and behaviours while adapting to changes in the environment. Typical tasks include complex object recognition, dynamic tracking and real-time scene understanding across varied environments, such as autonomous vehicles interacting with dynamic traffic. |
| 4 | Level 4 systems can be applied to a wide range of data types and contents, including microscopy; red, green and blue (RGB); humans; mechanical parts; and natural scenes. They cope with significant variations in lighting, shape and appearance of target objects, making subtle discriminations between similar object classes. These systems can improve performance through feedback, whether from self-assessment or external sources. They can perform many different tasks, although not all that a human can do. Their performance is close to human level in the tasks they perform, and they can integrate various tasks so the output of one can feed into another. For example, a kitchen assistant robot might need to recognise shapes, locate objects, identify manipulation points, track motions and assess the quality of results. Typical tasks include complex manipulation and analysis in dynamic environments, such as robots performing diverse kitchen tasks, monitoring assembly lines or conducting intricate quality control in manufacturing. |
| 3 | Systems at this level can be applied to several types of data and data contents, such as microscopy, RGB and natural scenes. They can handle some variation in lighting and target object appearance. These systems can perform more than one subtask and cope with known variations in data and situations. They may offer human-like performance in some domains but not fully match human capability. For instance, a high-end autonomous vehicle vision system might integrate route, road, weather and vehicle movement information along with detecting vehicles, obstacles, pedestrians and tracking their movement. However, these systems may struggle with tasks beyond their specific domain. Typical tasks include autonomous vehicle navigation, facial recognition and environment mapping for robotic systems. |
| 2 | Level 2 systems can handle variations in lighting and sensor position relative to the scene, as well as some variation in the observed domain. These systems are more flexible than level 1, able to cope with variations in speed and timing of actions, and changes in the objects within the scene. They can perform highly specialised tasks in environments with some variability, such as lane following and obstacle detection in autonomous driving, or face detection and recognition in security systems. However, they remain specialised and limited to specific tasks, requiring carefully engineered conditions for optimal performance. Typical tasks include face detection, obstacle avoidance in controlled driving environments, and specialised visual inspections in manufacturing. |
| 1 | At level 1, systems perform tasks in highly controlled environments with minimal variation. These systems can execute only one task. They typically perform it nearly perfectly but only in a tightly constrained situation. Most industrial applications, such as manufacturing inspection or postcode recognition, would fall into this category. The visual system might work well within a fixed domain of scenes and objects but struggles with any variation in the environment or objects. These systems often lack flexibility and depend highly on stable conditions. Typical tasks include basic object recognition in fixed settings, barcode scanning and quality control in manufacturing environments with well-organised materials. |
Manipulation scale
Elena R. Messina
Manipulation is one of the key human physical abilities. It involves the ability to interact with objects in the environment, which includes the physical movements themselves; the necessary perception, including tactile, visual or other sensors, to provide feedback; and cognition to plan and adjust the movements. Robotic manipulation enables a variety of tasks, ranging from basic pick-and-place operations to more sophisticated actions like handling deformable objects (e.g. folding laundry) or assembling objects in cluttered environments.
What is important to measure?
The difficulty of a manipulation task involves several factors. The task itself requires basic actions, which can involve different movements, such as grasping, fastening or in-hand manipulation. There are also the characteristics of the object being manipulated, the environment in which the task is taking place and constraints on how the task can be carried out, such as time requirements or clearances from other objects. These characteristics need to be described to gauge the difficulty of a specific manipulation task. In addition, the level of a robotic system’s manipulation capability will also be determined by its level of generalisation across these different factors: the range of basic movements, object characteristics, types of environment/conditions and task constraints it can accommodate.
Available evidence
A limited number of physical manipulation benchmarks are available. Even fewer provide leaderboards comparing the performance of multiple systems over time. The author identified a set of 11 benchmark tasks that included tasks at one or more of the lower levels on the scale. No benchmarks that comprehensively cover manipulation at levels 4 or 5 were identified.
Current AI level
Current state-of-the-art robotic systems for manipulation are at level 2. For example, the robotic arms used in manufacturing are proficient at performing specific, well-defined tasks in controlled environments but struggle when applied to more dynamic and unpredictable situations. Robots excel in many pick-and-place operations. However, they encounter difficulties when handling fragile or irregularly shaped objects in unstructured spaces or if objects or their locations have high variability.
Remaining challenges
The key bottlenecks in robotic manipulation progress include limitations in dexterity and adaptability, particularly when dealing with a wide range of object types or unpredictable environmental conditions. Additionally, systems often face challenges in real-time decision making and learning, which limits their ability to adapt to new situations on the fly. Furthermore, when both high levels of dexterity and advanced reasoning are required, current robots struggle to handle complex tasks effectively.
Table 3.8. AI manipulation scale
| Performance level | Level description |
|---|---|
| 5 | Robots at this peak level match human abilities in manipulation tasks, efficiently operating in any environment – including extremely cluttered spaces. They handle objects of diverse shapes, sizes, materials and dynamics with exceptional adaptability, including reflective surfaces, slippery textures and flexible materials. They can reposition objects in-hand swiftly, place them into complex orientations and respond to dynamic changes. They can execute tasks with precision, efficiency, robustness and adaptability equivalent to a skilled human under strict time constraints. They collaborate seamlessly with humans, understanding their own limitations and refusing tasks beyond their abilities. Typical tasks include helping dress a person or search and rescue operations. |
| 4 | Robots function in environments with significant clutter and occlusions, distinguishing target items from non-target items swiftly. They can handle both rigid and non-rigid objects, including those with moving parts. They can execute tasks requiring specific orientations or placements with increased accuracy, navigating tight spaces or obscured locations with more precision. Force-based operations requiring moderate adaptation are possible. They can generalise to varying object properties and environmental conditions but may require human confirmation. These robots can complete tasks within stringent time constraints but do not match the efficiency of a human. Typical tasks include unloading a dishwasher, force-based surface manipulation and object assembly in cluttered or dynamic environments. |
| 3 | Robots adapt to moderately cluttered environments, selecting and manipulating target objects amid distractions. They can handle a broader range of object geometries and materials previously challenging, such as reflective or low-friction surfaces. While they can reorient objects or place them in moderately challenging positions, rapid in-hand repositioning may still be a hurdle. They can perform force-based operations with instructions but without major adaptation. They can work within moderate time constraints but may not maintain efficiency with tight deadlines unless in controlled conditions. Typical tasks include reorienting irregular objects, setting a table for a meal and handling delicate materials requiring force-based manipulation. |
| 2 | Robots can work in environments with low to moderate clutter. They can accommodate objects placed randomly within a certain region and can handle a variety of object shapes and some pliable materials, but elastic or slippery materials remain challenging. They can navigate around small obstacles, but intricate manipulations such as rotating an object to a precise angle or sliding it into a tight spot remain problematic. These robots can perform tasks in controlled conditions but struggle under rapid response or unexpected changes. Typical tasks include picking up toy blocks from a table and placing them in a storage container and material handling in controlled factory environments. |
| 1 | Robots are limited to simple pick-and-place tasks within well-organised environments. They manipulate rigid objects with basic shapes that are easy to grasp, made of uniform materials that present minimal challenges for sensing or gripping. They operate best in spaces without external hindrances, following predefined paths with limited adaptability. Deviation from the expected environment typically causes operational failure. They work with wide margins of error and do not require precise positioning, focusing on gross movement. An example task would be moving boxes of cereal in a warehouse from pre-taught locations and inserting them in cases. |
Robotic intelligence scale
Cherie Ho, Rebecca Martin, Jonathan Francis and Jean Oh
Humans can move around their environments and autonomously carry out a diverse range of tasks, driven by a set of higher-level goals. This ability to act as an autonomous agent in a natural environment involves the co‑ordination of the full range of human abilities. This includes perception and physical movement but also language, social interaction and various forms of problem solving. Integrated robotic intelligence attempts to simulate this level of human autonomy, encompassing a range of tasks that require the seamless co‑ordination of sensory, motor and cognitive systems, such as autonomous navigation, human‑robot interaction and real-time decision making.
What is important to measure?
The scale for robotic intelligence comprises six dimensions. Four of these relate to the task itself: the complexity of the task; the level of abstraction in the task definition, which affects the level of problem solving necessary to figure out what to do; the complexity of the social interaction needed to carry out the task; and ethical issues, which implicitly provide a set of constraints affecting how the task can be carried out. The other two dimensions relate to the context for the task: the complexity of the environment; and the level of uncertainty involved in the environment and the way the agent interacts with the environment.
Available evidence
Few benchmarks are available to evaluate the level of integrated intelligence in current AI and robotic systems. However, the field hosts several challenges and competitions in different application areas, such as complex manufacturing, space exploration and human service. Evidence was collected by combining literature review with several workshops and interviews to construct a consensus view from current researchers.
Current AI level
Current state-of-the-art systems, such as autonomous delivery robots and industrial automation systems, perform roughly at level 2 on the scale. These systems perform well in structured environments with predefined tasks. However, they struggle with more complex, unpredictable scenarios that require adaptive decision making, creativity and social intelligence. For example, while robots can navigate pre-mapped environments, they encounter difficulties when tasked with interacting with humans or adapting to unforeseen changes in the environment.
Remaining challenges
Key challenges in robotic intelligence include limitations in adaptability, problem solving and ethical decision making. While robots can be programmed to perform specific tasks, their ability to adapt to dynamic conditions, collaborate with humans and make ethical decisions in real time remains underdeveloped. Additionally, uncertainty in real-world environments often leads to suboptimal performance, as robots may struggle to make decisions when faced with incomplete or contradictory information.
Implications
Ethically, integrated robotic systems must address concerns related to safety, fairness and accountability, especially in tasks involving human interaction or critical applications like health care and autonomous driving. Policy makers must prioritise developing standards and regulations that ensure transparency, fairness and safety in robot design and deployment. Investment in research focused on adaptive, ethical and socially responsible robotic intelligence will be essential for advancing these technologies.
Table 3.9. AI robotic intelligence scale
| Performance level | Level description |
|---|---|
| 5 | At this peak level, robots perform multiple complex tasks in unstructured settings, with highly creative goal-setting capabilities. They can refine ill-defined task specifications. These robots can adapt to dynamic conditions, learn from their experiences and generalise across a wide range of tasks and environments. They demonstrate advanced reasoning capabilities, common-sense reasoning and highly skilled social intelligence. Robots at this level understand their limitations and can make ethical decisions, refusing to perform tasks that conflict with legal or moral guidelines. Typical tasks include home-assistance robots for people with disabilities, robots performing ethical decision making and high-performance autonomous driving in diverse and dynamic environments. |
| 4 | Robots at this level execute multiple tasks with varying degrees of complexity. They can adapt to dynamic conditions and adjust their behaviour based on changing environments. They understand their limitations and use feedback to make improvements. Tasks in this category involve long-horizon, complex objectives with contextual dependencies. While the robots can handle uncertainty and make decisions in uncertain environments, their solutions may not always be as efficient or effective as those found by humans. Typical tasks include cooking robots selecting ingredients based on availability, autonomous wheelchairs navigating obstacles and autonomous aerial navigation near airports. |
| 3 | Level 3 robots can execute medium-horizon, multi-step tasks that require some level of flexibility. They can work in environments with moderate variability and handle tasks that involve several loosely defined subtasks. These robots can collaborate with humans, adapt to moderate levels of uncertainty and can handle dynamic changes such as changes in lighting, weather or unknown object types. They can perform tasks with multiple solutions but may struggle with more unpredictable or dynamic environments. Typical tasks include hospital robots handling both transport and cleaning tasks, robots assisting with furniture assembly and robot cinematographers autonomously filming based on learnt preferences. |
| 2 | Robots in this category execute predefined tasks in semi-structured environments with some variability. They handle low to moderate uncertainty, such as changes in object placement or the environment’s layout. Tasks typically have well-defined success metrics and robots operate under minimal human interaction. They can execute simple, multi‑functional tasks but are limited by their inability to handle more complex or unforeseen changes. Typical tasks include medical transport robots, material-handling robots in factories and agricultural robots for fruit picking. |
| 1 | Level 1 robots perform simple, repetitive tasks within highly structured and controlled environments. They work in static, deterministic settings where the environment is fully known and predictable. These robots follow pre-specified instructions without the ability to make adaptive decisions or handle unforeseen circumstances. They do not interact with humans and typically cannot handle even small changes to their environment. Typical tasks include basic automated assembly in factories, robotic vacuum cleaners and object sorting systems in logistics operations. |