J. Byun
Ought, United States
A. Stuhlmüller
Ought, United States
How will machine learning change research within the next decade? Large language models are a machine-learning technology that has shown promise for many reasoning tasks, including question answering, summarisation and programming. This essay outlines the experience of building Elicit, an artificial intelligence (AI) research assistant that uses language models to help researchers search, summarise and understand the scientific literature.
On 11 June 2020, OpenAI released GPT-3, a language model trained on hundreds of billions of words on the Internet. Without task-specific training, the model completed many tasks, including translation, question answering, using a novel word in a sentence and performing three-digit arithmetic. It was the largest model released at that time, and the world exploded with hobbyists’ demos of GPT-3 writing code, essays and more. Since then, over 300 applications have been built on top of GPT-3, using the pretrained model for customer support, storytelling, software engineering and ad copywriting.
It is still early, but language models may become among the most transformative technologies of our time, for the following reasons:
Language models promise to automate simple “intuitive” natural language tasks, including tasks that require knowledge of the world and basic reasoning. They would do this in the same way that early computers automated simple rule-based information processing tasks. See, for example, Austin et al. (2021) and Alex et al. (2021).
Most improvements have come from scaling up existing models by increasing dataset sizes, model parameters and training compute, not architectural innovations. This makes it easier to predict that they will continue to improve (Henighan et al., 2020; Kaplan et al., 2020).
In just a year, multiple providers of pretrained language models have emerged, including Cohere (2022), AI21 (2022) and the open-source effort EleutherAI (2022). This suggests that pretrained language models may become commoditised.
As of early 2022, the impacts of language models on society are unclear. There are no guarantees that language models will help substantially with research, which requires deep domain expertise and careful assessment of arguments and evidence.
This essay describes what today’s models can do, explains how Ought has built Elicit using these models and sets out a vision of how researchers might use AI assistants in the future.
The language models discussed here (generative language models) are text predictors. Given a text prefix, they try to produce the most plausible completion, calculating a probability distribution over the possible completions. For example, given the prefix “The dog chased the”, GPT-3 assigns a 12% probability to the next word being “cat”, 6% to “man”, 5% to “car”, 4% to “ball”, and so on.
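To make the text-prediction framing concrete, the sketch below represents the example distribution above as a table of word probabilities and shows two common ways of turning it into output: picking the most likely word, or sampling in proportion to probability. The numbers are the illustrative figures quoted above, not real model output.

```python
# A minimal sketch of next-word prediction as a probability distribution. The
# prefix and the probabilities are the illustrative figures quoted above, not
# real model output; the remaining mass stands for every other word in the
# vocabulary.
import random

prefix = "The dog chased the"
next_word_probabilities = {
    "cat": 0.12,
    "man": 0.06,
    "car": 0.05,
    "ball": 0.04,
    "<any other word>": 0.73,
}

# Greedy decoding picks the single most likely word; sampling draws a word in
# proportion to its probability.
greedy = max(next_word_probabilities, key=next_word_probabilities.get)
sampled = random.choices(
    list(next_word_probabilities), weights=next_word_probabilities.values(), k=1
)[0]
print(f"{prefix} ... greedy: {greedy!r}, sampled: {sampled!r}")
```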
The largest models are trained on web crawl data, typically Common Crawl, a corpus of more than a trillion characters. Training proceeds a few characters at a time: given a segment of text from the dataset, the model predicts the characters that follow. If it predicts incorrectly, the model updates its parameters to make the correct characters more likely next time.
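The sketch below illustrates this next-token objective with a toy character-level model in PyTorch. The tiny model and single-sentence "corpus" are placeholders chosen only to make the example self-contained; real models are transformers trained on web-scale data.

```python
# A minimal sketch (in PyTorch) of the next-token training objective described
# above: given text from the corpus, predict the characters that follow and
# update the parameters when the prediction is wrong. The tiny model and
# single-sentence "corpus" are placeholders; real models are transformers
# trained on web-scale data.
import torch
import torch.nn as nn

text = "the dog chased the cat"
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}
data = torch.tensor([stoi[ch] for ch in text])

class TinyCharModel(nn.Module):
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):
        return self.head(self.embed(tokens))  # logits over the next character

model = TinyCharModel(len(vocab))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    inputs, targets = data[:-1], data[1:]  # each character predicts the next one
    logits = model(inputs)
    loss = loss_fn(logits, targets)        # penalise incorrect predictions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final training loss: {loss.item():.3f}")
```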
In one of the most surprising lessons from language models, many tasks can be framed as text prediction, including summarisation, question answering, writing computer code and text-based classification. Consider the task of recalling a word given a description of that word. In the example below, the system is shown two phrases, each followed by a meaningful completion:
A quotient of two quantities: Ratio
Freely exchangeable or replaceable: Fungible
The language model picks up that the words coming after each colon (“Ratio” and “Fungible”) are words defined by the phrases preceding the colon (“A quotient of two quantities” and “Freely exchangeable or replaceable”). Thus, the language model should suggest a completion such as “Catalyst” when next shown a phrase such as the following:
A person or thing that precipitates an event or change.
The previous generation of language models (GPT-2) has 1.5 billion learnt parameters. For the example above, GPT-2 does not pick up on the definition-word pattern. It does not complete the third sentence successfully. Instead, GPT-2 predicts nonsensical completions to the prompt text, such as “A firestorm later”. The GPT-3 generation (175 billion parameters) correctly completes the text with the word “Catalyst”. This example illustrates how the behaviour of language models changes qualitatively as models get larger.
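The example above is a case of few-shot prompting: the demonstrations and the new phrase are concatenated into a single text prefix for the model to complete. The sketch below shows one way such a prompt can be assembled; the commented-out `complete` call is a hypothetical stand-in for a request to a large language model, not a specific API.

```python
# A minimal sketch of how the definition-to-word examples are assembled into a
# single few-shot prompt for the model to complete. The commented-out
# `complete` call is a hypothetical stand-in for a request to a large language
# model, not a specific API.
few_shot_examples = [
    ("A quotient of two quantities", "Ratio"),
    ("Freely exchangeable or replaceable", "Fungible"),
]
query = "A person or thing that precipitates an event or change"

prompt = "\n".join(f"{definition}: {word}" for definition, word in few_shot_examples)
prompt += f"\n{query}:"
print(prompt)

# A sufficiently large model is expected to continue the pattern with a word
# such as "Catalyst"; smaller models often fail to pick up the pattern.
# completion = complete(prompt, max_tokens=1)
```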
So far, language model performance (measured by error on a test set) has improved smoothly as a function of the computational power used, dataset size and number of parameters, provided in each case that the other two resources are not the bottleneck (Kaplan et al., 2020). This scaling law, as it is known, together with the observed qualitative changes as model performance improves, suggests that language model capabilities will continue to improve.
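The sketch below illustrates the power-law shape of such a scaling curve as a function of parameter count. The functional form follows Kaplan et al. (2020), but the constants are illustrative assumptions of roughly the right order of magnitude, not the paper’s fitted values.

```python
# A minimal sketch of a power-law scaling curve of the kind described in
# Kaplan et al. (2020): test loss falls smoothly as parameter count grows,
# assuming data and compute are not the bottleneck. The constants below are
# illustrative assumptions of roughly the right order of magnitude, not the
# paper's fitted values.
def predicted_loss(n_parameters, n_critical=8.8e13, alpha=0.076):
    """Predicted test loss as a power law in the number of parameters."""
    return (n_critical / n_parameters) ** alpha

for n in [1.5e9, 175e9, 1e12]:  # GPT-2 scale, GPT-3 scale, a hypothetical larger model
    print(f"{n:.1e} parameters -> predicted loss {predicted_loss(n):.2f}")
```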
Elicit is a research assistant that uses language models – including GPT-3 – to automate research workflows. As of this writing, it is the only research assistant using large pretrained language models like GPT-3, and the only research assistant that can flexibly perform many research tasks. Researchers today primarily use Elicit for literature review (Figure 1).
Researchers can ask Elicit a question, such as “What is the impact of creatine on cognition?” Elicit returns answers and relevant academic literature. Elicit then helps researchers explore the results by surfacing key information from the papers. For example, Elicit identifies whether a paper is a randomised controlled trial, review or systematic review. Elicit can extract information about the population, intervention and outcome studied. Researchers can even ask their own questions about the returned papers, for real-time extraction and text processing. Researchers can easily expand answers to see details about a paper and which parts of the paper Elicit used to generate its answers.
The researcher can also select particular results from a paper and Elicit will then show more papers like the selected results. Elicit accomplishes this by traversing the citation graph of the selected papers, both forwards and backwards. It looks at all the references of the selected papers, and all later papers that cited the selected papers, to find additional results. This allows the user to guide Elicit with feedback, demonstrating how AI research assistants become more effective with human feedback. Many researchers effectively run a manual and time-intensive version of this process today when they search for literature. They might start with a query in Google Scholar, open the first few papers, skim them, find interesting references and follow the citation trail. This approach quickly leads to a ballooning of possible papers to read and research directions to follow. Elicit replicates this manual process but runs it faster and more systematically.
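The sketch below shows the core of this expansion step: collecting papers one citation hop away from the user’s selection, both backwards (references) and forwards (citing papers). The `references` and `cited_by` lookups are hypothetical placeholders for queries against a scholarly database such as Semantic Scholar.

```python
# A minimal sketch of the "show more like this" expansion: collect papers one
# citation hop away from the selected results, both backwards (their
# references) and forwards (later papers that cite them). The `references` and
# `cited_by` lookups are hypothetical placeholders for queries against a
# scholarly database such as Semantic Scholar.
def expand_by_citations(selected_ids, references, cited_by):
    """Return candidate paper ids one citation hop away from the selection."""
    candidates = set()
    for paper_id in selected_ids:
        candidates.update(references.get(paper_id, []))  # backwards in the citation graph
        candidates.update(cited_by.get(paper_id, []))    # forwards in the citation graph
    return candidates - set(selected_ids)

# Usage with a toy citation graph:
references = {"paperA": ["paperB", "paperC"]}
cited_by = {"paperA": ["paperD"]}
print(expand_by_citations({"paperA"}, references, cited_by))
```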
The Elicit literature review workflow described above demonstrates how language models can be used for much more than text generation. It also shows how AI systems can be designed compositionally to give users more control and oversight over the AI system’s work.
When a researcher conducts a literature review, their process involves subtasks such as searching, summarising or rephrasing, classifying, sorting, extracting and clustering information. Elicit trains language models to perform each of these subtasks, then builds infrastructure to string them together and automate a more complex end-to-end workflow.
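The sketch below illustrates this compositional structure. Each subtask is a stub standing in for a trained model or an external service; the function names and return values are hypothetical, not Elicit’s internal interfaces.

```python
# A minimal sketch of composing narrow subtasks into an end-to-end literature
# review workflow. Each step is a stub standing in for a trained model or an
# external service; the function names and return values are hypothetical, not
# Elicit's internal interfaces.
def search(question):
    """Semantic search over a paper database (stub)."""
    return [{"title": "Example paper", "abstract": "..."}]

def summarise(question, paper):
    """One-sentence answer drawn from the abstract (stub)."""
    return f"Summary of {paper['title']} with respect to: {question}"

def classify(paper):
    """Study-type labels, e.g. randomised controlled trial or review (stub)."""
    return {"randomised_controlled_trial": False}

def extract(paper):
    """Key information such as population, intervention and outcome (stub)."""
    return {"population": None, "intervention": None, "outcome": None}

def literature_review(question):
    results = []
    for paper in search(question):
        results.append({
            "paper": paper,
            "answer": summarise(question, paper),
            "labels": classify(paper),
            "details": extract(paper),
        })
    return results

print(literature_review("What is the impact of creatine on cognition?"))
```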
Elicit applies language models’ understanding of semantic associations to find papers relevant to the user’s query in scholarly databases like Semantic Scholar (2022). Semantic search enables researchers to find publications that help answer their questions even if they do not use the researcher’s exact words.
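A minimal sketch of the idea is shown below: the query and each abstract are mapped to vectors and papers are ranked by similarity rather than exact keyword matches. The `embed` function here is a crude word-count stand-in so the example runs; a real system would use a learnt text-embedding model that scores related wordings as similar even when no keywords overlap.

```python
# A minimal sketch of semantic search: the query and each abstract are mapped
# to vectors and ranked by cosine similarity. The `embed` function below is a
# crude word-count stand-in so the example runs; a real system would use a
# learnt text-embedding model that scores related wordings as similar even
# when no keywords overlap.
import math
from collections import Counter

def embed(text):
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def semantic_search(query, abstracts):
    query_vector = embed(query)
    return sorted(abstracts, key=lambda text: cosine_similarity(query_vector, embed(text)), reverse=True)

abstracts = [
    "Creatine supplementation and working memory performance in adults",
    "Soil composition in coastal wetlands",
]
print(semantic_search("impact of creatine on cognition", abstracts)[0])
```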
Elicit reviews the abstracts of the papers it has found and does its best to answer the researcher’s original question in a one-sentence summary. Often, this summary will be more concise and relevant than any one sentence in the abstract.
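One plausible way to implement this step is to prompt a language model with the abstract and the question and ask for a single-sentence answer, as in the sketch below. The prompt wording is illustrative and the `complete` call is a hypothetical model request, not Elicit’s actual prompt or API.

```python
# A minimal sketch of how such a one-sentence answer might be requested from a
# language model. The prompt wording is illustrative and the `complete` call is
# a hypothetical model request, not Elicit's actual prompt or API.
def build_summary_prompt(question, abstract):
    return (
        f"Abstract: {abstract}\n"
        f"Question: {question}\n"
        "Answer the question in one sentence, using only the abstract:"
    )

prompt = build_summary_prompt(
    "What is the impact of creatine on cognition?",
    "We tested creatine supplementation in 45 adults and observed improved working memory...",
)
print(prompt)
# answer = complete(prompt)  # hypothetical call to a language model
```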
Elicit uses GPT-3 to identify whether the abstract answers “Yes” or “No” to a user’s question (if the question is a yes/no question). Elicit uses another model to identify which papers are randomised controlled trials (Robot Reviewer, 2022).
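A sketch of one way such yes/no classification can be implemented is shown below: the model is asked to continue the prompt with “Yes” or “No”, and the probabilities it assigns to the two answers are compared. The `next_token_probabilities` function is a hypothetical placeholder for a model call; this is one plausible approach, not necessarily the one Elicit uses.

```python
# A minimal sketch of one way yes/no classification can be implemented with a
# completion model: ask it to continue with "Yes" or "No" and compare the
# probabilities assigned to the two answers. `next_token_probabilities` is a
# hypothetical placeholder for a model call; this is one plausible approach,
# not necessarily the one Elicit uses.
def answers_yes(question, abstract, next_token_probabilities):
    prompt = (
        f"Abstract: {abstract}\n"
        f"Question: {question}\n"
        "Answer (Yes or No):"
    )
    probabilities = next_token_probabilities(prompt)
    return probabilities.get("Yes", 0.0) > probabilities.get("No", 0.0)

# Usage with a stubbed model:
stub_model = lambda prompt: {"Yes": 0.7, "No": 0.2}
print(answers_yes("Does creatine improve cognition?", "...", stub_model))
```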
Elicit automatically extracts key information from the abstract, such as sample population, study location, intervention tested and outcome measured.
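The sketch below shows one simple way such extraction could work: prompting the model once per field and collecting the answers into a structured record. The field names come from the text above; the `complete` function is again a hypothetical model call, and a real system might extract all fields in a single pass.

```python
# A minimal sketch of key-information extraction: prompt the model once per
# field and collect the answers into a structured record. The field names come
# from the text above; the `complete` function is a hypothetical model call,
# and a real system might extract all fields in a single pass instead.
FIELDS = ["population", "study location", "intervention", "outcome measured"]

def extract_fields(abstract, complete):
    record = {}
    for field in FIELDS:
        prompt = f"Abstract: {abstract}\nWhat is the {field} of this study? Answer briefly:"
        record[field] = complete(prompt)
    return record

# Usage with a stubbed model:
print(extract_fields("We tested creatine supplementation in 45 adults...", lambda prompt: "(model answer)"))
```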
Search, summarisation, rephrasing and classification are also available in Elicit as separate, stand-alone tasks, and users can create their own tasks. The literature review workflow is Elicit’s most advanced capability, as it joins these tasks together to produce overall, research-backed answers to the researcher’s question.
Compared with the results Elicit returns, existing research tools are clearly not designed to direct the researcher quickly and systematically to research-backed answers (Figure 2).
Searches using Google Scholar (2022) often return snippets and sentence fragments that are difficult to understand. Google Scholar focuses on returning papers based on relevant keywords instead of answers. The results also require the researcher to review multiple pages and abstracts before knowing whether the papers even address their questions (Figure 3).
Semantic Scholar is similarly designed to look up papers – the search engine is built around identifying papers given the words in a title. It is not designed to answer questions (Figure 3).
Google tries harder to answer the user’s question but sometimes uses less credible sources. The results answer something closer to “What does the Internet (or advertiser) think?” rather than “What does science know about this?”
Directly prompting GPT-3 returns a coherent answer, but there is no way for the researcher to gauge its legitimacy. GPT-3 sometimes makes up information, which is a serious problem for language models (Lin, Hilton and Evans, 2021).
In sum, current tools either make researchers do too much work (Google Scholar, Semantic Scholar) or generate answers without helping the researcher understand, trust or contextualise them, either because there are no sources (GPT-3) or because the results are assembled in a relatively unsystematic fashion (Google). Tools are needed somewhere in the middle: tools that give researchers what they want as quickly as possible, while remaining customisable enough to let users control how the results are evaluated.
Language models today are far from automating research. However, as discussed earlier, based on trends in scaling compute, their performance is expected to continue improving. This section discusses what language models might look like on a ten-year horizon and what this may enable researchers to do. It lays out possible benefits and risks so that policy makers can help direct developments towards the benefits and away from the risks.
In the future, researchers might spin up a “laboratory” of their own AI research assistants, each specialising in different tasks. Some of these research assistants will represent the researcher and the researcher’s specific preferences about things like which questions to work on and how to phrase conclusions. Already, researchers are fine-tuning language models on their notes (Kirchner, 2021). Contrary to its portrayals in Hollywood, AI may not be a discrete entity with an independent identity (like “Samantha” from the movie Her) but rather highly bespoke, amplified extensions of ourselves.
Some of these assistants will have less expertise than the researcher. They will do work that researchers today might delegate to contractors or interns, like extracting references and metadata from papers (as shown in Figure 1), scraping information from websites or labelling text-based datasets.
Some assistants will have more expertise than the researcher. They might recursively simplify an explanation of the limitations of superconducting electronics, for example, as a professor might for a student. They might help a researcher evaluate the trustworthiness of findings by aggregating the heuristics of many experts and applying them across all papers. Or they might review more arguments and pieces of evidence than researchers could on their own.
Some assistants will help the researcher think about effective delegation strategies, sub-delegating tasks to other AI assistants. Some will help the researcher evaluate the work of these other assistants. At each step, the assistants will incorporate feedback from the researcher on process and outcomes.
This compositional sub-delegation infrastructure would allow the researcher to zoom into any sub-task and troubleshoot, using assistants for help if needed. These interactions could look like workflow management tools, unstructured chat-like interactions or hybrids. Regardless of the exact interface, researchers would ideally stay in the architect’s seat, overseeing the work to ensure it is aligned with their intent.
Language models can transform research in three ways: increasing productivity through time savings, enabling qualitatively new work and making research accessible to non-experts.
First, language models can save researchers time. Staying on top of academic literature is already difficult. It will only get harder over the next ten years without AI tools to support researchers. The rate of new publications per year is growing exponentially in some disciplines. Some studies suggest that researchers have already surpassed the human limit for reading publications (Tenopir et al., 2015).
Many researchers have horror stories about finding the most important work they needed a year into their research. The literature review process today depends on using the right keywords. It may take hours or days before a researcher finds the exact phrase used in another domain that unlocks the most relevant literature.
In the future, language model research assistants may help researchers do the same amount of work in less time by:
suggesting what to search for given the researcher’s background
changing search from being keyword-based to semantic, making relevant literature with different wording easy to find
decomposing papers into units that are easier to parse (e.g. claims, evidence), and searching over those units (Chan, 2021)
summarising parts of papers given a researcher’s background, making search results easier to understand.
Saving time lets researchers do more, pushing out the frontier of science. For the same project, researchers can canvass more research. This expanded view will allow them to integrate perspectives from different disciplines and ensure they have been comprehensive.
As researchers do more, new subfields of science can emerge. Thinking about these technologies in the context of a possible decline in research productivity, it is essential to remember how much more scientific knowledge awaits discovery. Society has progressed from praying for rain to predicting rain likelihoods and quantities in 60-minute increments worldwide. What similar transformations remain in behavioural economics, neuroscience and many other domains?
The ability to apply high-quality automated reasoning to large amounts of text, and more generally at large scale, will likely catalyse fundamentally different research. This will be similar to how computers have given rise to new research fields (e.g. computer science, machine learning and biological modelling).
Bibliometric analysis may get easier over the next few years until text is as easy to analyse as numbers are today. Answering questions about research impact or productivity may not be limited to publication count analysis or well-resourced natural language processing teams. Instead, it may be done by armchair researchers and in much more depth, e.g. by conducting semi-automated reviews of research quality or impact.
Survey and interview methodology might fundamentally change. Instead of sending static questionnaires to survey participants, language models will enable dynamic question generation customised to the individual recipient and the answers already received.
When research becomes easier for researchers, it also becomes easier for research stakeholders. Better tools lower the barrier to being informed about high-quality research insight, enabling the public, industry leaders and policy makers to incorporate more research insights into their work and lives. In a future world, consuming high-quality insights could be not much harder than consuming clickbait and disinformation. It could take policy makers only minutes to comprehend the expert research they need to make mission-critical decisions.
Transformative technology is necessarily unpredictable. Language models may transform the world for the better, but they could also bring risks. This section explores some possibilities to help policy makers prepare.
Experts have mixed opinions on whether, when and to what extent language models will go beyond shallow association-based text completion and succeed at tasks that require substantial reasoning. Language models might become good enough to be widely used to speed up content generation but not good enough to evaluate arguments and evidence well. In that case, the publish-or-perish dynamics of academia may reward researchers who (ab)use language models to publish low-quality content. This would create a disadvantage for researchers who take more time to publish higher quality research. More broadly, language models might favour certain types of research over others. The scientific community will need to carefully monitor and respond to such dynamics.
Language models are trained on text on the Internet by (to date) companies mostly headquartered in English-speaking countries. They therefore demonstrate English- and Western-centric biases (May et al., 2019; Nangia et al., 2020). They also know more about famous topics and people. Without measures that let users control this bias, these language models may exacerbate a “rich get richer” effect. More generally, broad adoption of language models requires infrastructure that enables users to understand and control what the models do and why.
In a world where language models become powerful, there will be (and already are) concerns about misuse. For example, language models might make it easier to generate and spread false information. Language model-based tools for researchers may accelerate research on topics that come with risks, such as bioengineering and cybersecurity. Such concerns are not specific to language models but relate to scientific progress broadly.1 The best way to mitigate this risk is to direct these technologies towards reliably beneficial applications and to use them to assist people tasked with monitoring misuse and managing spurious information.
J.C.R. Licklider, one of the fathers of the Internet, was among the first to bemoan declining research productivity. In the spring and summer of 1957, he was struggling with his own literature review, yet was unable to find a rigorous study of the problem, despite copious research on potentially related topics. He cast himself as a subject and, to his dismay, found the following:
85 per cent of my "thinking" time was spent getting into a position to think, to make a decision, to learn something I needed to know. Much more time went into finding or obtaining information than into digesting it.... My "thinking" time was devoted mainly to activities that were essentially clerical or mechanical: searching, calculating, plotting, transforming, determining the logical or dynamic consequences of a set of assumptions or hypotheses, preparing the way for a decision or an insight. Moreover, my choices of what to attempt and what not to attempt were determined to an embarrassingly great extent by considerations of clerical feasibility, not intellectual capability (Licklider, 1960).
In his essay “Man-computer symbiosis” (Licklider, 1960), he imagined a future where novel technologies would allow us to “think as no human brain has ever thought and process data in a way not approached by the information-handling machines we know today.”
In the six decades since, Licklider’s vision of networked computers that transform libraries has been realised. Software and digital tools make it easier to search, calculate, plot and transform data. However, the process of preparing to think has also become harder as the amount of research required to work at the knowledge frontier has exploded. Computers have yet to help figure out what tasks to solve. In Licklider’s vision:
They will help not just with foreseen problems but enable people to think through unforeseen problems through an intuitively guided trial-and-error procedure in which the computer cooperated, turning up flaws in the reasoning or revealing unexpected turns in the solution...
The human-computer interface will not be a set of precisely defined steps to take; rather, as we do with other people, we will identify an incentive or motivation and supply a criterion by which the human executor of the instructions will know when he has accomplished his task (Licklider, 1960).
Perhaps it is fitting now, on the eve of another period of transformative technological change, to dust off these visions of human-computer symbiosis.
AI21 Labs (2022), “Announcing AI21 Studio and Jurassic-1 language models”, AI21 Labs, www.ai21.com/blog/announcing-ai21-studio-and-jurassic-1 (accessed 25 November 2022).
Alex, N. et al. (2021), “RAFT: A real-world few-shot text classification benchmark”, arXiv, arXiv:2109.14076 [cs.CL], https://arxiv.org/abs/2109.14076v1.
Austin, J. et al. (2021), “Program synthesis with large language models”, arXiv, arXiv:2108.07732 [cs.PL], https://doi.org/10.48550/arXiv.2108.07732.
Bommasani, R. et al. (2021), “On the opportunities and risks of foundation models”, arXiv, arXiv:2108.07258 [cs.LG], http://arxiv.org/abs/2108.07258.
Chan, J. (2021), “Sustainable authorship models for a discourse-based scholarly communication infrastructure”, Commonplace, Vol. 1/1, p. 8, http://dx.doi.org/10.21428/6ffd8432.a7503356.
Cohere (2022), Cohere website, https://cohere.ai (accessed 23 November 2022).
Elicit (2022), Elicit website, https://elicit.org (accessed 23 November 2022).
EleutherAI (2022), EleutherAI website, www.eleuther.ai (accessed 25 November 2022).
Google Scholar (2022), Google Scholar website, https://scholar.google.com (accessed 25 November 2022).
Henighan, T. et al. (2020), “Scaling laws for autoregressive generative modeling”, arXiv, arXiv:2010.14701 [cs.LG], https://doi.org/10.48550/arXiv.2010.14701.
Kaplan, J. et al. (2020), “Scaling laws for neural language models”, arXiv, arXiv:2001.08361 [cs.LG], https://doi.org/10.48550/arXiv.2001.08361.
Kirchner, J.H. (2021), “Making of #IAN”, 29 August, Substack, https://universalprior.substack.com/p/making-of-ian.
Licklider, J.C.R. (1960), “Man-computer symbiosis”, IRE Transactions on Human Factors in Electronics, Vol. HFE-1, March, pp. 4-11, https://groups.csail.mit.edu/medg/people/psz/Licklider.html.
Lin, S., J. Hilton and O. Evans (2021), “TruthfulQA: Measuring how models mimic human falsehoods”, arXiv, arXiv:2109.07958 [cs.CL], https://doi.org/10.48550/arXiv.2109.07958.
May, C. et al. (2019), “On measuring social biases in sentence encoders”, arXiv, arXiv:1903.10561 [cs.CL], http://arxiv.org/abs/1903.10561.
Nangia, N. et al. (2020), “CrowS-Pairs: A challenge dataset for measuring social biases in masked language models”, arXiv, arXiv:2010.00133 [cs.CL], http://arxiv.org/abs/2010.00133.
Robot Reviewer (2022), “About Robot Reviewer”, webpage, www.robotreviewer.net/about (accessed 25 November 2022).
Semantic Scholar (2022), Semantic Scholar website, www.semanticscholar.org/ (accessed 25 November 2022).
Tenopir, C. et al. (2015), “Scholarly article seeking, reading, and use: A continuing evolution from print to electronic in the sciences and social sciences”, Learned Publishing, Vol. 28/2, pp. 93-105, https://doi.org/10.1087/20150203.
1. For an in-depth discussion, see chapter 5.2 of Bommasani et al. (2021).