
AGI Benchmarks: Tracking Progress Toward AGI Isn’t Easy


Buzzwords in the field of artificial intelligence can be technical: perceptron, convolution, transformer. These refer to specific computing approaches. A recent term sounds more mundane but has revolutionary implications: timeline. Ask someone in AI for their timeline, and they’ll tell you when they expect the arrival of AGI—artificial general intelligence—which is sometimes defined as AI technology that can match the abilities of humans at most tasks. As AI’s sophistication has scaled—thanks to faster computers, better algorithms, and more data—timelines have compressed. The leaders of major AI labs, including OpenAI, Anthropic, and Google DeepMind, have recently said they expect AGI within a few years.

A computer system that thinks like us would enable close collaboration. Both the immediate and long-term impacts of AGI, if achieved, are unclear, but expect to see changes in the economy, scientific discovery, and geopolitics. And if AGI leads to superintelligence, it may even affect humanity’s placement in the predatory pecking order. So it’s imperative that we track the technology’s progress in preparation for such disruption. Benchmarking AI’s capabilities allows us to shape legal regulations, engineering goals, social norms, and business models—and to understand intelligence more broadly.

While benchmarking any intellectual ability is tough, doing so for AGI presents special challenges. That’s in part because people strongly disagree on its definition: Some define AGI by its performance on benchmarks, others by its internal workings, its economic impact, or vibes. So the first step toward measuring the intelligence of AI is agreeing on the general concept.

Another issue is that AI systems have different strengths and weaknesses from humans, so even if we define AGI as “AI that can match humans at most tasks,” we can debate which tasks really count, and which humans set the standard. Direct comparisons are difficult. “We’re building alien beings,” says Geoffrey Hinton, a professor emeritus at the University of Toronto who won a Nobel Prize for his work on AI.

Undaunted, researchers are busy designing and proposing tests that might lend some insight into our future. But a question remains: Can these tests tell us if we’ve achieved the long-sought goal of AGI?

Why It’s So Hard to Test for Intelligence

There are infinite kinds of intelligence, even in humans. IQ tests provide a kind of summary statistic by including a range of semirelated tasks involving memory, logic, spatial processing, mathematics, and vocabulary. Sliced differently, performance on each task relies on a mixture of what’s called fluid intelligence—reasoning on the fly—and crystallized intelligence—applying learned knowledge or skills.

For humans in high-income countries, IQ tests often predict key outcomes, such as academic and career success. But we can’t make the same assumptions about AI, whose abilities aren’t bundled in the same way. An IQ test designed for humans might not say the same thing about a machine as it does about a person.

There are other kinds of intelligence that aren’t usually evaluated by IQ tests—and are even further out of reach for most AI benchmarks. These include types of social intelligence, such as the ability to make psychological inferences, and types of physical intelligence, such as an understanding of causal relations between objects and forces or the ability to coordinate a body in an environment. Both are crucial for humans navigating complex situations.

[Photo] Clever Hans, a German horse in the early 1900s, seemed able to do math—but was really responding to his trainer’s subtle cues, a classic case of misinterpreting performance. Alamy

Intelligence testing is hard—in people, animals, or machines. You must beware of both false positives and false negatives. Maybe the test taker appears smart only by taking shortcuts, like Clever Hans, the famous horse that appeared to be capable of math but actually responded to nonverbal cues. Or maybe test takers appear stupid only because they are unfamiliar with the testing procedure or have perceptual difficulties.

It’s also hard because notions of intelligence vary across place and time. “There is an interesting shift in our society in terms of what we think intelligence is and what aspects of it are valuable,” says Anna Ivanova, an assistant professor of psychology at Georgia Tech. For example, before encyclopedias and the Internet, “having a large access to facts in your head was considered a hallmark of intelligence.” Now we increasingly prize fluid over crystallized intelligence.

The History of AI Intelligence Tests

Over the years, many people have presented machines with grand challenges that purported to require intelligence on par with our own. In 1958, a trio of prominent AI researchers wrote, “Chess is the intellectual game par excellence.… If one could devise a successful chess machine, one would seem to have penetrated to the core of human intellectual endeavor.” They did acknowledge the theoretical possibility that such a machine “might have discovered something that was as the wheel to the human leg: a device quite different from humans in its methods, but supremely effective in its way, and perhaps very simple.” But they stood their ground: “There appears to be nothing of this sort in sight.” In 1997, something of this sort was very much in sight when IBM’s Deep Blue computer beat Garry Kasparov, the reigning chess champion, while lacking the general intelligence even to play checkers.

[Photo] IBM’s Deep Blue defeated world chess champion Garry Kasparov in 1997, but didn’t have enough general intelligence to play checkers. Adam Nadel/AP

In 1950, Alan Turing proposed the imitation game, a version of which requires a machine to pass as a human in typewritten conversation. “The question and answer method seems to be suitable for introducing almost any one of the fields of human endeavour that we wish to include,” he wrote. For decades, passing what’s now called the Turing test was considered a nearly impossible challenge and a strong indicator of AGI.

But this year, researchers reported that when people conversed with both another person and OpenAI’s GPT-4.5 for 5 minutes and then had to guess which one was human, they picked the AI 73 percent of the time. Meanwhile, top language models frequently make mistakes that few people ever would, like miscounting the number of times the letter r occurs in strawberry. They appear to be more wheel than human leg. So scientists are still searching for measures of humanlike intelligence that can’t be hacked.

The ARC Test for AGI

There’s one AGI benchmark that, while not perfect, has gained a high profile as a foil for most new frontier models. In 2019, François Chollet, then a software engineer at Google and now a founder of the AI startup Ndea, released a paper titled “On the Measure of Intelligence.” Many people equate intelligence to ability, and general intelligence to a broad set of abilities. Chollet takes a narrower view of intelligence, counting only one specific ability as important—the ability to acquire new abilities easily. Large language models (LLMs) like those powering ChatGPT do well on many benchmarks only after training on trillions of written words. When LLMs encounter a situation very unlike their training data, they frequently flop, unable to adjust. In Chollet’s sense, they lack intelligence.

To go along with the paper, Chollet created a new AGI benchmark, called the Abstraction and Reasoning Corpus (ARC). It features hundreds of visual puzzles, each with several demonstrations and one test. A demonstration has an input grid and an output grid, both filled with colored squares. The test has just an input grid. The challenge is to learn a rule from the demonstrations and apply it in the test, creating a new output grid.

[Figure: example ARC puzzles with input and output grids] The Abstraction and Reasoning Corpus challenges AI systems to infer abstract rules from just a few examples. Given examples of input-output grids, the system must apply the hidden pattern to a new test case—something humans find easy but machines still struggle with. ARC Prize
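To make the setup concrete, here is a minimal Python sketch of an ARC-style task and a brute check of a candidate rule. The structure roughly mirrors the public ARC data format, with "train" and "test" lists of small integer grids, but the specific task and the mirror rule below are invented for illustration.

```python
# A minimal sketch of how an ARC-style task could be represented and checked.
# Grids are small 2-D arrays of integers (each integer is a color); the solver
# must find a transform that maps every demonstration input to its output.

from typing import Callable, List

Grid = List[List[int]]

def fits_demonstrations(transform: Callable[[Grid], Grid],
                        demos: List[dict]) -> bool:
    """Return True if the candidate transform reproduces every demo output."""
    return all(transform(d["input"]) == d["output"] for d in demos)

# Hypothetical task: the hidden rule is "mirror the grid left-to-right".
task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 3, 0]],      "output": [[0, 3, 3]]},
    ],
    "test": [{"input": [[0, 5], [0, 7]]}],
}

mirror = lambda g: [list(reversed(row)) for row in g]

if fits_demonstrations(mirror, task["train"]):
    prediction = mirror(task["test"][0]["input"])
    print(prediction)  # [[5, 0], [7, 0]]
```

Real ARC puzzles are far harder to crack than this example, because the hidden rule must be composed on the fly from core concepts rather than picked from a fixed menu of transforms.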

ARC focuses on fluid intelligence. “To solve any problem, you need some knowledge, and then you’re going to recombine that knowledge on the fly,” Chollet told me. To make it a test not of stored knowledge but of how one recombines it, the training puzzles are supposed to supply all the “core knowledge priors” one needs. These include concepts like object cohesion, symmetry, and counting—the kind of common sense a small child has. Given this training and just a few examples, can you figure out which knowledge to apply to a new puzzle? Humans can do most of the puzzles easily, but AI struggled, at least at first. Eventually, OpenAI created a version of its o3 reasoning model that outperformed the average human test taker, achieving a score of 88 percent—albeit at an estimated computing cost of US $20,000 per puzzle. (OpenAI never released that model, so it’s not on the leaderboard chart.)

This March, Chollet introduced a harder version, called ARC-AGI-2. It’s overseen by his new nonprofit, the ARC Prize Foundation. “Our mission is to serve as a North Star towards AGI through enduring benchmarks,” the group announced. ARC Prize is offering a million dollars in prize money, the bulk going to teams whose trained AIs can solve 85 percent of 120 new puzzles using only four graphics processors for 12 hours or less. The new puzzles are more complex than those from 2019, sometimes requiring the application of multiple rules, reasoning for multiple steps, or interpreting symbols. The average human score is 60 percent, and as of this writing the best AI score is about 16 percent.
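The headline constraints above can be captured in a quick back-of-the-envelope check. This is only a sketch of the numbers quoted in this article, not the official contest rules, which include further requirements.

```python
# Illustrative check of the headline ARC-AGI-2 grand-prize constraints described
# above; the official rules go beyond these figures.

def meets_grand_prize_bar(puzzles_solved: int,
                          total_puzzles: int = 120,
                          gpus_used: int = 4,
                          hours_used: float = 12.0) -> bool:
    solved_enough = puzzles_solved / total_puzzles >= 0.85   # 85% of 120 puzzles
    within_budget = gpus_used <= 4 and hours_used <= 12.0    # 4 GPUs, 12 hours
    return solved_enough and within_budget

print(meets_grand_prize_bar(puzzles_solved=102))  # True: 102/120 = 85%
```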

[Charts: score versus cost per task on ARC-AGI-1 and ARC-AGI-2] AI models have made gradual progress on the first version of the ARC-AGI benchmark, which was introduced in 2019. This year, the ARC Prize launched a new version with harder puzzles, which AI models are struggling with. Models are labeled low, medium, high, or thinking to indicate how much computing power they expend on their answers, with “thinking” models using the most. ARC Prize

AI experts acknowledge ARC’s value, and also its flaws. Jiaxuan You, a computer scientist at the University of Illinois at Urbana-Champaign, says ARC is “a very good theoretical benchmark” that can shed light on how algorithms function, but “it’s not taking into account the real-world complexity of AI applications, such as social reasoning tasks.”

Melanie Mitchell, a computer scientist at the Santa Fe Institute, says it “captures some interesting capabilities that humans have,” such as the ability to abstract a new rule from a few examples. But given the narrow task format, she says, “I don’t think it captures what people mean when they say general intelligence.”

Despite these caveats, ARC-AGI-2 may be the AI benchmark with the biggest performance gap between advanced AI and regular people, making it a potent indicator of AGI’s headway. What’s more, ARC is a work in progress. Chollet says AI might match human performance on the current test in a year or two, and he’s already working on ARC-AGI-3. Each task will be like a miniature video game, in which the player needs to figure out the relevant concepts, the possible actions, and the goal.

What Attributes Should an AGI Benchmark Test?

Researchers keep rolling out benchmarks that probe different aspects of general intelligence. Yet each also reveals how incomplete our map of the territory remains.

One recent paper introduced General-Bench, a benchmark that uses five input modalities—text, images, video, audio, 3D—to test AI systems on hundreds of tasks that demand recognition, reasoning, creativity, ethical judgment, and other abilities to both comprehend and generate material. Ideally, an AGI would show synergy, leveraging abilities across tasks to outperform the best AI specialists. But at present, no AI can even handle all five modalities.
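General-Bench defines its own scoring, but the synergy idea can be sketched simply: compare a generalist’s per-task scores against the best specialist on each task. Everything below (task names, model names, scores) is hypothetical and only illustrates the comparison being described.

```python
# A rough sketch of the "synergy" idea, not General-Bench's actual metric:
# a generalist shows synergy on a task if it beats the best specialist there,
# and overall if it does so on a large share of tasks.

scores = {
    # task: (generalist score, {specialist: score})  -- all values hypothetical
    "image_captioning": (0.81, {"caption_expert": 0.78, "vision_expert": 0.74}),
    "audio_reasoning":  (0.62, {"audio_expert": 0.70}),
    "3d_grounding":     (0.55, {"3d_expert": 0.51}),
}

def synergy_rate(scores):
    wins = sum(gen > max(spec.values()) for gen, spec in scores.values())
    return wins / len(scores)

print(f"Generalist beats the best specialist on {synergy_rate(scores):.0%} of tasks")
```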

Other benchmarks involve virtual worlds. An April paper in Nature reports on Dreamer, a general algorithm from Google DeepMind that learned to perform over 150 tasks, including playing Atari games, controlling virtual robots, and obtaining diamonds in Minecraft. These tasks require perception, exploration, long-term planning, and interaction, but it’s unclear how well Dreamer would handle real-world messiness. Controlling a video game is easier than controlling a real robot, says Danijar Hafner, the paper’s lead author: “The character never falls on his face.” The tasks also lack rich interaction with humans and an understanding of language in the context of gestures and surroundings. “You should be able to tell your household robot, ‘Put the dishes into that cabinet and not over there,’ and you point at [the cabinet] and it understands,” he says. Hafner says his team is working to make the simulations and tasks more realistic.

Aside from these extant benchmarks, experts have long debated what an ideal demonstration would look like. Back in 1970, the AI pioneer Marvin Minsky told Life that in “three to eight years we will have a machine with the general intelligence of an average human being. I mean a machine that will be able to read Shakespeare, grease a car, play office politics, tell a joke, have a fight.” That panel of tasks seems like a decent start, if you could operationalize the game of office politics.

One 2024 paper in Engineering proposed the Tong test (tong is Chinese for “general”). Virtual people would be assigned randomized tasks that test not only understanding but values. For example, AIs might unexpectedly encounter money on the floor or a crying baby, giving researchers the opportunity to observe what the AIs do. The authors argue that benchmarks should test an AI’s ability to explore and set its own goals, its alignment with human values, its causal understanding, and its ability to control a virtual or physical body. What’s more, the benchmark should be capable of generating an infinite number of tasks involving dynamic physical and social interactions.

Others, like Minsky, have suggested tests that require interacting with the real world to various degrees: making coffee in an unfamiliar kitchen, turning a hundred thousand dollars into a million, or attending college on campus and earning a degree. Unfortunately, some of these tests are impractical and risk causing real-world harm. For example, an AI might earn its million by scamming people.

I asked Hinton, the Nobel Prize winner, what skills will be the hardest for AI to acquire. “I used to think it was things like figuring out what other people are thinking,” he said, “but it’s already doing some of that. It’s already able to do deception.” (In a recent multi-university study, an LLM outperformed humans at persuading test takers to select wrong answers.) He went on: “So, right now my answer is plumbing. Plumbing in an old house requires reaching into funny crevices and screwing things the right way. And I think that’s probably safe for another 10 years.”

Researchers debate whether the ability to perform physical tasks is required to demonstrate AGI. A paper from Google DeepMind on measuring levels of AGI says no, arguing that intelligence can show itself in software alone. Its authors frame physical ability as an add-on rather than a requirement for AGI.

Mitchell of the Santa Fe Institute says we should test capabilities involved in doing an entire job. She noted that AI can do many tasks of a human radiologist but can’t replace the human because the job entails a lot of tasks that even the radiologist doesn’t realize they’re doing, like figuring out what tasks to do and dealing with unexpected problems. “There’s such a long tail of things that can happen in the world,” she says. Some robotic vacuum cleaners weren’t trained to recognize dog poop, she notes, and so they smeared it around the carpet. “There’s all kinds of stuff like that that you don’t think of when you’re building an intelligent system.”

Some scientists say we should observe not only performance but what’s happening under the hood. A recent paper coauthored by Jeff Clune, a computer scientist at the University of British Columbia, in Canada, reports that deep learning often leads AI systems to create “fractured entangled representations”—basically a bunch of jury-rigged shortcuts wired together. Humans, though, look for broad, elegant regularities in the world. An AI system might appear intelligent based on one test, but if you don’t know the system’s innards, you could be surprised when you deploy it in a new situation and it applies the wrong rule.

AGI Is Already Here, and Never Will Be

The author Lewis Carroll once wrote of a character who used a map of the nation “on the scale of a mile to the mile!” before eventually using the country as its own map. In the case of intelligence testing, the most thorough map of how someone will perform in a situation is to test them in the situation itself. In that vein, a strong test of AGI might be to have a robot live a full human life and, say, raise a child to adulthood.

“Ultimately, the real test of the capabilities of AI is what they do in the real world,” Clune told me. “So rather than benchmarks, I prefer to look at which scientific discoveries [AIs] make, and which jobs they automate. If people are hiring them to do work instead of a human and sticking with that decision, that’s extremely telling about the capabilities of AI.” But sometimes you want to know how well something will do before asking it to replace a person.

We may never agree on what AGI or “humanlike” AI means, or what suffices to prove it. As AI advances, machines will still make mistakes, and people will point to these and say the AIs aren’t really intelligent. Ivanova, the psychologist at Georgia Tech, was on a panel recently, and the moderator asked about AGI timelines. “We had one person saying that it might never happen,” Ivanova told me, “and one person saying that it already happened.” So the term “AGI” may be convenient shorthand to express an aim—or a fear—but its practical use may be limited. In most cases, it should come with an asterisk, and a benchmark.
