The Arc Prize Foundation, co-founded by AI researcher François Chollet, has introduced a new benchmark called ARC-AGI-2 to evaluate artificial general intelligence (AGI). Unlike its predecessor, this test measures both accuracy and efficiency, posing a significant challenge for AI models across the industry.
Understanding ARC-AGI-2
The ARC-AGI-2 test consists of visual puzzles in which AI models must analyze grids of different-colored squares, identify the underlying pattern, and generate the correct output grid. Designed to measure a model's ability to adapt and reason, it is built to resist brute-force computation, a weakness that models could exploit in the original ARC-AGI-1 test.
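To make the puzzle format concrete, here is a minimal sketch in Python. It assumes grids are 2D lists of small integers (one integer per color) and uses two toy transformation rules; the rule names and the task format here are illustrative, not the official ARC-AGI-2 specification.

```python
# Illustrative sketch of an ARC-style task (not the official format):
# a task supplies a few training input/output grid pairs, and the
# solver must infer the transformation and apply it to a test input.

def flip_horizontal(grid):
    """Mirror each row left-to-right."""
    return [row[::-1] for row in grid]

def transpose(grid):
    """Swap rows and columns."""
    return [list(col) for col in zip(*grid)]

# A tiny, hypothetical hypothesis space of candidate rules.
CANDIDATE_RULES = [flip_horizontal, transpose]

def solve(train_pairs, test_input):
    """Return the output of the first rule consistent with all training pairs."""
    for rule in CANDIDATE_RULES:
        if all(rule(inp) == out for inp, out in train_pairs):
            return rule(test_input)
    return None  # no candidate rule explains the training pairs

train = [
    ([[1, 0], [2, 3]], [[0, 1], [3, 2]]),  # output is the mirrored input
]
print(solve(train, [[5, 6], [7, 8]]))  # → [[6, 5], [8, 7]]
```

Real ARC tasks are far harder: the hypothesis space is open-ended, so a solver must compose novel rules rather than check a fixed list, which is precisely the adaptive reasoning the benchmark targets.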
According to the Arc Prize leaderboard, models like OpenAI’s o1-pro and DeepSeek’s R1 scored between 1% and 1.3%, while others like GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash hovered around 1%. In contrast, human participants averaged 60% accuracy.
Addressing Previous Flaws
Unlike ARC-AGI-1, the new benchmark prioritizes efficiency. Models are assessed not only on their ability to solve problems but also on how resource-efficient they are. ARC-AGI-2 minimizes the role of brute-force computation, making it a more accurate reflection of a model’s true intelligence.
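To illustrate why adding an efficiency axis matters, the toy comparison below ranks models by raw accuracy versus accuracy per dollar spent per task. This is a hypothetical illustration, not the official ARC-AGI-2 scoring method; only the o3 cost figure comes from the article, and the other per-task costs are assumed for the example.

```python
# Hypothetical illustration of an efficiency-adjusted leaderboard
# (not the official ARC-AGI-2 scoring).

results = {
    # model: (accuracy %, cost per task in USD)
    "o3 (low)": (4.0, 200.0),  # cost figure reported in the article
    "o1-pro":   (1.3, 5.0),    # cost assumed for illustration
    "R1":       (1.2, 0.10),   # cost assumed for illustration
}

def efficiency(acc, cost):
    """Accuracy points earned per dollar spent on a task."""
    return acc / cost

by_accuracy = sorted(results, key=lambda m: results[m][0], reverse=True)
by_efficiency = sorted(results, key=lambda m: efficiency(*results[m]), reverse=True)

print(by_accuracy)    # ['o3 (low)', 'o1-pro', 'R1']
print(by_efficiency)  # ['R1', 'o1-pro', 'o3 (low)']
```

Under these assumed costs, the ranking inverts: the model with the highest raw accuracy becomes the least efficient, which is the kind of distinction ARC-AGI-2 is meant to surface.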
Greg Kamradt, Arc Prize Foundation co-founder, explained:
“Intelligence is not solely defined by problem-solving ability. Efficiency in acquiring and applying skills is crucial in evaluating AI capabilities.”
Performance Highlights
Even OpenAI’s o3 (low) model, which scored 75.7% on ARC-AGI-1, managed only 4% on ARC-AGI-2 while spending approximately $200 per task. This result highlights the demanding nature of the new benchmark.
Performance Comparison:
- o1-pro: 1.3%
- R1: 1.2%
- GPT-4.5: 1%
- Claude 3.7 Sonnet: 1%
- Gemini 2.0 Flash: 1%
- Humans: 60%
Industry Response
The introduction of ARC-AGI-2 has sparked conversations across the AI community. Industry leaders, including Thomas Wolf, co-founder of Hugging Face, have emphasized the need for more robust benchmarks to measure AGI advancements.
To encourage innovation, the Arc Prize Foundation has announced the Arc Prize 2025 Contest. The challenge? Achieve 85% accuracy on ARC-AGI-2 using no more than $0.42 per task. This initiative aims to drive progress in the development of efficient, capable AI systems.