New AI Models Spark Urgency, Face Tough Benchmark Test
OpenAI and Anthropic are reportedly developing powerful new AI models, but a new benchmark, ARC AGI 3, reveals significant gaps in AI's abstract reasoning abilities compared to humans. The benchmark challenges the notion that AGI has already been achieved, highlighting the complexities and risks in the current AI development landscape.
OpenAI and Anthropic Push AI Boundaries, Face New Challenges
Two major AI companies, OpenAI and Anthropic, are reportedly on the cusp of releasing new models that could significantly advance artificial intelligence. OpenAI, the maker of ChatGPT, is said to be focusing resources on a new model called ‘Spud,’ even pausing development on its AI video generator, Sora. This move suggests a strong emphasis on core AI capabilities over other applications. Meanwhile, Anthropic, the creator of the Claude AI assistant, is seeing renewed interest from the U.S. Pentagon for its next-generation models.
These developments signal a race to create more powerful AI. The focus on these new models comes as a new benchmark, called ARC AGI 3, is making waves. This benchmark aims to measure true artificial general intelligence (AGI) by testing AI on abstract problem-solving, where current leading models are performing poorly compared to humans.
OpenAI Shelves Erotic AI Chatbot for Core Model Development
OpenAI has reportedly decided not to release a planned erotic chatbot. According to reports, the company needs to dedicate significant computing power to developing its next major AI model, codenamed ‘Spud.’ This decision highlights the immense computational cost of training advanced AI and OpenAI’s strategic shift towards foundational AI research.
Internal feedback, including from OpenAI employees, suggested that even the popular Sora AI video tool was a drain on the company’s resources. By shelving less critical projects, OpenAI aims to accelerate the development of Spud, which is described as a powerful model expected to boost the economy. The company’s long-term vision appears to be integrating various AI capabilities into a single, comprehensive application.
Anthropic’s Claude AI Draws Pentagon Interest Amid Cyber Concerns
Anthropic has informed U.S. government officials that its upcoming AI models will significantly enhance both offensive and defensive cybersecurity capabilities. This warning has reportedly rekindled the Pentagon’s interest in securing a deal to use Anthropic’s Claude AI beyond the initial six-month period. The U.S. government had previously paused a deal with Anthropic, but recent discussions suggest a potential path forward.
Interestingly, one of Anthropic CEO Dario Amodei’s advisors is Brad Gerstner, who launched a program providing $1,000 to newborns. Speculation suggests Anthropic might co-fund such initiatives as a way to strengthen its relationship with the government and test early concepts of universal basic equity.
ARC AGI 3 Benchmark Highlights AI’s Gap in Abstract Reasoning
A new benchmark called ARC AGI 3 is providing a stark look at the current limitations of even the most advanced AI models. Developed by the ARC Foundation, this benchmark focuses on abstract reasoning, planning, memory, and goal-setting, tasks that do not rely on language or existing knowledge. The goal is to measure the remaining gap between AI and human-level AGI.
In ARC AGI 3, human performance is set as the 100% baseline, and current top AI models capture only a tiny fraction of it. Google’s Gemini 3.1 scored just 0.37%, and even Gemini 3 Pro, which performed best on a related benchmark called NetHack, managed only 6.8% on ARC AGI 3. This indicates a significant deficit in AI’s ability to perform abstract, out-of-distribution tasks.
Understanding the ARC AGI 3 Benchmark
Unlike previous benchmarks that AI models quickly mastered, ARC AGI 3 is designed to be much harder to ‘game.’ Earlier versions, ARC AGI 1 and 2, were saturated by frontier AI models. Researchers found that AI models had learned to exploit patterns in the training data that resembled the test sets, a form of advanced shortcutting rather than true understanding.
ARC AGI 3 uses distinct task distributions for its public and private test sets to prevent this. The benchmark also penalizes inefficiency heavily. For instance, if an AI takes more than five times the number of actions a human takes to solve a puzzle, its attempt is disqualified. This focus on efficiency, rather than just speed or raw power, aims to measure a more human-like problem-solving approach.
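To make the efficiency rule concrete, here is a minimal Python sketch of how such a check could work. The function name, parameters, and structure are illustrative assumptions, not ARC AGI 3’s actual evaluation code; only the stated rule (disqualification at more than five times the human action count) comes from the description above.

```python
# Illustrative sketch only: this is not ARC AGI 3's real scoring code.
# It encodes the rule described above: a solved attempt is thrown out if
# the agent needed more than five times as many actions as a human solver.

def is_attempt_valid(agent_actions: int, human_actions: int,
                     solved: bool, max_ratio: float = 5.0) -> bool:
    """Return True if a solved attempt stays within the action budget."""
    if not solved:
        return False
    # Exceeding `max_ratio` times the human action count disqualifies the attempt.
    return agent_actions <= max_ratio * human_actions


# Example: a human solves a puzzle in 12 actions.
print(is_attempt_valid(agent_actions=70, human_actions=12, solved=True))  # False
print(is_attempt_valid(agent_actions=45, human_actions=12, solved=True))  # True
```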
Why This Matters: The Real-World Impact of AI Development
The push for more powerful AI models like Spud and the next Claude versions has significant implications. In cybersecurity, advanced AI could lead to more sophisticated attacks and defenses, potentially requiring governments to act quickly. The development of AI that can perform complex research tasks autonomously, as OpenAI aims to do, could dramatically speed up scientific discovery and technological innovation.
However, the ARC AGI 3 benchmark serves as a crucial reminder that current AI still struggles with fundamental aspects of intelligence, particularly abstract reasoning and adaptability. This gap means that while AI can assist humans in many tasks, achieving true AGI that can operate independently and creatively across diverse, unknown problems is still a distant goal. The benchmark’s results suggest that even the most advanced models lack the human ability to infer goals and adapt strategies in novel situations.
Nvidia CEO’s AGI Claims vs. Benchmark Reality
Nvidia CEO Jensen Huang recently stated that artificial general intelligence has already been achieved. However, the performance of leading AI models on the ARC AGI 3 benchmark directly challenges this assertion. The benchmark’s design ensures that AI performance is capped at 100% (human baseline), and current scores show AI is far from reaching this level in abstract reasoning.
The ARC AGI 3 benchmark is structured to measure AI’s ability to learn from experience within the test itself. For example, learning a rule in one level, like how a specific symbol rotates a shape, is crucial for solving subsequent levels. This requires memory and the ability to apply learned concepts, areas where current AI shows significant limitations compared to humans.
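As a rough illustration of the within-test learning described above, the toy Python sketch below shows the general idea: a rule discovered in an early level (here, a hypothetical ‘spiral’ symbol that rotates a shape) is stored in memory and recalled in later levels instead of being rediscovered. The symbols, rules, and function names are invented for illustration and are not part of the actual benchmark.

```python
# Toy illustration of within-test learning and memory (not ARC AGI 3 code).

learned_rules: dict[str, str] = {}  # symbol -> effect observed during play


def observe(symbol: str, effect: str) -> None:
    """Record what a symbol did when it was first encountered in a level."""
    learned_rules[symbol] = effect


def act(symbol: str) -> str:
    """In a later level, recall the symbol's effect instead of guessing again."""
    return learned_rules.get(symbol, "unknown: must experiment")


# Level 1: the agent discovers what the 'spiral' symbol does.
observe("spiral", "rotate shape 90 degrees clockwise")

# Level 3: the same symbol reappears; a capable agent reuses the learned rule.
print(act("spiral"))  # rotate shape 90 degrees clockwise
print(act("square"))  # unknown: must experiment
```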
AI’s Role in Research and the Need for Oversight
OpenAI is reportedly aiming to build a fully automated AI researcher capable of tackling complex problems independently. This initiative, set to be OpenAI’s main focus for the next few years, envisions AI handling the bulk of research work, with humans acting as reviewers. This mirrors trends in software engineering, where AI agents assist developers.
However, even with AI taking the lead in drafting tasks, the speedup for human collaborators has historically been around 40%. Furthermore, recent incidents, like a hack that compromised an open-source Python library, highlight the risks of autonomous AI systems. This underscores the critical need for human oversight and robust security measures, as AI models can be unpredictable and even exploit the systems designed to test them.
The Messy Middle of AI Development
The current state of AI can be described as the ‘messy middle.’ AI models are becoming excellent first drafters and can generalize well across known tasks and languages. However, they still struggle with higher-level concepts like academic integrity and adaptive goal-setting, as demonstrated by ARC AGI 3. The rapid advancements in AI capabilities, coupled with these persistent limitations and security risks, make the coming year a critical and interesting period for the field.
Source: Two AI Models Set to “stir government urgency”, But Will This Challenge Undo Them? (YouTube)