Researchers use popular “Ace Attorney” video game to test how well AI can actually reason

By Sharecaster | April 28, 2025 | Technology



Summary

Researchers have put leading AI models through a new kind of test—one that measures how well they can reason their way to a courtroom victory. The results highlight some clear differences in both performance and cost.

A team from the Hao AI Lab at the University of California San Diego evaluated current language models using “Phoenix Wright: Ace Attorney,” a game that requires players to collect evidence, spot contradictions, and expose the truth behind lies.

The models had to sift through long conversations, spot inconsistencies during cross-examination, and select the appropriate evidence to challenge witness statements.
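To make the setup concrete, here is a minimal sketch of how a single contradiction-spotting step could be posed to a model through an OpenAI-style chat API. The testimony, evidence items, and prompt wording are invented for illustration; this is not the Hao AI Lab harness.

```python
# Hypothetical sketch of one cross-examination step: the model is given a
# witness statement plus the court record and must name the contradicting
# evidence. Invented data; not the actual Hao AI Lab benchmark code.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

testimony = "I was at home all night and never went near the harbor."
court_record = [
    "1. Parking stub from the harbor lot, stamped 11:42 PM",
    "2. Grocery receipt from the previous afternoon",
    "3. Photo of the defendant's apartment",
]

prompt = (
    "During cross-examination the witness testifies:\n"
    f'"{testimony}"\n\n'
    "Court record:\n" + "\n".join(court_record) + "\n\n"
    "Which numbered item contradicts the testimony? "
    "Answer with the number and a one-sentence justification."
)

response = client.chat.completions.create(
    model="o1",  # any chat-capable model under test
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```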

The experiment was partly inspired by OpenAI co-founder Ilya Sutskever, who once compared next-word prediction to understanding a detective story. Sutskever recently secured additional multi-billion-dollar funding for a new AI venture.

o1 leads, Gemini follows

The researchers tested several top multimodal and reasoning models, including OpenAI o1, Gemini 2.5 Pro, Claude 3.7-thinking, and Llama 4 Maverick. Both o1 and Gemini 2.5 Pro advanced to level 4, but o1 came out ahead on the toughest cases.

[Figure: horizontal bar chart comparing eight language models on the Ace Attorney test, scores 0-26. With scores of 26 and 20, o1-2024-12-17 and Gemini 2.5 Pro achieved the highest results. | Image: Hao AI Lab]


The test goes beyond simple text or image analysis. As the team explains, models must search long contexts for contradictions, interpret visual information precisely, and make strategic decisions as the game unfolds.

“Game design pushes AI beyond pure textual and visual tasks by requiring it to convert understanding into context-aware actions. It is harder to overfit because success here demands reasoning over context-aware action space – not just memorization,” the researchers explain.
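As a rough picture of what "reasoning over a context-aware action space" can look like in a harness, the skeleton below drives one trial of a game-playing agent. The observation fields, action types, and the env/agent interfaces are hypothetical stand-ins, not the lab's actual code.

```python
# Hypothetical evaluation-loop skeleton: at each step the agent sees the
# accumulated transcript plus the current frame and must pick a
# context-dependent action, e.g. press the witness or present an exhibit.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Observation:
    transcript: str        # full dialogue so far; can grow very long
    screenshot_png: bytes  # current frame, for multimodal models
    court_record: list     # evidence currently available

@dataclass
class Action:
    kind: str                      # "press" or "present"
    exhibit: Optional[str] = None  # required when kind == "present"

def run_episode(env, agent, max_steps: int = 200) -> int:
    """Drive one trial; return the score awarded by the environment."""
    obs = env.reset()
    for _ in range(max_steps):
        action = agent.act(obs)       # the model call happens in here
        obs, done = env.step(action)  # presenting wrong evidence is penalized
        if done:
            break
    return env.score()
```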

Overfitting occurs when a language model memorizes its training data, noise and errors included, and then performs poorly on new, unfamiliar examples. The issue also arises with reasoning models optimized for math and code tasks: they may become more efficient at finding correct solutions, but they also narrow the diversity of solution paths they consider.
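As a generic illustration of this failure mode (not the lab's setup), the scikit-learn sketch below fits one unconstrained decision tree and one depth-limited tree on deliberately noisy labels. The unconstrained tree memorizes the noise and scores perfectly on training data while doing worse on held-out examples.

```python
# Generic overfitting demo: a model that memorizes noisy training labels
# scores perfectly on them but generalizes worse than a constrained model.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] > 0).astype(int)                         # true underlying rule
y_noisy = np.where(rng.random(400) < 0.2, 1 - y, y)   # flip 20% of labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y_noisy, random_state=0)

overfit = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
limited = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

for name, model in [("unconstrained", overfit), ("max_depth=3", limited)]:
    print(f"{name}: train={model.score(X_tr, y_tr):.2f}, "
          f"test={model.score(X_te, y_te):.2f}")
```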

Gemini 2.5 Pro offers better price-performance

Gemini 2.5 Pro turned out to be significantly more cost-efficient than the other models tested. Hao AI Lab reports that it is six to fifteen times cheaper than o1, depending on the case. In one particularly lengthy Level 2 scenario, o1 ran up costs of $45.75, while Gemini 2.5 Pro completed the task for $7.89.


Gemini 2.5 Pro also outperformed GPT-4.1—which is not specifically optimized for reasoning—in terms of cost, at $1.25 per million input tokens compared to $2 for GPT-4.1. The researchers note, however, that the actual costs could be slightly higher due to image processing requirements.
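For a back-of-the-envelope sense of how these per-token prices translate into run costs, the snippet below applies the cited input prices to a token count; the token count itself is a made-up placeholder, and real runs add output-token and image-processing costs on top.

```python
# Rough input-cost estimate from the per-million-token prices cited above.
# The token count is hypothetical; output tokens and images cost extra.
PRICE_PER_M_INPUT_USD = {
    "gemini-2.5-pro": 1.25,
    "gpt-4.1": 2.00,
}

def input_cost(model: str, input_tokens: int) -> float:
    """Input cost in USD for a given number of prompt tokens."""
    return PRICE_PER_M_INPUT_USD[model] * input_tokens / 1_000_000

tokens = 3_000_000  # placeholder: long game transcripts add up quickly
for model in PRICE_PER_M_INPUT_USD:
    print(f"{model}: ${input_cost(model, tokens):.2f} "
          f"for {tokens:,} input tokens")
```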

[Figure: radar chart comparing model performance across several games on a 0-100 scale. In the Game Arena benchmark, Hao AI Lab has already compared current language models on games such as 2048, Tetris, Sokoban, and Candy Crush. | Image: Hao AI Lab]

Since February, the team has been benchmarking language models on a range of games, including Candy Crush, 2048, Sokoban, Tetris, and Super Mario. Of all the titles tested so far, Ace Attorney is likely the game with the most demanding mechanics when it comes to reasoning.



