
Gemini 3 Pro tops new AI reliability benchmark, but hallucination rates remain high

By Sharecaster, November 19, 2025, in Technology



Summary

A new benchmark from Artificial Analysis reveals alarming weaknesses in the factual reliability of large language models. Out of 40 models tested, only four achieved a positive score – with Google’s Gemini 3 Pro clearly in the lead.

Gemini 3 Pro scored 13 points on the new Omniscience Index, which ranges from -100 to 100, substantially ahead of Claude 4.1 Opus (4.8), GPT-5.1, and Grok 4. The high score mainly reflects the model’s strong accuracy. Gemini 3 Pro outperformed Grok 4, the previously most accurate model, by 14 points. A score of 0 means a model answers questions correctly and incorrectly at the same rate. The AA-Omniscience Benchmark measures how reliably AI models retrieve factual knowledge across different subject areas.

Gemini 3 Pro leads by a clear margin, followed by Claude 4.1 Opus, GPT-5.1, and Grok 4. All other models tested fall into the negative range, and even the frontrunners perform only modestly.


According to Artificial Analysis, Gemini 3 Pro’s lead is mainly driven by its increased accuracy – 14 points higher than Grok 4, the prior record-holder. The researchers interpret this as evidence of the model’s large scale since accuracy in the benchmark strongly correlates with model size.

Google’s Gemini 3 Pro shows a significant increase in accuracy compared to Grok 4 and its direct predecessor.

Hallucination rates remain the main weakness

The study found that poor results across the board stem largely from high hallucination rates. Gemini 3 Pro achieved the highest overall accuracy at 53 percent, far ahead of previous leaders like GPT‑5.1 (high) and Grok 4, both at 39 percent. But the model still showed an 88 percent hallucination rate, matching Gemini 2.5 Pro and Gemini 2.5 Flash.


GPT‑5.1 (high) and Grok 4 were also high at 81 and 64 percent respectively, but Gemini 3 Pro went even further. Artificial Analysis concluded that while Gemini 3 Pro demonstrates greater factual coverage, its tendency to give wrong answers rather than admit uncertainty remains unchanged.

Here, hallucination rate refers to the share of wrong answers among all questions a model does not answer correctly (in other words, how often it guesses instead of abstaining), so a high value indicates overconfidence rather than ignorance.
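Under that definition, the rate can be computed from three counts per model. A minimal sketch, assuming the rate is simply wrong answers divided by all non-correct responses (wrong plus abstained), as described above; the function name and the example counts are illustrative, not from the benchmark data:

```python
def hallucination_rate(correct: int, wrong: int, abstained: int) -> float:
    """Share of wrong answers among all questions NOT answered correctly.

    A high value means the model guesses instead of abstaining
    (overconfidence); a low value means it tends to admit uncertainty.
    """
    not_correct = wrong + abstained
    return wrong / not_correct if not_correct else 0.0

# Illustrative numbers only: out of 100 questions, 53 correct,
# 41 wrong, 6 abstentions -> a rate of roughly 0.87 (87 percent).
rate = hallucination_rate(correct=53, wrong=41, abstained=6)
```

Read this way, Gemini 3 Pro's reported 88 percent means that of the questions it failed, it almost always produced a wrong answer rather than declining to answer.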

Claude 4.1 Opus scored 36 percent accuracy with one of the lowest hallucination rates, giving it the top position before Gemini 3 Pro’s release.

Some of the LLMs tested vary greatly in their hallucination rate.

The AA-Omniscience benchmark covers 6,000 questions across 42 economically relevant topics in six domains: business, humanities and social sciences, health, law, software engineering, and science and math. The dataset draws from authoritative academic and industrial sources and was automatically generated by an AI agent.

Treemap of the AA-Omniscience questions across six domains and 42 categories, e.g., medicine 700, engineering 400.
The 6,000 questions are spread across six domains and 42 categories.

A new scoring system that penalizes guessing

Unlike typical benchmarks, the Omniscience Index penalizes wrong answers as much as it rewards correct ones. The researchers argue that current evaluation methods often encourage guessing, which increases hallucination behavior.


In contrast, the new metric rewards restraint. Models receive no points for admitting uncertainty, but they also aren’t penalized. Wrong answers, however, lead to large deductions.
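A scoring rule of this kind can be sketched in a few lines. This is an assumption-laden illustration, not Artificial Analysis's published implementation: it treats the index as correct minus wrong answers over all questions, scaled to the stated -100 to 100 range, with abstentions counting zero.

```python
def omniscience_index(correct: int, wrong: int, abstained: int) -> float:
    """+1 per correct answer, -1 per wrong answer, 0 per abstention,
    scaled to a -100..100 range over all questions."""
    total = correct + wrong + abstained
    return 100.0 * (correct - wrong) / total

# A model that guesses and is right half the time scores 0 ...
coin_flip = omniscience_index(correct=50, wrong=50, abstained=0)   # 0.0
# ... while one that answers fewer questions but is never wrong scores higher.
cautious = omniscience_index(correct=30, wrong=0, abstained=70)    # 30.0
```

This shows why the metric rewards restraint: wrong answers cancel out correct ones, while admitting uncertainty costs nothing. It is also roughly consistent with the reported figures: 53 percent accuracy with an 88 percent hallucination rate on the remaining 47 percent gives about 53 - 0.47 × 88 ≈ 12 points, close to Gemini 3 Pro's reported 13.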

Side-by-side chat interface: the low-knowledge model defines MCP for Supabase incorrectly, while the high-knowledge model correctly explains MCP as the Model Context Protocol.
The Model Context Protocol (MCP) was only introduced by Anthropic at the end of 2024 and is therefore probably rarely found in the training material of large language models.

The results group models into four categories: those with extensive knowledge and high reliability (like Claude 4.1 Opus), those with knowledge but low reliability (like Claude 4.5 Haiku), those with limited knowledge but consistent reliability (like GPT‑5.1), and finally, smaller models lacking both knowledge and reliability, such as OpenAI’s lightweight gpt‑oss.

No domain-specific breakdown was available for Gemini 3 Pro.

Scatter plot of Omniscience accuracy vs. Omniscience Index; models in the green quadrant (e.g., Claude 4.1 Opus) offer both high accuracy and high reliability.
The smallest version of OpenAI’s open-weight release gpt-oss fails on both dimensions.

Older Llama model performs surprisingly well

General intelligence doesn’t necessarily translate into factual reliability. Models like Minimax M2 and gpt‑oss‑120b (high) perform strongly on the broader Artificial Analysis Intelligence Index, which aggregates results from multiple benchmarks, but do poorly on the Omniscience Index due to high hallucination rates.

Conversely, the older Llama‑3.1‑405B scored well on the Omniscience Index even though it typically ranks below newer frontier models in overall evaluations.

No single model demonstrated consistently strong factual reliability across all six domains. Claude 4.1 Opus led in law, software engineering, and the humanities; GPT‑5.1 ranked first in business questions; while Grok 4 performed best in health and science.

Heatmap of normalized Omniscience Index scores across six domains and 24 models (green = best, red = worst).
Of the major commercial models, Google appears to perform the worst across all categories.

According to the study, these domain differences mean that relying solely on overall performance can obscure important gaps.

Bigger doesn’t always mean more reliable

While larger models tend to achieve higher accuracy, they don’t necessarily have lower hallucination rates. Several smaller models – like Nvidia’s Nemotron Nano 9B V2 and Llama Nemotron Super 49B v1.5 – outperformed much larger competitors on the Omniscience Index.

Artificial Analysis confirmed that accuracy strongly correlates with model size, but hallucination rate does not. That explains why Gemini 3 Pro, despite its high accuracy, still hallucinates frequently.

In terms of cost efficiency, Claude 4.5 Haiku stands out with a higher Omniscience score than several far more expensive models like GPT‑5.1 (high) and Kimi K2 Thinking.

The researchers have released 10 percent of the benchmark’s questions as a public dataset to support future research, while the majority remains private to prevent contamination of training data.

A related recent study uncovered structural flaws in existing AI benchmarks, citing vague definitions of key terms like “reasoning,” unrepresentative sampling, and a lack of statistical validation across model comparisons.


