OpenAI disclosed that its latest reasoning AI models—o3 and o4‑mini—hallucinate more frequently than previous reasoning and non‑reasoning systems, with 33% and 48% hallucination rates respectively on the PersonQA benchmark. This rise in fabricated outputs occurs even as these models outperform predecessors in coding, mathematics, and visual comprehension tasks.
Model Hallucination Rates Compared
PersonQA Benchmark Results
- o3: 33% hallucination rate, 59% accuracy
- o4‑mini: 48% hallucination rate, 36% accuracy
- o1: 16% hallucination rate, 47% accuracy
- o3‑mini: 14.8% hallucination rate
Wider Benchmark Observations
Until now, each successive model generation had tended to hallucinate slightly less than its predecessor; GPT‑4.5, for instance, records a 19% hallucination rate on PersonQA, compared with GPT‑4o's 30%.
Investigating the Causes
OpenAI's technical report concedes that "more research is needed" to understand why hallucinations worsen as reasoning models are scaled up.
A leading hypothesis is that reinforcement learning techniques used for o‑series models may amplify unintended behaviors that were partially suppressed in earlier post‑training pipelines.
Implications for Enterprise Use
High hallucination rates erode trust in mission‑critical applications such as legal drafting, financial analysis, and clinical decision support, where factual accuracy is non‑negotiable. Organizations must weigh these models' performance gains against the risk of inaccuracy when integrating them into production workflows.
Potential Mitigations
- Web Search Integration: Models with browsing capabilities (e.g., GPT‑4o with web search) achieve up to 90% accuracy on SimpleQA, suggesting search could reduce hallucinations at the cost of exposing prompts to third‑party services (see the first sketch after this list).
- System‑Level Verification: Implement automated fact‑checking layers or external QA tools to validate model outputs before downstream use (see the second sketch after this list).
- Selective Model Choice: Reserve o3/o4‑mini for tasks where creative reasoning outweighs strict accuracy, and use lower‑hallucination models for critical, data‑sensitive operations (also illustrated in the second sketch).
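As a rough illustration of the first mitigation, the sketch below sends a factual question to a browsing‑capable model via the OpenAI Python SDK's Responses API and its hosted web search tool. The model name, tool type, and helper function are illustrative assumptions rather than a recommended configuration, and tool availability may vary by account.

```python
# Minimal sketch: grounding a factual question with a web-search-enabled model.
# Assumes the OpenAI Python SDK and the hosted web search tool of the Responses
# API; the model and tool names here are illustrative, not prescriptive.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def answer_with_search(question: str) -> str:
    """Ask a browsing-capable model so answers can be grounded in live sources."""
    response = client.responses.create(
        model="gpt-4o",                          # a search-capable model, as cited above
        tools=[{"type": "web_search_preview"}],  # enable the hosted web search tool
        input=question,
    )
    return response.output_text


if __name__ == "__main__":
    print(answer_with_search("What hallucination rate did OpenAI report for o3 on PersonQA?"))
```

Note the trade‑off flagged above: the prompt now leaves your environment and reaches the search provider, which may be unacceptable for sensitive workloads.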
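The second and third mitigations can be combined into a thin routing layer: accuracy‑critical prompts go to a lower‑hallucination model and must pass a verification gate before downstream use, while exploratory work stays on the stronger reasoning model. The sketch below assumes the same Responses API; the model names, verifier prompt, and PASS/FAIL convention are hypothetical placeholders to be replaced with whatever checks your stack supports.

```python
# Minimal sketch: selective model routing plus a second-pass verification gate.
# Model names, the verifier prompt, and the PASS/FAIL convention are assumptions.
from openai import OpenAI

client = OpenAI()

REASONING_MODEL = "o4-mini"  # stronger reasoning, higher measured hallucination rate
FACTUAL_MODEL = "gpt-4o"     # lower-hallucination choice for accuracy-critical work


def verify(answer_text: str) -> bool:
    """Ask an independent model to flag answers containing unsupported claims."""
    review = client.responses.create(
        model=FACTUAL_MODEL,
        input=(
            "Reply with only PASS if every factual claim below is plausible and "
            "internally consistent, otherwise reply FAIL:\n\n" + answer_text
        ),
    )
    return review.output_text.strip().upper().startswith("PASS")


def answer(prompt: str, accuracy_critical: bool = False) -> str:
    """Route by criticality, then gate critical outputs behind verification."""
    model = FACTUAL_MODEL if accuracy_critical else REASONING_MODEL
    draft = client.responses.create(model=model, input=prompt).output_text
    if accuracy_critical and not verify(draft):
        # Do not pass unverified output downstream; escalate to human review instead.
        raise RuntimeError("Verification failed: route this answer to human review.")
    return draft
```

An LLM‑based verifier is itself fallible, so in regulated settings this gate would typically be paired with retrieval against a trusted corpus or human sign‑off rather than used alone.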
Conclusion
While o3 and o4‑mini push the envelope in AI reasoning and multimodal capabilities, their increased hallucination rates present a significant challenge for enterprise adoption. Addressing these inaccuracies—through web‑enabled checks, hybrid model strategies, or enhanced training protocols—will be crucial as organizations seek to leverage reasoning AI without compromising on reliability.