OpenAI disclosed that its latest reasoning AI models—o3 and o4‑mini—hallucinate more frequently than previous reasoning and non‑reasoning systems, with 33% and 48% hallucination rates respectively on the PersonQA benchmark. This rise in fabricated outputs occurs even as these models outperform predecessors in coding, mathematics, and visual comprehension tasks.
Model Hallucination Rates Compared
PersonQA Benchmark Results
- o3: 33% hallucination rate, 59% accuracy
- o4‑mini: 48% hallucination rate, 36% accuracy
- o1: 16% hallucination rate, 47% accuracy
- o3‑mini: 14.8% hallucination rate
Wider Benchmark Observations
Until now, each successive model generation had tended to hallucinate slightly less than its predecessor; GPT‑4.5, for instance, records a 19% hallucination rate on PersonQA, compared with GPT‑4o's 30%.
Investigating the Causes
OpenAI's technical report concedes that "more research is needed" to understand why hallucinations worsen as reasoning models are scaled up.
A leading hypothesis is that reinforcement learning techniques used for o‑series models may amplify unintended behaviors that were partially suppressed in earlier post‑training pipelines.
Implications for Enterprise Use
High hallucination rates erode trust in mission‑critical applications such as legal drafting, financial analysis, and clinical decision support, where factual accuracy is non‑negotiable. Organizations must weigh these models' performance gains against the risk of inaccuracy when integrating them into production workflows.
Potential Mitigations
- Web Search Integration: Models with browsing capabilities (e.g., GPT‑4o with web search) achieve up to 90% accuracy on SimpleQA, suggesting search could reduce hallucinations at the cost of exposing prompts to third‑party services (see the first sketch after this list).
- System‑Level Verification: Implement automated fact‑checking layers or external QA tools to validate model outputs before downstream use (see the second sketch after this list).
- Selective Model Choice: Reserve o3/o4‑mini for tasks where creative reasoning outweighs strict accuracy, and use lower‑hallucination models for critical, data‑sensitive operations (also illustrated in the second sketch).
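As a rough illustration of the first mitigation, the sketch below sends a factual question to a browsing‑capable model via the OpenAI Python SDK's Responses API and its hosted web search tool. The model name, tool type, and helper function are illustrative assumptions rather than a recommended configuration, and tool availability may vary by account.

```python
# Minimal sketch: grounding a factual question with a web-search-enabled model.
# Assumes the OpenAI Python SDK and the hosted web search tool of the Responses
# API; the model and tool names here are illustrative, not prescriptive.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def answer_with_search(question: str) -> str:
    """Ask a browsing-capable model so answers can be grounded in live sources."""
    response = client.responses.create(
        model="gpt-4o",                          # a search-capable model, as cited above
        tools=[{"type": "web_search_preview"}],  # enable the hosted web search tool
        input=question,
    )
    return response.output_text


if __name__ == "__main__":
    print(answer_with_search("What hallucination rate did OpenAI report for o3 on PersonQA?"))
```

Note the trade‑off flagged above: the prompt now leaves your environment and reaches the search provider, which may be unacceptable for sensitive workloads.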
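The second and third mitigations can be combined into a thin routing layer: accuracy‑critical prompts go to a lower‑hallucination model and must pass a verification gate before downstream use, while exploratory work stays on the stronger reasoning model. The sketch below assumes the same Responses API; the model names, verifier prompt, and PASS/FAIL convention are hypothetical placeholders to be replaced with whatever checks your stack supports.

```python
# Minimal sketch: selective model routing plus a second-pass verification gate.
# Model names, the verifier prompt, and the PASS/FAIL convention are assumptions.
from openai import OpenAI

client = OpenAI()

REASONING_MODEL = "o4-mini"  # stronger reasoning, higher measured hallucination rate
FACTUAL_MODEL = "gpt-4o"     # lower-hallucination choice for accuracy-critical work


def verify(answer_text: str) -> bool:
    """Ask an independent model to flag answers containing unsupported claims."""
    review = client.responses.create(
        model=FACTUAL_MODEL,
        input=(
            "Reply with only PASS if every factual claim below is plausible and "
            "internally consistent, otherwise reply FAIL:\n\n" + answer_text
        ),
    )
    return review.output_text.strip().upper().startswith("PASS")


def answer(prompt: str, accuracy_critical: bool = False) -> str:
    """Route by criticality, then gate critical outputs behind verification."""
    model = FACTUAL_MODEL if accuracy_critical else REASONING_MODEL
    draft = client.responses.create(model=model, input=prompt).output_text
    if accuracy_critical and not verify(draft):
        # Do not pass unverified output downstream; escalate to human review instead.
        raise RuntimeError("Verification failed: route this answer to human review.")
    return draft
```

An LLM‑based verifier is itself fallible, so in regulated settings this gate would typically be paired with retrieval against a trusted corpus or human sign‑off rather than used alone.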
Conclusion
While o3 and o4‑mini push the envelope in AI reasoning and multimodal capabilities, their increased hallucination rates present a significant challenge for enterprise adoption. Addressing these inaccuracies—through web‑enabled checks, hybrid model strategies, or enhanced training protocols—will be crucial as organizations seek to leverage reasoning AI without compromising on reliability.