AI models confidently describe images they never saw, and benchmarks fail to catch it
AI models misrepresent visual competence in evaluations.

A recent Stanford study reveals that multimodal AI models, including GPT-5 and Gemini 3 Pro, can generate detailed descriptions and diagnoses without any actual image input, achieving 70-80% of their benchmark scores based on text alone. This phenomenon, termed the 'mirage effect,' raises concerns about the reliability of these models in critical applications, particularly in healthcare, where they may fabricate severe medical diagnoses without visual evidence.
Key Takeaways
1. Multimodal AI models like GPT-5 and Gemini 3 Pro achieve 70-80% of their benchmark scores without any image input.
2. In medical evaluations, fabricated diagnoses often skew toward severe conditions such as ST-elevation myocardial infarction.
3. A text-only model outperformed multimodal models and human radiologists in medical image analysis.
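The core measurement behind the first takeaway is a text-only ablation: run the same benchmark questions with and without the image, then compare scores. A minimal sketch of that comparison is below; the function name and the example scores are illustrative assumptions, not figures from the study.

```python
# Hypothetical text-only ablation check for a multimodal benchmark.
# All names and numbers here are illustrative, not from the Stanford study.

def blind_score_fraction(score_with_images: float, score_text_only: float) -> float:
    """Fraction of the full multimodal score a model retains
    when the image input is withheld entirely."""
    if score_with_images <= 0:
        raise ValueError("full-input score must be positive")
    return score_text_only / score_with_images

# Example: a model scoring 82% with images and 60% on the same
# questions with images removed retains ~73% of its score blind,
# inside the 70-80% range the study reports.
fraction = blind_score_fraction(0.82, 0.60)
print(f"{fraction:.0%}")  # → 73%
```

A blind-retention fraction near 1.0 suggests the benchmark's questions can largely be answered from text priors alone, so a high overall score says little about visual competence.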