AI-Generated X-Rays Deceive Radiologists and Top-Tier LLMs in New Study
Key Takeaways
- A study published in Radiology reveals that synthetic X-ray images created by AI tools like ChatGPT and RoentGen can successfully fool experienced medical professionals and advanced AI models.
- The findings highlight critical vulnerabilities in medical imaging, ranging from legal fraud to systemic cybersecurity risks in healthcare infrastructure.
Key Intelligence
Key Facts
- 17 radiologists from 12 hospitals in 6 countries participated in the study
- Only 41% of radiologists spontaneously identified AI-generated images when uninformed
- Radiologist accuracy rose to 75% after being told the dataset contained synthetic images
- AI detection accuracy for GPT-4o, GPT-5, Gemini 2.5 Pro, and Llama 4 Maverick ranged from 57% to 85%
- GPT-4o did not detect every deepfake, even though it was used to create some of them
| Group | Detection rate | Notes |
|---|---|---|
| Radiologists (Uninformed) | 41% | Spontaneous identification during routine review |
| Radiologists (Informed) | 75% | Actively looking for synthetic images |
| Top-Performing AI (GPT-4o) | 85% | Highest detection rate among tested LLMs |
| Low-Performing AI | 57% | Lower bound of LLM detection capability |
Analysis
The emergence of high-fidelity synthetic media has moved beyond deepfake videos into the sensitive realm of medical diagnostics. A recent study led by Dr. Mickael Tordjman of the Icahn School of Medicine at Mount Sinai demonstrates that AI-generated X-rays can now pass for real patient images, not only to the untrained eye but often to the trained one as well. This isn't just a technical curiosity; it represents a fundamental threat to the integrity of digital medical records and the trust that underpins modern healthcare systems globally.
The study's methodology was rigorous, involving 17 radiologists from 12 hospitals across six countries. They were presented with 264 X-ray images, half of which were generated by AI tools such as ChatGPT and RoentGen. The results were startling: when the radiologists were unaware of the study’s true purpose, only 41 percent spontaneously identified the AI-generated images. Even after being warned that the dataset contained synthetic images, their mean accuracy rose only to 75 percent. This suggests that even when clinicians are actively looking for fakes, roughly one in four of their calls is still wrong, so synthetic images can pass as real and potentially lead to misdiagnosis or unnecessary medical interventions.
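To make the scale concrete, here is a rough back-of-the-envelope sketch in Python. It assumes, purely for illustration, that the 75 percent mean accuracy applies uniformly to the synthetic half of the dataset; the figures reported here do not break accuracy down that way, and the function name is hypothetical.

```python
# Figures quoted in the article.
TOTAL_IMAGES = 264
SYNTHETIC_SHARE = 0.5      # "half of which were generated by AI tools"
INFORMED_ACCURACY = 0.75   # mean accuracy after radiologists were warned


def expected_missed_fakes(total_images: int, synthetic_share: float, accuracy: float) -> float:
    """Expected number of synthetic images judged real.

    ASSUMPTION: the overall accuracy is applied as if it were the
    detection rate on synthetic images alone. This is an illustration,
    not a result reported by the study.
    """
    synthetic_images = total_images * synthetic_share
    return synthetic_images * (1 - accuracy)


synthetic_count = int(TOTAL_IMAGES * SYNTHETIC_SHARE)
missed = expected_missed_fakes(TOTAL_IMAGES, SYNTHETIC_SHARE, INFORMED_ACCURACY)

print(f"Synthetic images in the set: {synthetic_count}")
print(f"Expected to pass as real under the assumption: {missed:.0f}")
```

Under that assumption, about 33 of the 132 synthetic images would slip past a reader who already knows fakes are present.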
Perhaps more concerning is the performance of the AI models themselves. The study tested four of the most advanced large language models currently available: OpenAI’s GPT-4o and GPT-5, Google’s Gemini 2.5 Pro, and Meta’s Llama 4 Maverick. Their accuracy in detecting the fake images ranged from 57 to 85 percent. In a particularly ironic twist, GPT-4o, the very model used to create some of the deepfakes, did not catch every one of them. While it outperformed the other models, its inability to reliably identify its own output underscores a significant black-box problem in AI development: we are creating tools that can generate content with a level of sophistication that exceeds our current ability to verify or audit it.
What to Watch
The implications of this development are multifaceted and high-stakes. From a legal perspective, the ability to fabricate a fracture or a tumor that is indistinguishable from a real one creates a massive vulnerability for fraudulent litigation and insurance claims. In a clinical setting, if a hospital’s network were compromised, hackers could inject synthetic images into patient records. This could lead to unnecessary surgeries, delayed treatments, or a complete breakdown of trust in digital medical records. Dr. Tordjman’s warning that we are seeing the tip of the iceberg suggests that as these generative models become more accessible and powerful, the potential for clinical chaos grows exponentially.
To mitigate these risks, the research community is calling for robust digital safeguards. One proposed solution is the implementation of invisible watermarks embedded directly into the metadata or pixel structure of medical images at the point of capture. This would create a verifiable chain of custody for every X-ray, MRI, and CT scan. However, as the detection accuracy of even the most advanced LLMs shows, the arms race between synthetic generation and forensic detection is only just beginning. The medical community must now treat the integrity of imaging data with the same level of security and scrutiny as patient privacy and financial records. Future research will likely focus on developing specialized detection models that can outperform general-purpose LLMs in identifying subtle synthetic artifacts.
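The chain-of-custody idea can be illustrated with a minimal Python sketch. Note that it swaps in a simpler technique than the invisible watermark described above: a keyed signature (HMAC-SHA256) computed over the pixel data and metadata at capture time and stored alongside the image. The function names, the metadata string, and the demo key are all hypothetical, and a real deployment would need proper key management on the imaging device.

```python
import hashlib
import hmac


def sign_image(pixel_bytes: bytes, metadata: str, key: bytes) -> str:
    """Return a hex tag binding the pixel data and metadata to a secret key."""
    # Hash the pixels first so the signed message has a fixed-size prefix.
    message = hashlib.sha256(pixel_bytes).digest() + metadata.encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify_image(pixel_bytes: bytes, metadata: str, key: bytes, tag: str) -> bool:
    """Check that the image and metadata still match the tag issued at capture."""
    expected = sign_image(pixel_bytes, metadata, key)
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(expected, tag)


# Hypothetical demo values; none of these come from the study.
demo_key = b"demo-key-not-for-production"
original_pixels = bytes(range(256))
demo_metadata = "modality=XR;study=example"

capture_tag = sign_image(original_pixels, demo_metadata, demo_key)

# Simulate an attacker altering a single byte of the stored image.
tampered_pixels = original_pixels[:-1] + b"\x00"

print(verify_image(original_pixels, demo_metadata, demo_key, capture_tag))
print(verify_image(tampered_pixels, demo_metadata, demo_key, capture_tag))
```

A signature like this detects substitution or modification after capture, but unlike a pixel-level watermark it does not survive if the tag is stripped and the image is re-exported, which is one reason both approaches are discussed together.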