In recent years, artificial intelligence (AI) has begun to make its way into traditionally complex fields such as histopathology. In particular, Large Language Models (LLMs) like ChatGPT are increasingly being used to assist doctors in synthesizing clinical information, drafting reports, and supporting diagnoses. But to what extent can one rely on an algorithm when it comes to health and clinical decisions?
A recent study published in Virchows Archiv, titled Unveiling the Risks of ChatGPT in Diagnostic Surgical Pathology, sought to answer this question by systematically analyzing the strengths and limitations of ChatGPT 3.5 in anatomical pathology—the branch of medicine that formulates diagnoses based on human tissues, a crucial step in guiding treatment, especially for neoplastic diseases.
The study was led by Salvatore Lorenzo Renne, Associate Professor at Humanitas University, and Silvia Uccella, Full Professor, Director of the School of Specialization in Pathology at Humanitas University, and Head of the Pathology Unit at IRCCS Istituto Clinico Humanitas, and was carried out by Vincenzo Guastafierro, a pathology resident at the same institution. The publication earned Dr. Guastafierro the Anzalone Award, presented on October 18 by the Ordine dei Medici Chirurghi e degli Odontoiatri della Provincia di Milano (Order of Physicians and Dentists of the Province of Milan).
The study: how it was conducted
The authors described 50 histopathological clinical cases covering ten different areas of pathology, each providing all the information necessary for a diagnosis. They then asked ChatGPT to propose a diagnosis for each case, as a pathologist would.
Specifically, each scenario was presented to ChatGPT in several ways—sometimes asking it to evaluate proposed diagnostic hypotheses, and other times asking for the most likely ones without restriction. Each version was also tested both with and without a request for scientific references to justify the answers. The responses were then assessed by six expert pathologists, who evaluated their usefulness and accuracy, noting any errors.
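To make the protocol concrete, here is a minimal sketch of how such an experiment could be scripted against the OpenAI chat API. The case text, prompt wording, and variant names below are hypothetical illustrations, not the authors' actual materials; only the overall structure, two prompting styles each run with and without a request for references, follows the study's description.

```python
# Minimal sketch of the prompt-variant protocol described above.
# All case text and prompt wording here are hypothetical; the study's
# actual prompts and cases are in the Virchows Archiv paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CASE = "62-year-old woman; breast core biopsy showing ..."  # hypothetical case summary

VARIANTS = {
    # Variant 1: the model evaluates diagnostic hypotheses proposed to it.
    "ranked": "Given these diagnostic hypotheses (A, B, C), which is most likely and why?",
    # Variant 2: open-ended, no candidate diagnoses supplied.
    "open": "What are the most likely diagnoses for this case?",
}

def ask(case: str, question: str, with_references: bool) -> str:
    """Send one scenario to the model, optionally requesting citations."""
    prompt = f"{case}\n\n{question}"
    if with_references:
        prompt += "\nCite the scientific references that support your answer."
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # the ChatGPT 3.5 family evaluated in the study
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Each case is run under every variant, with and without a reference request,
# producing the transcripts to be graded.
for name, question in VARIANTS.items():
    for refs in (False, True):
        answer = ask(CASE, question, with_references=refs)
        print(f"--- {name}, references={refs} ---\n{answer}\n")
```

In the study itself the grading step was of course human: the transcripts were reviewed by six expert pathologists rather than scored automatically.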
Overall, ChatGPT provided responses judged useful in about 62% of cases, demonstrating a good ability to handle technical language and construct coherent reasoning. However, only about one-third of the responses were completely error-free, confirming that while the model is linguistically competent, it does not yet guarantee consistent diagnostic accuracy.
Regarding bibliographic citations, approximately 70% were correct, while the remainder were either inaccurate (12.1%) or outright fabricated (17.8%), a phenomenon known as hallucination.
According to the authors, these results do not represent a failure but rather a starting point to understand how AI tools can be integrated into medical practice. ChatGPT shows interesting potential as support for diagnostic reasoning, but its use requires expert supervision and awareness of the model’s limitations.
Why these errors occur
Unlike a physician, a language model does not truly understand what it processes. ChatGPT generates text based on linguistic patterns and statistical probabilities drawn from large datasets, not from verified knowledge or causal reasoning. Moreover, it often fails to distinguish between up-to-date and outdated sources.
When confronted with complex cases—such as distinguishing between two similar tumors—the model tends to produce responses that are coherent in form and context but sometimes inaccurate in content. Likewise, citation errors occur because the system may combine journal names, authors, and titles in plausible but fictitious ways.
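The citation problem can be illustrated with a deliberately crude toy model. The sketch below assumes nothing about the study and uses invented fragments: it builds references by sampling each field independently, roughly the way a model that has learned frequent fragments, but not their true pairings, can emit a fluent citation that exists nowhere.

```python
# Toy illustration of citation "hallucination": sampling each fragment of a
# reference independently, by frequency, yields fluent but fictitious output.
# All names and titles below are invented for the example.
import random

random.seed(42)  # reproducible output

# Fragments a model might have seen often in its training data (invented here).
AUTHORS  = ["Rossi A", "Smith J", "Tanaka K"]
TITLES   = ["Diagnostic pitfalls in spindle cell lesions",
            "Grading of neuroendocrine neoplasms revisited"]
JOURNALS = ["Virchows Arch", "Am J Surg Pathol", "Histopathology"]
YEARS    = [2018, 2020, 2021]

def hallucinated_reference() -> str:
    """Each field is locally plausible, but the combination exists nowhere."""
    return (f"{random.choice(AUTHORS)}. {random.choice(TITLES)}. "
            f"{random.choice(JOURNALS)}. {random.choice(YEARS)}.")

for _ in range(3):
    print(hallucinated_reference())
```

Real LLMs are vastly more sophisticated than this, but the failure mode is analogous: each piece of the output is statistically plausible on its own, while the whole is never checked against reality.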
The authors emphasize that although ChatGPT demonstrates potential, it must be used cautiously and is not yet suitable for routine diagnostic practice. They suggest that, in the future, specialized LLMs could serve as supportive assistants in complex contexts such as histologic diagnosis. The high rate of inaccurate and fabricated references also means that students should be careful when using ChatGPT as a learning tool.
In conclusion, the results confirm that ChatGPT cannot yet replace human judgment in histopathologic diagnosis, underscoring the indispensable role of the pathologist. The ultimate responsibility for diagnosis remains firmly in human hands.
A tool, not a substitute
The publication underscores a key concept: AI, however sophisticated, cannot replace the study, reasoning, and judgment of a physician. Pathology is not limited to textual analysis—it also requires visual interpretation of samples, integration of clinical data, and collaboration within the medical team, all of which depend on the irreplaceable contribution of the pathologist.
The real challenge is not to replace the human being, but to build a safe and transparent collaboration between humans and machines. Only in this way can technology become a reliable ally of medicine—and not a hidden risk behind perfectly formulated language.