GazeVal

Abstract

In medical imaging, the demand for high-quality synthetic data for model training and augmentation has never been greater. However, current evaluations rely predominantly on computational metrics that do not align with how human experts judge images. As a result, synthetic images may appear realistic numerically yet lack clinical authenticity, which makes it hard to ensure the reliability and effectiveness of AI-driven medical tools. To address this gap, we introduce GazeVal, a practical framework that combines expert eye-tracking data with direct radiological evaluations to assess the quality of synthetic medical images. GazeVal uses radiologists' gaze patterns because they provide a deeper understanding of how experts perceive and interact with synthetic data in different tasks (i.e., diagnostic or Turing tests). Experiments with sixteen radiologists revealed that 96.6% of the images generated by the most recent state-of-the-art AI algorithm were identified as fake, demonstrating the limitations of generative AI in producing clinically accurate images.

Methods

Given a real chest X-ray image and its associated report, RoentGen, a generative model based on a latent diffusion model (LDM), generates a synthetic chest X-ray image from the report. The synthetic images are then mixed into a dataset with real images and reviewed by radiologists under two task settings. In the first, radiologists are asked to provide a diagnosis without knowing that synthetic images are included. In the second, they are asked to determine whether each image is real or generated. Throughout both tasks, an eye tracker records their gaze. Using the gaze data and the task answers, we compare synthetic and real X-rays across various metrics to evaluate the quality of the synthetic images.
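The scoring of the second task (real or generated) can be illustrated with a small sketch. This is not the authors' analysis code: the `Response` record, its field names, and the `turing_test_summary` helper are hypothetical, and the toy responses are invented purely to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class Response:
    """One radiologist judgment in the real-vs-generated task (hypothetical schema)."""
    reader_id: str
    image_id: str
    is_synthetic: bool       # ground truth: was the image generated?
    judged_synthetic: bool   # the radiologist's answer


def turing_test_summary(responses):
    """Summarize how often synthetic images were flagged and how often real ones were."""
    synth = [r for r in responses if r.is_synthetic]
    real = [r for r in responses if not r.is_synthetic]
    if not synth or not real:
        raise ValueError("need at least one synthetic and one real response")

    detected = sum(r.judged_synthetic for r in synth)      # synthetic called fake
    false_alarms = sum(r.judged_synthetic for r in real)   # real called fake
    correct = detected + (len(real) - false_alarms)

    return {
        "synthetic_detection_rate": detected / len(synth),
        "real_false_alarm_rate": false_alarms / len(real),
        "accuracy": correct / len(responses),
    }


# Toy data, invented for illustration only.
responses = [
    Response("r1", "img_a", True, True),
    Response("r1", "img_b", True, True),
    Response("r1", "img_c", False, False),
    Response("r1", "img_d", False, True),
    Response("r2", "img_a", True, False),
    Response("r2", "img_c", False, False),
]
summary = turing_test_summary(responses)
print(summary)
```

Reporting the false-alarm rate on real images alongside the detection rate on synthetic images helps separate genuine discrimination from a reader who simply labels most images as generated.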

Left: Eye-tracking setup. The radiologist views the image on the monitor, with the eye tracker positioned between the radiologist and the monitor. Middle: EyeLink 1000 Plus eye-tracker view with calibration software. Right: Eye tracker and example attention map.
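An attention map like the one in the right panel is commonly built by placing a Gaussian at each fixation and weighting it by fixation duration. The sketch below follows that general recipe and is an assumption, not the procedure used in this work: the `attention_map` function, the `(x, y, duration_ms)` fixation format, and the `sigma` value are all hypothetical.

```python
import math


def attention_map(fixations, width, height, sigma=2.0):
    """Build a duration-weighted Gaussian attention map.

    fixations: iterable of (x, y, duration_ms) in pixel coordinates (assumed format).
    Returns a height x width list of lists whose values sum to 1
    (or all zeros if there are no fixations).
    """
    grid = [[0.0] * width for _ in range(height)]
    for fx, fy, duration in fixations:
        for y in range(height):
            for x in range(width):
                dist_sq = (x - fx) ** 2 + (y - fy) ** 2
                # Longer fixations contribute more attention mass.
                grid[y][x] += duration * math.exp(-dist_sq / (2.0 * sigma ** 2))

    total = sum(sum(row) for row in grid)
    if total == 0:
        return grid
    return [[value / total for value in row] for row in grid]


# Toy fixations on a tiny 10 x 8 image, invented for illustration.
fixations = [(2, 2, 300), (7, 5, 100)]
amap = attention_map(fixations, width=10, height=8)

# Locate the most attended pixel as (row, column).
peak = max(
    ((y, x) for y in range(8) for x in range(10)),
    key=lambda pos: amap[pos[0]][pos[1]],
)
print(peak)
```

In practice `sigma` is often chosen to approximate about one degree of visual angle in pixels, which depends on the viewing distance and monitor geometry; the value here is arbitrary.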

Results

(1) Diagnostic agreement is similar for real and synthetic images.

(2) Radiologists can accurately distinguish between real and synthetic images, which means that current state-of-the-art generative models cannot yet generate highly realistic radiological images.

(3) Viewing patterns differ between real and synthetic X-rays, and the differences increase with prolonged visual processing.

(4) Using both Fixation-Based and Scanpath-Based Congruency, as well as shared attention calculations, we quantitatively determine that visual attention is task-guided.

(5) Generative models struggle more to generate realistic X-rays when the images contain pathologies.

BibTeX

@misc{wong2025eyestelltruthgazeval,
      title={Eyes Tell the Truth: GazeVal Highlights Shortcomings of Generative AI in Medical Imaging}, 
      author={David Wong and Bin Wang and Gorkem Durak and Marouane Tliba and Akshay Chaudhari and Aladine Chetouani and Ahmet Enis Cetin and Cagdas Topel and Nicolo Gennaro and Camila Lopes Vendrami and Tugce Agirlar Trabzonlu and Amir Ali Rahsepar and Laetitia Perronne and Matthew Antalek and Onural Ozturk and Gokcan Okur and Andrew C. Gordon and Ayis Pyrros and Frank H. Miller and Amir Borhani and Hatice Savas and Eric Hart and Drew Torigian and Jayaram K. Udupa and Elizabeth Krupinski and Ulas Bagci},
      year={2025},
      eprint={2503.20967},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.20967}, 
}

Eyes Tell the Truth: GazeVal Highlights Shortcomings of Generative AI in Medical Imaging

Overview of the proposed GazeVal framework, which introduces two tasks with corresponding evaluation metrics to quantitatively assess the quality of synthetic chest X-ray images using expert knowledge.
