Physicians and artificial intelligence diverge in evaluating large language models on real clinical cases

Although multimodal large language models (LLMs) show considerable promise in healthcare, their clinical utility is difficult to appraise. Current evaluations of medical LLMs are often limited by sparse input from human experts, narrow specialty scope, and reliance on multiple-choice benchmarks or synthetic vignettes, which can inflate measured performance and obscure how useful the models are in practice. We conducted a multicenter, multidisciplinary study in which more than 400 physicians—spanning seven