Although multimodal large language models (LLMs) show significant potential in healthcare applications, their clinical utility is difficult to appraise. Current evaluations of LLMs for medical assistance are often limited by sparse human expertise, narrow specialty scope, and reliance on multiple-choice benchmarks or synthetic vignettes, which can inflate measured performance and obscure clinical utility. We conducted a multicenter, multidisciplinary study in which more than 400 physicians—spanning seven