![](https://i3.wp.com/static-content.springer.com/image/art%3A10.1186%2Fs12909-024-05534-8/MediaObjects/12909_2024_5534_Fig1_HTML.png?w=1180&resize=1180,1212&ssl=1)
Study design
Focusing on the illnesses and conditions integral to the Japanese National Model Core Curriculum for undergraduate medical education (2022 revised edition) [15] and the primary care training program in Japan [16], illness scripts for 184 diseases were systematically generated using ChatGPT-4. Subsequently, three board-certified general physicians (YY, SU, and FF) assessed whether the generated output reached the level required for graduating medical students. Finally, each illness script was graded on a three-point scale: “A” denotes content that was sufficient for medical students, “B” denotes content that was partially inadequate, and “C” denotes content that was inadequate in multiple aspects.
Large language model environment
The illness scripts were generated on July 25, 2023, using the July 20 version of GPT-4 (OpenAI, San Francisco, California, USA). GPT is a large language model (LLM) developed by OpenAI for natural language processing. Its dynamic response generation is based on probabilities the neural network derives from learned syntactic and semantic relationships in the text [17].
Selecting diseases for illness scripts
Commonly encountered diseases were selected because of their importance for medical students. Because the diseases managed in primary care overlap with those that medical students should learn about, the diseases covered in primary care training in Japan [16] were used as a reference. Among the 205 disease and symptom items representing the 16 areas targeted for appropriate management in primary care [16], 184 were identified as sufficiently relevant for the creation of illness scripts. These diseases are included in the National Model Core Curriculum in Japan for undergraduate medical education (2022 revised edition) [15].
Physicians YY, SU, and FF established the exclusion criteria through collaborative discussions and excluded 21 items with minimal diagnostic contribution or mere symptomatology. Seventeen items (e.g., those associated with palliative care or non-critical symptoms, such as lower back pain) were omitted because they lacked the specificity needed for script creation. Furthermore, four items related to community-acquired pneumonia, herpes encephalitis, herpes infections, and adrenal insufficiency were excluded because they overlapped with the input examples in the prompt. The English names of the 184 selected items were entered into the prompt using the disease names registered in the International Classification of Diseases, 11th Revision (ICD-11) [18] (Supplementary Material).
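The item counts reported above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch; the variable names are illustrative and not taken from the study.

```python
# Counts as reported in the text; variable names are hypothetical.
total_items = 205            # disease and symptom items in primary care training
excluded_nonspecific = 17    # items lacking specificity for script creation
excluded_overlap = 4         # items excluded in connection with the prompt examples

excluded_total = excluded_nonspecific + excluded_overlap
selected_items = total_items - excluded_total

print(excluded_total, selected_items)
```

The two results agree with the 21 excluded items and the 184 selected items stated in the text.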
Content to be entered into ChatGPT-4, program code
The prompts for ChatGPT-4 were carefully engineered to ensure their interpretability by generative AI while succinctly defining the desired outputs [19]. The output items, which reference the proposed elements of illness scripts [2], were determined after discussions facilitated by one board-certified physician (YY) and a fellow of internal medicine (DY). The key elements of the illness scripts specified in the input were pathophysiology, epidemiology, time course, signs and symptoms, diagnosis, and treatment. The character limit per item was set at less than 50 characters, based on findings from prior illness scripts [2] and the general requirement that an average of 20–30 words per English sentence could be generated. Three output examples (community-acquired pneumonia, herpes zoster, and primary adrenal insufficiency) were added after the key elements. The structured prompt for ChatGPT-4 was: [Create an illness script for <disease name>. List the following items in less than 50 characters each: [pathophysiology][epidemiology][time course][Symptoms and Signs][Diagnostics][and treatment]. The following is a reference example of an illness script. Example1), Example2), Example3)] (Fig. 1). This prompt was entered into ChatGPT once for each disease, and the output was evaluated. No additional prompts were entered to request modifications.
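The structure of this prompt can be illustrated with a short template function. This is a minimal sketch, not the authors' code: the function name `build_prompt` and the placeholder example strings are assumptions, and the full text of the three reference examples is not reproduced here.

```python
# Item labels exactly as they appear in the structured prompt.
ITEMS = [
    "pathophysiology",
    "epidemiology",
    "time course",
    "Symptoms and Signs",
    "Diagnostics",
    "and treatment",
]


def build_prompt(disease_name: str, examples: list[str]) -> str:
    """Assemble the structured prompt for one disease (hypothetical helper)."""
    item_list = "".join(f"[{item}]" for item in ITEMS)
    example_list = ", ".join(examples)
    return (
        f"Create an illness script for {disease_name}. "
        f"List the following items in less than 50 characters each: {item_list}. "
        "The following is a reference example of an illness script. "
        f"{example_list}"
    )


# Placeholders stand in for the three reference examples used in the study.
prompt = build_prompt("Asthma", ["Example1)", "Example2)", "Example3)"])
print(prompt)
```

Only the disease name changes between calls, which matches the description of a single fixed prompt entered once per disease.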
Fig. 1 Screenshot of the prompt input
Evaluation
A broader evaluation was conducted by physicians YY, SU, and FF to assess the utility of the generated illness scripts for medical students.
Following a discussion among the three evaluators, the usefulness of an illness script in this study was defined as the level at which each item contained the minimum required information and would not hinder a medical student learning to use illness scripts for the first time. Initially, physician YY screened the output to ensure that it included the essential elements of the illness script: pathophysiology, epidemiology, time course, symptoms and signs, diagnosis, and treatment. Subsequently, the three evaluators rated each illness script, with all output items, on a five-point scale, where 1 denotes “not at all useful, needs overall revision,” and 5 denotes “very useful, no additional modifications needed.” To achieve a structured assessment, each item was evaluated considering the age and mode of onset, typical symptoms, essential diagnostic examinations, standard treatment, and adequacy of the course of treatment. Failure to meet these criteria resulted in a point deduction.
The ratings of the three evaluators were summed, so that each illness script was scored on a 15-point scale. Composite scores were categorized into three levels: 15, 14, and 13 or less, corresponding to “A,” “B,” and “C,” respectively. Accordingly, “A” signifies a script that was sufficiently informative for medical students and required no further modification; “B” signifies a script that was partially sufficient or required minor revision but was acceptable; and “C” signifies a script that was inadequate in several respects and required multiple revisions. Any deficiencies identified in the illness scripts were discussed during the evaluation. Finally, we discussed the reasons for discrepancies in the evaluations and identified the main aspects that were lacking in the illness scripts created by ChatGPT, along with considerations for their educational application.
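The composite scoring rule can be sketched as a small function. This is an illustrative sketch under the thresholds stated above (15 = A, 14 = B, 13 or less = C); the function name `grade_script` and the input validation are assumptions rather than part of the study's procedure.

```python
def grade_script(ratings):
    """Sum three evaluator ratings (1 to 5 each) and map the total to A, B, or C."""
    if len(ratings) != 3:
        raise ValueError("expected one rating from each of the three evaluators")
    if any(r not in range(1, 6) for r in ratings):
        raise ValueError("each rating must be an integer from 1 to 5")

    total = sum(ratings)  # composite score on the 15-point scale
    if total == 15:
        grade = "A"
    elif total == 14:
        grade = "B"
    else:  # 13 or less
        grade = "C"
    return total, grade


print(grade_script([5, 5, 5]))
print(grade_script([5, 5, 4]))
print(grade_script([5, 4, 4]))
```

Under this rule, an “A” requires a rating of 5 from all three evaluators, and a single one-point deduction by one evaluator yields a “B.”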
Ethical considerations
This study did not involve human or animal participants, thereby obviating the need for ethical approval.