Cureus. 2025 Aug;17(8):e90026
Systematic and scoping reviews are essential in palliative care, yet they are time-consuming and resource-intensive. Recent advances in artificial intelligence, particularly large language models (LLMs), have shown promise in improving the efficiency of literature screening, but their feasibility and accuracy in scoping reviews remain unclear. In this study, we aimed to evaluate the feasibility and performance of LLM-assisted citation screening for a scoping review on nonpharmacological interventions for delirium in patients with cancer.

This prospective simulation study assessed the accuracy of three LLMs (GPT-4 Turbo, GPT-4o, and o1; OpenAI, San Francisco, CA, USA) in screening titles and abstracts. The dataset was derived from a previously conducted scoping review. Two reference standards from the conventional human review were used for comparison: the title/abstract screening results (reference standard 1) and the full-text screening results (reference standard 2). The LLMs were prompted with standardized inclusion and exclusion criteria based on the Population, Concept, and Context (PCC) framework, and sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were calculated for each model.

Compared with reference standard 1, the sensitivity and specificity were 0.43 (95% CI, 0.06-0.80) and 0.99 (95% CI, 0.99-1.00) for GPT-4 Turbo, 0.71 (95% CI, 0.38-1.00) and 0.97 (95% CI, 0.96-0.98) for GPT-4o, and 1.00 (95% CI, 1.00-1.00) and 0.91 (95% CI, 0.89-0.92) for o1, respectively. Compared with reference standard 2, the sensitivity and specificity were 1.00 (95% CI, 1.00-1.00) and 0.99 (95% CI, 0.99-1.00) for GPT-4 Turbo, 1.00 (95% CI, 1.00-1.00) and 0.97 (95% CI, 0.96-0.98) for GPT-4o, and 1.00 (95% CI, 1.00-1.00) and 0.90 (95% CI, 0.89-0.92) for o1, respectively. All models demonstrated high NPVs, indicating strong reliability in excluding irrelevant studies; however, PPVs were low across all models, reflecting a high false-positive rate.

Newer LLMs, particularly o1, demonstrated high sensitivity and acceptable specificity, supporting their use as preliminary screening tools in scoping reviews. The high NPVs suggest that LLMs are reliable for ruling out irrelevant citations, thereby streamlining the initial screening phase, whereas the consistently low PPVs raise concerns about increased reviewer burden from false positives and underscore the need for human validation. These findings support the cautious integration of LLMs into literature screening workflows, treating their outputs as supportive tools rather than replacements for expert judgment.
Keywords: artificial intelligence (AI); cancer; large language models; nonpharmacological intervention; scoping review
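The abstract does not reproduce the screening prompt, so the sketch below is a minimal, hypothetical reconstruction of the workflow it describes: each title and abstract is sent to an OpenAI model together with PCC-framed inclusion and exclusion criteria, and the model returns an include/exclude judgment. The prompt wording, the screen_citation helper, and the INCLUDE/EXCLUDE output convention are illustrative assumptions, not the authors' implementation; only the OpenAI chat completions call itself is an existing API.

```python
# Hypothetical sketch of LLM-assisted title/abstract screening with the
# OpenAI Python SDK (openai>=1.0). Prompt text, helper name, and the
# INCLUDE/EXCLUDE convention are assumptions; the paper's actual prompt
# is not reproduced in the abstract.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# PCC-framed criteria condensed from the review question (assumed wording).
SCREENING_PROMPT = """\
You are screening citations for a scoping review.
Population: adult patients with cancer who have or are at risk of delirium.
Concept: nonpharmacological interventions for delirium.
Context: any clinical or palliative care setting.
Given a citation's title and abstract, reply with exactly one word:
INCLUDE if it may meet the criteria, otherwise EXCLUDE."""

def screen_citation(title: str, abstract: str, model: str = "o1") -> bool:
    """Return True if the model flags the citation for full-text review."""
    # Everything goes in a single user message, which keeps the call
    # compatible with o1-series models that restrict system-role messages.
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"{SCREENING_PROMPT}\n\nTitle: {title}\n\nAbstract: {abstract}",
        }],
    )
    return response.choices[0].message.content.strip().upper().startswith("INCLUDE")
```

Under this design the model can only rule citations out; every INCLUDE still goes to human reviewers, which matches the abstract's conclusion that LLM output should support, not replace, expert judgment.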
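The four reported metrics are standard proportions over the confusion matrix of LLM decisions against a reference standard. A minimal sketch of the computation follows, assuming Wald (normal-approximation) 95% CIs clipped to [0, 1]; the abstract does not state which interval method the authors used.

```python
# Sensitivity, specificity, PPV, and NPV from confusion-matrix counts,
# with Wald (normal-approximation) 95% CIs. The CI method is an
# assumption; the abstract does not state which interval the authors used.
import math

def proportion_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float, float]:
    """Point estimate and Wald 95% CI for a proportion k/n, clipped to [0, 1]."""
    p = k / n
    half = z * math.sqrt(p * (1.0 - p) / n)
    return p, max(0.0, p - half), min(1.0, p + half)

def screening_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Diagnostic-accuracy metrics for LLM screening vs. a reference standard."""
    return {
        "sensitivity": proportion_ci(tp, tp + fn),  # eligible studies caught
        "specificity": proportion_ci(tn, tn + fp),  # irrelevant studies excluded
        "ppv": proportion_ci(tp, tp + fp),          # precision of INCLUDE calls
        "npv": proportion_ci(tn, tn + fn),          # reliability of EXCLUDE calls
    }
```

As a plausibility check, proportion_ci(3, 7) yields 0.43 (0.06-0.80) and proportion_ci(5, 7) yields 0.71 (0.38-1.00), matching the sensitivities reported for GPT-4 Turbo and GPT-4o against reference standard 1; this is consistent with, though the abstract does not state, roughly seven eligible records at that stage.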