Stud Health Technol Inform. 2025 May 15;327:904-905.
Large language models (LLMs) offer potential for automating systematic reviews, a labor-intensive process in evidence-based medicine. We evaluated GPT-4o, GPT-4o-mini, and Llama 3.1:8B on abstract screening and risk of bias assessment using 12 Cochrane drug intervention reviews. GPT-4o achieved the best screening performance (recall 0.894, precision 0.492). We propose a one-shot inclusivity adjustment method that allows threshold adjustment without repeated inference. For risk of bias assessment, accuracy varied by domain, from 0.873 for random sequence generation (the highest) to 0.418 for selective reporting (the lowest). These findings show both the practical utility and the current limitations of LLMs in automating systematic reviews.
Keywords: Abstract Screening; Large Language Models; Risk of Bias; Systematic Review
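
The screening metrics reported above (recall and precision) can be illustrated with a short sketch. The abstract does not describe how the inclusivity adjustment works, so the code below is not the authors' method. It only assumes, for illustration, that each abstract receives one inclusion score from a single inference and that the include/exclude decision comes from comparing that score with a threshold. The function names, scores, and labels are all hypothetical.

```python
from typing import List, Sequence, Tuple


def precision_recall(labels: Sequence[bool],
                     predictions: Sequence[bool]) -> Tuple[float, float]:
    """Return (precision, recall) for binary include/exclude decisions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)        # relevant and included
    fp = sum(1 for y, p in pairs if not y and p)    # irrelevant but included
    fn = sum(1 for y, p in pairs if y and not p)    # relevant but excluded
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def screen(scores: Sequence[float], threshold: float) -> List[bool]:
    """Include an abstract when its score reaches the threshold.

    Assumption: one score per abstract, obtained once, so changing the
    threshold needs no further model calls.
    """
    return [s >= threshold for s in scores]


# Hypothetical scores and reference labels (True = relevant), not study data.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [True, True, False, True, False, False]

for threshold in (0.5, 0.35):
    precision, recall = precision_recall(labels, screen(scores, threshold))
    print(f"threshold={threshold}: precision={precision:.3f}, recall={recall:.3f}")
```

In this toy data, lowering the threshold from 0.5 to 0.35 includes one more relevant abstract, so recall rises. Screening for systematic reviews usually favors high recall, because a wrongly excluded study is harder to recover than a wrongly included one.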