Sci Rep. 2025 Jun 03. 15(1): 19379
Systematic literature review (SLR) is an important tool for Health Economics and Outcomes Research (HEOR) evidence synthesis. SLRs involve the identification and selection of pertinent publications and extraction of relevant data elements from full-text articles, which can be a manually intensive procedure. Previously we developed machine learning models to automatically identify relevant publications based on pre-specified inclusion and exclusion criteria. This study investigates the feasibility of applying Natural Language Processing (NLP) approaches to automatically extract data elements from the relevant scientific literature. First, 239 full-text articles were collected and annotated for 12 important variables including study cohort, lab technique, and disease type, for proper SLR summary of Human papillomavirus (HPV) Prevalence, Pneumococcal Epidemiology, and Pneumococcal Economic Burden. The three resulting annotated corpora are shared publicly at [ https://github.com/Merck/NLP-SLR-corpora ], to provide training data and a benchmark baseline for the NLP community to further research this challenging task. We then compared three classic Named Entity Recognition (NER) algorithms, namely Conditional Random Fields (CRF), Long Short-Term Memory (LSTM), and the Bidirectional Encoder Representations from Transformers (BERT) models, to assess performance on the data element extraction task. The annotation corpora contain 4,498, 579, and 252 annotated entity mentions for HPV Prevalence, Pneumococcal Epidemiology, and Pneumococcal Economic Burden tasks respectively. Deep learning algorithms achieved superior performance in recognizing the targeted SLR data elements, compared to conventional machine learning algorithms. LSTM models have achieved 0.890, 0.646 and 0.615 micro-averaged F1 scores for three tasks respectively. CRF models could not provide comparable performance on most of the elements of interest. Although BERT-based models are known to generally achieve superior performance on many NLP tasks, we did not observe improvement in our three tasks. Deep learning algorithms have achieved superior performance compared with machine learning models on multiple SLR data element extraction tasks. LSTM model, in particular, is more preferable for deployment in supporting HEOR SLR data element extraction, due to its better performance, generalizability, and scalability as it's cost-effective in our SLR benchmark datasets.