bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025-04-27
seven papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. Environ Evid. 2025 Apr 23. 14(1): 7
      In this paper we show that OpenAI's Large Language Model (LLM) GPT performs remarkably well when used for title and abstract eligibility screening of scientific articles within a (systematic) literature review workflow. We evaluated GPT on screening data from a systematic review study on electric vehicle charging infrastructure demand with almost 12,000 records, using the same eligibility criteria as human screeners. We tested 3 different versions of this model, which were tasked with distinguishing between relevant and irrelevant content by responding with a relevance probability between 0 and 1. For the latest GPT-4 model (tested in November 2023) and a probability cutoff of 0.5, the recall rate is 100%, meaning no relevant papers were missed, and using this model for screening would have saved 50% of the time that would otherwise be spent on manual screening. Experimenting with a higher cutoff threshold can save more time. With the threshold chosen so that recall is still above 95% for GPT-4 (where up to 5% of relevant papers might be missed), the model could save 75% of the time spent on manual screening. If automation technologies can replicate manual screening by human experts with comparable effectiveness, accuracy, and precision, the work and cost savings are significant. Furthermore, the value of a comprehensive list of relevant literature, available quickly at the start of a research project, is hard to overstate. However, as this study only evaluated the performance on one systematic review and one prompt, we caution that more testing and methodological development are needed, and outline the next steps to properly evaluate the rigor and effectiveness of LLMs for eligibility screening.
    Keywords:  Artificial Intelligence; Large Language Model; Study selection; Systematic maps; Systematic reviews
    DOI:  https://doi.org/10.1186/s13750-025-00360-x
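Editor's sketch: the recall/time-saved trade-off the abstract describes can be expressed in a few lines of Python. The function and example data below are illustrative assumptions, not the authors' code; they only show how a probability cutoff converts model outputs into the two reported metrics.

```python
def screening_stats(probs, labels, cutoff):
    """Evaluate an LLM screening run.

    probs  -- model-assigned relevance probability per record (0 to 1)
    labels -- human eligibility judgment per record (True = relevant)
    Records at or above the cutoff go to manual review; the rest are
    auto-excluded, which is where the time saving comes from.
    Returns (recall, fraction_of_manual_screening_saved)."""
    kept = [(p, y) for p, y in zip(probs, labels) if p >= cutoff]
    relevant_total = sum(labels)
    relevant_kept = sum(y for _, y in kept)
    recall = relevant_kept / relevant_total if relevant_total else 1.0
    time_saved = 1 - len(kept) / len(probs)  # records never screened by hand
    return recall, time_saved
```

Raising the cutoff shrinks the manual-review pile (more time saved) but risks dropping relevant records (lower recall), which is exactly the 100%/50% versus >95%/75% trade-off reported above.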
  2. Comput Struct Biotechnol J. 2025; 29: 138-148
      With the growing use of nanomaterials (NMs), assessing their toxicity has become increasingly important. Among toxicity assessment methods, computational models for predicting nanotoxicity are emerging as alternatives to traditional in vitro and in vivo assays, which involve high costs and ethical concerns. As a result, the qualitative and quantitative importance of data is now widely recognized. However, collecting large, high-quality data is both time-consuming and labor-intensive. Artificial intelligence (AI)-based data extraction techniques hold significant potential for extracting and organizing information from unstructured text. However, the use of large language models (LLMs) and prompt engineering for nanotoxicity data extraction has not been widely studied. In this study, we developed an AI-based automated data extraction pipeline to facilitate efficient data collection. The automation process was implemented using Python-based LangChain. We used 216 nanotoxicity research articles as training data to refine prompts and evaluate LLM performance. Subsequently, the most suitable LLM with refined prompts was used to extract test data from 605 research articles. As a result, data extraction performance on the training data achieved an F1D.E. (F1 score for Data Extraction) ranging from 84.6% to 87.6% across different LLMs. Furthermore, using the dataset extracted from the test set, we constructed automated machine learning (AutoML) models that achieved an F1N.P. (F1 score for Nanotoxicity Prediction) exceeding 86.1% in predicting nanotoxicity. Additionally, we assessed the reliability and applicability of the models by comparing them in terms of ground truth, size, and balance. This study highlights the potential of AI-based data extraction, representing a significant contribution to nanotoxicity research.
    Keywords:  Automated machine learning; Data extraction; LangChain; Large Language Models; Nanotoxicity; Prompt engineering
    DOI:  https://doi.org/10.1016/j.csbj.2025.03.052
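Editor's sketch: an extraction F1 like the F1D.E. above can be computed by comparing LLM-extracted (field, value) pairs against a manual gold annotation. The set-based matching and the example pairs are assumptions for illustration; the paper's exact scoring protocol may differ.

```python
def f1_extraction(extracted, gold):
    """F1 for data extraction, treating each record as a set of
    (field, value) pairs; a pair counts as a true positive only if it
    matches the manual gold annotation exactly."""
    tp = len(extracted & gold)
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

A pipeline would populate the extracted set via LLM prompts (e.g., through LangChain's structured-output tooling) and score it against curator annotations per article.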
  3. Evid Based Toxicol. 2024 Nov 11. 2(1): 2421192
       Background: Systematic review (SR) methods are relied upon to develop transparent, unbiased, and standardized human health chemical assessments. The expectation is that these assessments will have discovered and evaluated all of the available information in a trackable, transparent, and reproducible manner inherent to SR principles. The challenge is that chemical assessment development relies mostly on literature-based data using manual approaches that are not scalable. Various SR tools have increased the efficiency of assessment development by implementing semi-automated approaches (human in the loop) for data discovery (literature search and screening) and enhanced data repositories with standardized data collection and curation frameworks. Yet filling these repositories with data extractions has remained a manual process, and connecting the various tools together in one interoperable workflow remains challenging.
    Objectives: The objective of this protocol is to explore the incorporation of a semi-automated data extraction tool (Dextr) into a chemical assessment workflow and understand whether the new tool improves the overall user experience.
    Methods: The workflow will use template systematic evidence map (SEM) methods developed by the Environmental Protection Agency for the identification of included studies. The methods described focus on the data extraction component of the workflow using a fully manual or a semi-automated (human in the loop) data extraction approach. Both the manual and semi-automated data extractions will occur in Dextr. The new data extraction tool will be evaluated for user experience and whether the data extracted using the automated approach meets or exceeds metrics (precision, recall, and F1 score) for a fully manual data extraction.
    Discussion: Artificial intelligence (AI) and machine learning (ML) methods have rapidly advanced and show promise in achieving operational efficiencies in chemical assessment workflows by supporting automated or semi-automated SR methods, possibly improving the user experience. Yet incorporating advances into sustainable workflows has remained a challenge. Whether using a tool like Dextr improves operational efficiencies and the user experience remains to be determined.
    Keywords:  Artificial intelligence; machine learning; risk assessment; systematic evidence map; systematic review
    DOI:  https://doi.org/10.1080/2833373x.2024.2421192
  4. Syst Rev. 2025 Apr 22. 14(1): 92
      The rise of powerful search engines (e.g., Google) makes searching for gray literature more feasible within the time and resources of a typical systematic review. However, there are no hypothesis-testing studies to guide us on how to conduct such a search. It is our belief that "best practices" for incorporating Google searches might come from a collection of users' experiential evidence, from which some tentative conclusions can be drawn. It is our intention with this communication to relay our experience with Google searches for five projects and the lessons we think we have learned. We invite our systematic review colleagues to contribute their own experiences and thus build up the experiential evidence about when and how to use Google as a search engine to supplement traditional computerized database searches.
    DOI:  https://doi.org/10.1186/s13643-025-02836-w
  5. J Med Internet Res. 2025 Apr 24. 27 e71521
       BACKGROUND: Qualitative research is crucial for understanding the values and beliefs underlying individual experiences, emotions, and behaviors, particularly in social sciences and health care. Traditionally reliant on manual analysis by experienced researchers, this methodology requires significant time and effort. The advent of artificial intelligence (AI) technology, especially large language models such as ChatGPT (OpenAI), holds promise for enhancing qualitative data analysis. However, existing studies have predominantly focused on AI's application to English-language datasets, leaving its applicability to non-English languages, particularly structurally and contextually complex languages such as Japanese, insufficiently explored.
    OBJECTIVE: This study aims to evaluate the feasibility, strengths, and limitations of ChatGPT-4 in analyzing qualitative Japanese interview data by directly comparing its performance with that of experienced human researchers.
    METHODS: A comparative qualitative study was conducted to assess the performance of ChatGPT-4 and human researchers in analyzing transcribed Japanese semistructured interviews. The analysis focused on thematic agreement rates, interpretative depth, and ChatGPT's ability to process culturally nuanced concepts, particularly for descriptive and socio-culturally embedded themes. This study analyzed transcripts from 30 semistructured interviews conducted between February and March 2024 in an urban community hospital (Hospital A) and a rural university hospital (Hospital B) in Japan. Interviews centered on the theme of "sacred moments" and involved health care providers and patients. Transcripts were digitized using NVivo (version 14; Lumivero) and analyzed using ChatGPT-4 with iterative prompts for thematic analysis. The results were compared with a reflexive thematic analysis performed by human researchers. Furthermore, to assess the adaptability and consistency of ChatGPT in qualitative analysis, Charmaz's grounded theory and Pope's five-step framework approach were applied.
    RESULTS: ChatGPT-4 demonstrated high thematic agreement rates (>80%) with human researchers for descriptive themes such as "personal experience of a sacred moment" and "building relationships." However, its performance declined for themes requiring deeper cultural and emotional interpretation, such as "difficult to answer, no experience of sacred moments" and "fate." For these themes, agreement rates were approximately 30%, revealing significant limitations in ChatGPT's ability to process context-dependent linguistic structures and implicit emotional expressions in Japanese.
    CONCLUSIONS: ChatGPT-4 demonstrates potential as an auxiliary tool in qualitative research, particularly for efficiently identifying descriptive themes within Japanese-language datasets. However, its limited capacity to interpret cultural and emotional nuances highlights the continued necessity of human expertise in qualitative analysis. These findings emphasize the complementary role of AI-assisted qualitative research and underscore the importance of further advancements in AI models tailored to non-English linguistic and cultural contexts. Future research should explore strategies to enhance AI's interpretability, expand multilingual training datasets, and assess the applicability of emerging AI models in diverse cultural settings. In addition, ethical and legal considerations in AI-driven qualitative analysis require continued scrutiny.
    Keywords:  ChatGPT; large language models; qualitative research; sacred moment(s); thematic analysis
    DOI:  https://doi.org/10.2196/71521
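Editor's sketch: a thematic agreement rate like the >80% and ~30% figures above can be computed as the share of human-coded theme assignments that the AI analysis also produced. The per-transcript set representation and the example labels are illustrative assumptions, not the study's coding scheme.

```python
def agreement_rate(ai_themes, human_themes):
    """Thematic agreement between AI and human coding.

    Each argument is a list with one set of theme labels per transcript.
    Returns the fraction of human-assigned theme labels that the AI
    analysis also assigned to the same transcript."""
    matched = total = 0
    for ai, human in zip(ai_themes, human_themes):
        matched += len(ai & human)   # themes both coders assigned
        total += len(human)          # human coding as the reference
    return matched / total if total else 0.0
```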
  6. Pharmaceut Med. 2025 Apr 21.
      Pharmacovigilance is the science of collection, detection, and assessment of adverse events associated with pharmaceutical products for the ongoing monitoring and understanding of those products' safety profiles. Part of this process, signal management, encompasses the activities of signal detection, signal validation/confirmation, signal evaluation, and ultimately, final assessment as to whether a safety signal constitutes a new causal adverse drug reaction. Artificial intelligence is a group of technologies including machine learning and natural language processing that are revolutionizing multiple industries through intelligent automation. Here, we present a critical evaluation of studies leveraging artificial intelligence in signal management to characterize the benefits and limitations of the technology, the level of transparency, and our perspective on best practices for the future. To this end, PubMed and Embase were searched cumulatively for terms pertaining to signal management and artificial intelligence, machine learning, or natural language processing. Information pertaining to the artificial intelligence model used, hyperparameter settings, training/testing data, performance, feature analysis, and more was extracted from included articles. Common signal detection methods included k-means, random forest, and gradient boosting machine. Machine learning algorithms generally outperformed traditional frequentist or Bayesian measures of disproportionality per various metrics, showing the potential utility of advanced machine learning technologies in signal detection. In signal validation and evaluation, natural language processing was typically applied. Overall, methodological transparency was mixed and only some studies leveraged "gold standard" publicly available positive and negative control datasets. Innovation in pharmacovigilance signal management is being driven by machine learning and natural language processing models, particularly in signal detection, in part because of high-performing bagging methods such as random forest and gradient boosting machine. These technologies may be well poised to accelerate progress in this field when used transparently and ethically. Future research is needed to assess the applicability of these techniques across various therapeutic areas and drug classes in the broader pharmaceutical industry.
    DOI:  https://doi.org/10.1007/s40290-025-00561-2
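Editor's sketch: one of the traditional frequentist disproportionality measures that the reviewed ML methods are benchmarked against is the proportional reporting ratio (PRR), computed from a 2x2 table of spontaneous reports. The counts below are made-up illustrative numbers.

```python
def prr(a, b, c, d):
    """Proportional reporting ratio for a drug-event pair.

    a -- reports of the drug with the event of interest
    b -- reports of the drug with any other event
    c -- reports of all other drugs with the event
    d -- reports of all other drugs with any other event
    A PRR well above 1 means the event is reported disproportionately
    often for this drug, flagging a potential safety signal."""
    return (a / (a + b)) / (c / (c + d))
```

An ML detector such as a gradient boosting machine replaces this single ratio with a classifier over many report-level features, which is where the reviewed performance gains come from.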
  7. Ann Biomed Eng. 2025 Apr 24.
      DeepSeek, an open-source multimodal Large Language Model (LLM), was launched by the Chinese startup Hangzhou DeepSeek Artificial Intelligence Basic Technology Research Co., Ltd. Despite the lack of advanced artificial intelligence (AI) chips, the performance of its milestone version, "DeepSeek-V3," has set an unprecedented benchmark among LLMs, surpassing existing models. Notably, the opportunity to deploy this model on local systems helps build better-performing "distilled versions" suitable for medical research (hypothesis generation, drafting patient consent forms, biostatistical analysis, etc.) and clinical practice (differential diagnosis from symptom clusters, current guideline-based treatment protocol design, interactive medical training, personalized patient education, etc.). However, privacy and security risks, ethical uncertainties, and diverse global AI regulations hinder its potential for sustainable integration into real-world applications.
    Keywords:  AI regulations; Clinical practice; DeepSeek; Large Language Models; Medical research; Performance; Privacy; Security
    DOI:  https://doi.org/10.1007/s10439-025-03738-7