Powered by OpenAIRE graph
Found an issue? Give us feedback
image/svg+xml art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos Open Access logo, converted into svg, designed by PLoS. This version with transparent background. http://commons.wikimedia.org/wiki/File:Open_Access_logo_PLoS_white.svg art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos http://www.plos.org/ IEEE Accessarrow_drop_down
image/svg+xml art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos Open Access logo, converted into svg, designed by PLoS. This version with transparent background. http://commons.wikimedia.org/wiki/File:Open_Access_logo_PLoS_white.svg art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos http://www.plos.org/
IEEE Access
Article . 2025 . Peer-reviewed
License: CC BY
Data sources: Crossref
image/svg+xml art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos Open Access logo, converted into svg, designed by PLoS. This version with transparent background. http://commons.wikimedia.org/wiki/File:Open_Access_logo_PLoS_white.svg art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos http://www.plos.org/
IEEE Access
Article . 2025
Data sources: DOAJ
DBLP
Article . 2025
Data sources: DBLP
versions View all 3 versions
addClaim

Comparing Generative AI Literature Reviews Versus Human-Led Systematic Literature Reviews: A Case Study on Big Data Research

Authors: Davide Tosi;

Comparing Generative AI Literature Reviews Versus Human-Led Systematic Literature Reviews: A Case Study on Big Data Research

Abstract

Generative Artificial Intelligence (GenAI) and Large Language Models (LLMs) are transforming research methodologies, including Systematic Literature Reviews (SLRs). While traditional, human-led SLRs are labor-intensive, AI-driven approaches promise efficiency and scalability. However, the reliability and accuracy of AI-generated literature reviews remain uncertain. This study investigates the performance of GPT-4-powered Consensus in conducting an SLR on Big Data research, comparing its results with a manually conducted SLR. To evaluate Consensus, we analyzed its ability to detect relevant studies, extract key insights, and synthesize findings. Our human-led SLR identified 32 primary studies (PSs) and 207 related works, whereas Consensus detected 22 PSs, with 16 overlapping with the manual selection and 5 false positives. The AI-selected studies had an average citation count of 202 per study, significantly higher than the 64.4 citations per study in the manual SLR, indicating a possible bias toward highly cited papers. However, none of the 32 PSs selected manually were included in the AI-generated results, highlighting recall and selection accuracy limitations. Key findings reveal that Consensus accelerates literature retrieval but suffers from hallucinations, reference inaccuracies, and limited critical analysis. Specifically, it failed to capture nuanced research challenges and missed important application domains. Precision, recall, and F1 scores of the AI-selected studies were 76.2%, 38.1%, and 50.6%, respectively, demonstrating that while AI retrieves relevant papers with high precision, it lacks comprehensiveness. To mitigate these limitations, we propose a hybrid AI-human SLR framework, where AI enhances search efficiency while human reviewers ensure rigor and validity. While AI can support literature reviews, human oversight remains essential for ensuring accuracy and depth. Future research should assess AI-assisted SLRs across multiple disciplines to validate generalizability and explore domain-specific LLMs for improved performance.

Related Organizations
Keywords

AI-assisted research, big data, generative AI, systematic literature review, large language models, Electrical engineering. Electronics. Nuclear engineering, TK1-9971

  • BIP!
    Impact byBIP!
    selected citations
    These citations are derived from selected sources.
    This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
    1
    popularity
    This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
    Average
    influence
    This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
    Average
    impulse
    This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
    Average
Powered by OpenAIRE graph
Found an issue? Give us feedback
selected citations
These citations are derived from selected sources.
This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
BIP!Citations provided by BIP!
popularity
This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
BIP!Popularity provided by BIP!
influence
This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
BIP!Influence provided by BIP!
impulse
This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
BIP!Impulse provided by BIP!
1
Average
Average
Average
gold