ZENODO
Other literature type . 2025
License: CC BY
Data sources: ZENODO
ZENODO
Other literature type . 2025
License: CC BY
Data sources: Datacite

AI on the Frontline: Evaluating Large Language Models in Real-World Conflict Resolution

Authors: Institute for Integrated Transitions

Abstract

This groundbreaking study, authored by Nathalie Bussemaker and Mark Freeman and published by the Institute for Integrated Transitions (IFIT), reveals that all major large language models (LLMs) are providing dangerous conflict resolution advice without conducting the basic due diligence that any human mediator would consider essential. IFIT tested six leading AI models, including ChatGPT, DeepSeek, and Grok, on three real-world prompt scenarios from Syria, Sudan, and Mexico. Each LLM response, generated on June 26, 2025, was evaluated by two independent five-person teams of IFIT researchers across ten key dimensions based on well-established conflict resolution principles such as due diligence and risk disclosure. Each dimension was scored on a scale of 0 to 10 to assess the quality of the LLM's advice. A senior sounding board of IFIT conflict resolution experts from Afghanistan, Colombia, Mexico, Northern Ireland, Sudan, Syria, the United States, Uganda, Venezuela, and Zimbabwe then reviewed the findings to assess their implications for real-world practice. Out of a possible 100 points, the average score across all six models was only 27. Google Gemini scored highest with 37.8/100, followed by Grok with 32.1/100, ChatGPT with 24.8/100, Mistral with 23.3/100, Claude with 22.3/100, and DeepSeek last with 20.7/100. All of these scores represent a failure to meet minimal professional conflict resolution standards and best practices.
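
As a quick arithmetic check on the figures above, the short Python sketch below recomputes the six-model average and the ranking from the per-model totals quoted in the abstract. The names `reported_scores`, `mean_score`, and `ranking` are illustrative only and do not come from the study; the sketch also does not attempt to reproduce how each model's total was built up from the ten dimensions, since the abstract does not spell out that aggregation.

```python
# Overall scores out of 100, copied from the abstract.
# The dictionary and function names are hypothetical, chosen for illustration.
reported_scores = {
    "Google Gemini": 37.8,
    "Grok": 32.1,
    "ChatGPT": 24.8,
    "Mistral": 23.3,
    "Claude": 22.3,
    "DeepSeek": 20.7,
}


def mean_score(scores):
    """Return the arithmetic mean of the per-model totals."""
    return sum(scores.values()) / len(scores)


def ranking(scores):
    """Return model names ordered from highest to lowest total."""
    return sorted(scores, key=scores.get, reverse=True)


average = mean_score(reported_scores)
print(f"Average across {len(reported_scores)} models: {average:.1f}/100")
print("Ranking:", ", ".join(ranking(reported_scores)))
```

The mean of the six quoted totals comes out at roughly 26.8, which is consistent with the rounded average of 27 points stated in the abstract.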

Keywords

LLM, peace, Syria, IFIT, conflict, large language model, Google Gemini, DeepSeek, Claude, Sudan, ChatGPT, AI, negotiation, peacebuilding, Grok, conflict resolution, Mexico, prompt, Mistral, due diligence

  • Impact by BIP!
    selected citations: 0
    popularity: Average
    influence: Average
    impulse: Average