
The field of natural language processing has made revolutionary progress with the advent of large language models (LLMs). However, the reliability of these models is severely compromised by the hallucination problem, which refers to the generation of inconsistent or unrealistic information. This study presents a comprehensive comparison of the hallucination performance of five current large language models (DeepSeek, ChatGPT 5.1, Grok, Claude Opus 4.5, and Gemini Pro 3.0) across three fields: medicine, law, and up-to-date statistical information. The study uses a dataset of 597 question-answer pairs originally written in Turkish. The answers the models generate to these questions are evaluated with the cosine similarity metric, using embeddings from multilingual sentence transformers. Experimental results show that the DeepSeek model has the lowest hallucination rate, with an average similarity score of 0.7709. In category-based analyses, the models struggled most with medical questions (accuracy score of 0.6817), followed by legal questions (0.7292); up-to-date statistics was the least challenging field (0.8358). This is the first comprehensive study to evaluate the hallucination performance of large language models on a multi-field Turkish-language dataset, and it offers important insights for the use of LLMs in critical areas.
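The evaluation metric described above can be sketched briefly. The function below computes cosine similarity between two embedding vectors; the toy vectors are hypothetical stand-ins, since in the actual study the embeddings would be produced by a multilingual sentence-transformer model (the specific checkpoint is not named here).

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical stand-ins for sentence embeddings. In practice, a reference
# answer and a model-generated answer would each be encoded by a multilingual
# sentence transformer, and their similarity scored as below.
reference_embedding = np.array([0.21, 0.70, 0.09])
generated_embedding = np.array([0.25, 0.64, 0.05])

score = cosine_similarity(reference_embedding, generated_embedding)
```

A lower average score across a model's answers would indicate larger semantic drift from the reference answers, which is how the study operationalizes hallucination.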
Keywords: Hallucination Analysis, Large Language Models, Turkish Natural Language Processing, Semantic Similarity Analysis
