
SOCRATES-300K: Large-Scale Hallucination Detection Dataset for Language Models

SOCRATES-300K is a dataset of 297,795 model responses from 10 open-source language models, with verified hallucination labels and pre-computed embeddings, intended for comparing hallucination detection methods. It accompanies the paper "A Geometric Analysis of Small-sized Language Model Hallucinations" (Ricco, Onofri, Cima, Cresci, Di Pietro).

Dataset description

The dataset contains 297,795 text responses generated from 200 factual prompts across 10 language models. For each of the 200 prompts, 150 responses are generated per model, for 300,000 generated responses in total (200 × 150 × 10); 297,795 of them remain after the filtering described below.

The models used to generate the dataset are:

- mistralai/Mistral-7B-v0.1
- google/gemma-2-9b
- google/gemma-2-27b
- Upstage/SOLAR-10.7B-v1.0
- microsoft/phi-4
- Qwen/Qwen2.5-14B
- Qwen/Qwen2.5-32B
- deepseek-ai/deepseek-llm-7b-base
- meta-llama/Llama-3.1-8B
- 01-ai/Yi-1.5-9B

All responses are labelled with the Claude Sonnet 4.5 API in an LLM-as-a-judge setup, so the same evaluation procedure is applied to every response. Hallucination labels are binary: 1 indicates that the response contains factual inaccuracies (hallucination), and 0 indicates that the response is factually accurate (genuine). The full collection consisted of 300,000 responses, of which 2,205 were tagged 2, indicating that the model does not know the answer. These responses have been removed from this version.

Each response is represented in the dataset in two forms: the original raw text and a stemmed version produced with the Porter stemmer. The raw and stemmed responses are embedded independently with the all-MiniLM-L6-v2 embedding model, which produces a 384-dimensional dense vector for each.

File format and loading

The dataset is provided in Apache Parquet format with lossless compression, resulting in a file size of approximately 974 MB.
This format is compatible with major data analysis tools, including pandas, PyArrow, DuckDB, and SQL engines, so the dataset can be used across different analysis platforms.

Loading and working with the dataset requires pandas (data loading and manipulation), pyarrow (Parquet file handling), and numpy (numerical operations). They can be installed with pip:

    pip install pandas pyarrow numpy

To load the dataset in Python:

    import pandas as pd

    df = pd.read_parquet('SOCRATES-300K.parquet')
    print(df.shape)  # (297795, 11)

Dataset schema

The dataset contains 11 columns:

- model_id (int): Numerical identifier of the language model, ranging from 0 to 9, corresponding to the 10 models used.
- prompt_id (int): Unique identifier of each prompt, ranging from 0 to 199 across the 200 factual questions used in the study. The prompts are listed in the file prompts.xlsx with their corresponding prompt_id.
- year (int): Temporal variant of the prompt, taking the value 2020 or 2022 to represent the two time periods considered.
- response_index (int): Sequential index of the response for each prompt, ranging from 0 to 149, since 150 responses are generated per prompt.
- response (string): The complete, unmodified text generated by the language model in response to the prompt.
- hallucination (int): Binary label indicating hallucination status: 1 denotes a hallucinated (factually incorrect) response, 0 denotes a genuine (factually correct) response.
- verification (bool): Boolean flag indicating whether the response has been verified and labelled; all entries are True.
- temperature (int): The sampling temperature used during response generation; all entries are 1 (fixed across the dataset).
- stemmed_response (string): A preprocessed version of the response text, with tokenization, lowercasing, stopword removal, and punctuation removal applied, stemmed with the Porter stemmer.
- response_embeddings (np.array[float]): A 384-dimensional dense embedding of the original response, generated with the all-MiniLM-L6-v2 embedding model.
- stemmed_response_embeddings (np.array[float]): A 384-dimensional dense embedding of the stemmed response text, generated with the same embedding model.

What makes this dataset useful

- Multi-model benchmark: All 200 prompts are issued to every model, enabling fair cross-model comparison of hallucination rates.
- Verified API labels: All responses are labelled with Claude Sonnet 4.5 via the Anthropic API, with consistent verification status (no unlabelled data).
- Pre-computed embeddings: Response embeddings ship with the data, so they do not need to be recomputed and can be used immediately for analysis and evaluation.
- Reproducible experiments: Includes all the data needed to reproduce and evaluate hallucination detection algorithms.

Intended uses

- Structural analysis of hallucinated and non-hallucinated responses
- Training and evaluating hallucination detection classifiers
- Studying hallucination rates across different model architectures

Companion resources

- Paper: A Geometric Analysis of Small-sized Language Model Hallucinations
- Text-generation scripts (the code used to produce this dataset): Socrates-300K

Licensing

Dataset (images and metadata): Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).
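
Example usage

The sketch below illustrates two of the intended uses listed above: comparing hallucination rates across models and evaluating a simple detector on the pre-computed embeddings. It is a minimal illustration and not the method from the paper. The helper names (`hallucination_rate_by_model`, `embedding_matrix`, `nearest_centroid_predict`) are hypothetical, the nearest-centroid classifier is only a stand-in baseline, and the small synthetic DataFrame `toy` replaces the real file so that the snippet runs without the 974 MB download. With the real data, `toy` would be replaced by the DataFrame returned by `pd.read_parquet`.

```python
import numpy as np
import pandas as pd


def hallucination_rate_by_model(df: pd.DataFrame) -> pd.Series:
    """Fraction of responses labelled 1 (hallucinated) for each model_id."""
    return df.groupby("model_id")["hallucination"].mean()


def embedding_matrix(df: pd.DataFrame, column: str = "response_embeddings") -> np.ndarray:
    """Stack the per-row embedding vectors into one (n_rows, dim) float array."""
    return np.stack(df[column].to_numpy()).astype(np.float32)


def nearest_centroid_predict(train_X, train_y, test_X):
    """Label each test vector by cosine similarity to the two class centroids."""
    def normalize(a):
        return a / np.linalg.norm(a, axis=-1, keepdims=True)

    # Row 0 is the centroid of genuine responses, row 1 of hallucinated ones.
    centroids = np.stack([train_X[train_y == label].mean(axis=0) for label in (0, 1)])
    scores = normalize(test_X) @ normalize(centroids).T
    return scores.argmax(axis=1)


# Synthetic stand-in for the real dataset (assumption: same column names and
# 384-dimensional embeddings as described in the schema).
rng = np.random.default_rng(0)
n, dim = 400, 384
labels = rng.integers(0, 2, size=n)
vectors = rng.normal(size=(n, dim)).astype(np.float32)
# Shift the two classes apart along the first axis so the toy data is separable.
vectors[:, 0] += 6.0 * (2 * labels - 1)

toy = pd.DataFrame({
    "model_id": rng.integers(0, 10, size=n),
    "hallucination": labels,
    "response_embeddings": list(vectors),
})

rates = hallucination_rate_by_model(toy)

# Simple positional split: first 300 rows for fitting, last 100 for evaluation.
train, test = toy.iloc[:300], toy.iloc[300:]
predicted = nearest_centroid_predict(
    embedding_matrix(train),
    train["hallucination"].to_numpy(),
    embedding_matrix(test),
)
accuracy = float((predicted == test["hallucination"].to_numpy()).mean())

print(rates)
print(f"toy accuracy: {accuracy:.2f}")
```

On the real data, a split by prompt_id rather than by row position would likely be preferable, since the 150 responses to one prompt are closely related and could otherwise appear in both the training and evaluation sets.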
