
LLM4DS-Benchmark Dataset Description

The LLM4DS-Benchmark dataset is a resource designed to evaluate the performance of Large Language Models (LLMs) on data science coding tasks. It was developed as part of the research presented in the paper "Empirical Benchmarking of Large Language Models for Data Science Coding: Accuracy, Efficiency, and Limitations."

This new version of the dataset includes:
• Prompt templates for the different problem types.
• Problem IDs with associated metadata and reference links.
• Official solution code extracted from StrataScratch demo solutions, along with the corresponding generated code for successful LLM outputs.
• LLM4DS-Execution-Results.xlsx: a comprehensive spreadsheet listing the selected problems and their execution results.
• Similarity scores comparing the platform-provided official solutions with the generated code.

Dataset Contents

1. Prompt Templates (prompt-templates/)
• This folder contains the prompt templates used for the three problem types: algorithm, analytical, and visualization. These templates were used to automatically convert the problems listed in the .json files into the prompt format.

2. Problem Metadata (problems-id/)
• The easy.json, medium.json, and hard.json files organize the selected problems by difficulty and contain metadata for each problem, including:
  • ID: unique identifier for the problem.
  • Link: direct URL to the problem on the StrataScratch platform.
  • Type: problem category (algorithm, analytical, or visualization).
  • Topics: main topics associated with the problem.
• Public problem descriptions: although the problems are publicly available on the StrataScratch platform, we omit the full problem descriptions from this repository. Instead, we provide the problem IDs and direct links to the StrataScratch website, ensuring compliance with their terms of service.

3. Official and Generated Code Solutions (official-and-generated-code/)
• This folder contains the official solution code extracted from StrataScratch demo solutions, along with the corresponding generated code for successful LLM outputs. It is organized as follows:
  • Categories: subfolders for algorithm, analytical, and visualization problems.
  • Difficulty levels: each category contains subfolders for easy, medium, and hard problems.
  • Problem IDs: solutions for individual problems are stored in subfolders named after their problem IDs.
  • File format: solutions are saved as .py files.

4. Similarity Computation (similarity-computation/)
• Compares the official solution code extracted from StrataScratch with the code generated by the LLMs, using similarity metrics (an illustrative comparison sketch follows the Dataset Contents).

5. Execution Results (LLM4DS-Execution-Results.xlsx)
• This Excel file provides a detailed summary of the dataset and the evaluation results. It includes the following sheets:
  - Selected Problems: metadata for the 100 selected problems, including:
    • Topics: main topics covered by each question.
    • Reasoning: why the problem was selected.
    • Company: the company that originally used the problem.
  - Copilot-Results, ChatGPT-Results, Perplexity-Results, and Claude-Results: performance results for each LLM (GitHub Copilot, ChatGPT, Perplexity, and Claude) on the 100 data science problems, including the number of trials and similarity scores.
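As a quick illustration of how the folder layout fits together, the sketch below loads one of the metadata files, locates the official and generated solutions for a problem, and compares them with a simple textual similarity. The folder layout and field names (problems-id/, official-and-generated-code/, ID, Type) follow the description above; the exact JSON key casing, the per-problem file names (official.py, chatgpt.py), and the use of difflib's ratio as a stand-in for the dataset's own similarity metrics are assumptions for illustration only.

```python
# Illustrative sketch only -- the file names inside each problem folder and
# the similarity measure (difflib ratio) are assumptions, not the dataset's own.
import json
import difflib
from pathlib import Path

ROOT = Path(".")  # assumed: repository root containing the dataset folders

# Load the easy-problem metadata (fields per the description: ID, Link, Type, Topics)
with open(ROOT / "problems-id" / "easy.json", encoding="utf-8") as f:
    easy_problems = json.load(f)

problem = easy_problems[0]
category = problem["Type"]      # "algorithm", "analytical", or "visualization"
problem_id = str(problem["ID"])

# Solutions live under official-and-generated-code/<category>/<difficulty>/<problem-id>/
problem_dir = ROOT / "official-and-generated-code" / category / "easy" / problem_id

# Hypothetical file names -- check the actual folder for the real naming scheme
official_code = (problem_dir / "official.py").read_text(encoding="utf-8")
generated_code = (problem_dir / "chatgpt.py").read_text(encoding="utf-8")

# Simple character-level similarity as a placeholder for the dataset's metrics
score = difflib.SequenceMatcher(None, official_code, generated_code).ratio()
print(f"Problem {problem_id} ({category}): similarity = {score:.2f}")
```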
EXTRA:
• By uploading LLM4DS-Execution-Results.xlsx to Google Colab, you can reproduce all analytics results reported in the paper: https://colab.research.google.com/drive/1zmu2DUYkEj5oD5CHIHRT6UOQtZhZsgQW?usp=sharing
• Code for converting StrataScratch problems to prompts using our prompt templates: https://github.com/ABSanthosh/RA-Week-3-work

For further details, refer to the linked paper.
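For readers who prefer to explore the spreadsheet locally rather than in the Colab notebook, a minimal loading sketch with pandas is shown below. The sheet names come from the description above; the column name used in the aggregation ("Success") is hypothetical and should be replaced with the actual header in the file.

```python
# Minimal sketch, assuming pandas and openpyxl are installed.
# Sheet names follow the description above; column names are hypothetical.
import pandas as pd

xlsx = pd.ExcelFile("LLM4DS-Execution-Results.xlsx")
print(xlsx.sheet_names)  # e.g. ["Selected Problems", "ChatGPT-Results", ...]

selected = xlsx.parse("Selected Problems")
print(selected.head())   # Topics, Reasoning, Company, ...

chatgpt = xlsx.parse("ChatGPT-Results")
# "Success" is a hypothetical column name -- replace it with the real header
# before computing the pass rate reported in the paper.
if "Success" in chatgpt.columns:
    print("ChatGPT pass rate:", chatgpt["Success"].mean())
```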
