
This dataset provides scripts, reference data, and evaluation tools for benchmarking ChemGraph, an LLM-based molecular simulation framework. It includes outputs generated by four different language models: GPT-4o-mini, Claude-3.5-haiku, Qwen2.5-14B, and GPT-4o.

**Main files and descriptions**

- `data_from_pubchempy.json`: Structured chemical information obtained from PubChemPy. Serves as the input dataset for each experiment.
- `manual_workflow.json`: A manually constructed reference workflow representing the true tool-call sequences and outputs. Used for benchmarking LLM results.
- `llm_workflow_[...].json`: Tool-use outputs generated by the different LLMs. Includes additional metadata such as model name, timestamps, and system prompt.

**Update history**

- October 7th, 2025: Uploaded the ChemGraph source code release associated with the manuscript.
- October 6th, 2025: Uploaded the plotting data from the manuscript: `evaluation_plot_data.json`.
- August 29th, 2025: Expanded the benchmark from 260 to 360. Reran all evaluations. Added the GPT-4o multi-agent evaluation.
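The file descriptions above imply a simple evaluation pattern: load the reference tool-call sequence from `manual_workflow.json` and score each `llm_workflow_[...].json` against it. The sketch below illustrates one way that comparison could look; the record keys (`tool`, `args`) and the tool names are illustrative assumptions, not the dataset's actual schema.

```python
def tool_call_accuracy(reference_calls, llm_calls):
    """Fraction of reference tool calls matched, in order, by the LLM's calls.

    Compares (tool, args) pairs position by position; missing or
    mismatched calls count against the score.
    """
    if not reference_calls:
        return 1.0
    matches = sum(
        1
        for ref, got in zip(reference_calls, llm_calls)
        if ref["tool"] == got["tool"] and ref.get("args") == got.get("args")
    )
    return matches / len(reference_calls)

# Toy records standing in for entries from manual_workflow.json and an
# llm_workflow_[...].json file (keys and tool names are made up for illustration).
reference = [
    {"tool": "smiles_to_structure", "args": {"smiles": "CCO"}},
    {"tool": "run_simulation", "args": {"method": "xtb"}},
]
llm_output = [
    {"tool": "smiles_to_structure", "args": {"smiles": "CCO"}},
    {"tool": "run_simulation", "args": {"method": "dft"}},  # wrong argument
]

print(tool_call_accuracy(reference, llm_output))  # 0.5
```

In practice the two JSON files would be read with `json.load` and the metadata fields (model name, timestamps) used to group scores per model.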
