ZENODO
Dataset . 2025
License: CC BY NC
Data sources: ZENODO
ZENODO
Dataset . 2025
License: CC BY NC
Data sources: Datacite

Data from: The QCML dataset, Quantum chemistry reference data from 33.5M DFT and 14.7B semi-empirical calculations

Authors: Ganscha, Stefan; Unke, Oliver T.; Ahlin, Daniel; Maennel, Hartmut; Kashubin, Sergii; Mueller, Klaus-Robert

Abstract

Machine learning (ML) methods enable prediction of the properties of chemical structures without computationally expensive ab initio calculations. The quality of such predictions depends on the reference data used to train the model. In this work, we introduce the QCML dataset: a comprehensive dataset for training ML models for quantum chemistry. The QCML dataset systematically covers chemical space with small molecules consisting of up to 8 heavy atoms and includes elements from a large fraction of the periodic table, as well as different electronic states. Starting from chemical graphs, conformer search and normal mode sampling are used to generate both equilibrium and off-equilibrium 3D structures, for which various properties are calculated with semi-empirical methods (14.7 billion entries) and density functional theory (33.5 million entries). The covered properties include energies, forces, multipole moments, and other quantities, e.g. Kohn-Sham matrices. We provide a first demonstration of the utility of our dataset by training ML-based force fields on the data and applying them to run molecular dynamics simulations.

The data is available as a TensorFlow dataset (TFDS) and can be accessed from the publicly available Google Cloud Storage bucket at gs://qcml-datasets/tfds/ (see "Directory structure" below). For information on the different access options (command-line tools, client libraries, etc.), please see https://cloud.google.com/storage/docs/access-public-data.

Storage API: Using the "Directory structure" and "Builder configurations" below, storage API links can be constructed, e.g. https://storage.mtls.cloud.google.com/qcml-datasets/tfds/qcml/dft_metadata/1.0.0/qcml-full.tfrecord-00010-of-00011

gcloud: Our example_usage.py uses the gcloud command-line tool for download.

Web access: Access via the Google Cloud Console is possible for any authenticated Cloud user: https://console.cloud.google.com/storage/browser/qcml-datasets/

Directory structure

gs://qcml-datasets (GCS Bucket)
  tfds (TFDS data directory)
    qcml (TFDS dataset name)
      dft_atomic_numbers (TFDS builder config name)
        1.0.0 (Current version)
          dataset_info.json
          features.json
          qcml-full.tfrecord-X-of-Y (TFDS data shards, see below)
      ...
      dft_positions
      xtb_all

Builder configurations

Format: Builder config name: number of shards (rounded total size)

Semi-empirical calculations:
  xtb_all: 85000 (69 TB)

DFT calculations:
  dft_atomic_numbers: 11 (3 GB)
  dft_d4_atomic_charges: 11 (4 GB)
  dft_d4_c6_coefficients: 11 (4 GB)
  dft_d4_correction: 11 (8 GB)
  dft_d4_energy: 11 (2 GB)
  dft_d4_forces: 11 (7 GB)
  dft_d4_polarizabilities: 11 (4 GB)
  dft_force_field: 11 (18 GB)
  dft_force_field_d4: 110 (24 GB)
  dft_force_field_mbd: 110 (24 GB)
  dft_gfn0_dipole: 11 (3 GB)
  dft_gfn0_eeq_charges: 11 (4 GB)
  dft_gfn0_energy: 11 (2 GB)
  dft_gfn0_forces: 11 (7 GB)
  dft_gfn0_formation_energy: 11 (3 GB)
  dft_gfn0_orbital_energies_a: 11 (8 GB)
  dft_gfn0_orbital_occupations_a: 11 (8 GB)
  dft_gfn0_wiberg_bond_orders: 110 (29 GB)
  dft_gfn2_dipole: 11 (3 GB)
  dft_gfn2_energy: 11 (2 GB)
  dft_gfn2_forces: 11 (7 GB)
  dft_gfn2_formation_energy: 11 (3 GB)
  dft_gfn2_mulliken_charges: 11 (4 GB)
  dft_gfn2_orbital_energies_a: 11 (7 GB)
  dft_gfn2_orbital_occupations_a: 11 (7 GB)
  dft_gfn2_wiberg_bond_orders: 110 (29 GB)
  dft_is_outlier: 11 (2 GB)
  dft_mbd_c6_coefficients: 11 (4 GB)
  dft_mbd_correction: 11 (8 GB)
  dft_mbd_energy: 11 (2 GB)
  dft_mbd_forces: 11 (7 GB)
  dft_mbd_polarizabilities: 11 (4 GB)
  dft_metadata: 11 (11 GB)
  dft_multipole_moments: 11 (8 GB)
  dft_pbe0_core_hamiltonian_matrix: 110000 (30 TB)
  dft_pbe0_density_matrix_a: 110000 (30 TB)
  dft_pbe0_density_matrix_b: 110000 (3 TB)
  dft_pbe0_dipole: 11 (3 GB)
  dft_pbe0_electronic_free_energy: 11 (3 GB)
  dft_pbe0_energy: 11 (2 GB)
  dft_pbe0_forces: 11 (7 GB)
  dft_pbe0_formation_energy: 11 (3 GB)
  dft_pbe0_grid_density_a: 110000 (27 TB)
  dft_pbe0_grid_density_b: 110000 (3 TB)
  dft_pbe0_grid_density_gradient_a: 110000 (81 TB)
  dft_pbe0_grid_density_gradient_b: 110000 (10 TB)
  dft_pbe0_grid_density_laplacian_a: 110000 (27 TB)
  dft_pbe0_grid_density_laplacian_b: 110000 (3 TB)
  dft_pbe0_grid_kinetic_energy_density_a: 110000 (27 TB)
  dft_pbe0_grid_kinetic_energy_density_b: 110000 (3 TB)
  dft_pbe0_grid_points: 110000 (81 TB)
  dft_pbe0_grid_weight: 110000 (27 TB)
  dft_pbe0_guid: 11 (3 GB)
  dft_pbe0_hamiltonian_matrix_a: 110000 (30 TB)
  dft_pbe0_hamiltonian_matrix_b: 110000 (3 TB)
  dft_pbe0_has_equal_a_b_electrons: 11 (3 GB)
  dft_pbe0_hexadecapole: 11 (3 GB)
  dft_pbe0_hirshfeld_charges: 11 (4 GB)
  dft_pbe0_hirshfeld_dipoles: 11 (8 GB)
  dft_pbe0_hirshfeld_quadrupoles: 11 (11 GB)
  dft_pbe0_hirshfeld_spins: 11 (3 GB)
  dft_pbe0_hirshfeld_volume_ratios: 11 (4 GB)
  dft_pbe0_hirshfeld_volumes: 11 (4 GB)
  dft_pbe0_loewdin_charges: 11 (4 GB)
  dft_pbe0_loewdin_spins: 11 (3 GB)
  dft_pbe0_mulliken_charges: 11 (4 GB)
  dft_pbe0_mulliken_spins: 11 (3 GB)
  dft_pbe0_num_scf_iterations: 11 (3 GB)
  dft_pbe0_octupole: 11 (3 GB)
  dft_pbe0_orbital_coefficients_a: 110000 (30 TB)
  dft_pbe0_orbital_coefficients_b: 110000 (3 TB)
  dft_pbe0_orbital_energies_a: 110 (44 GB)
  dft_pbe0_orbital_energies_b: 11 (8 GB)
  dft_pbe0_orbital_occupations_a: 110 (44 GB)
  dft_pbe0_orbital_occupations_b: 11 (8 GB)
  dft_pbe0_overlap_matrix: 110000 (30 TB)
  dft_pbe0_quadrupole: 11 (3 GB)
  dft_pbe0_zero_broadening_corrected_energy: 11 (3 GB)
  dft_population_analysis: 11 (19 GB)
  dft_positions: 11 (7 GB)
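
As an illustration of how the directory structure and builder configurations above combine into storage links, the following Python sketch builds shard file names and URLs from a builder config name and its shard count. It is not part of the dataset's tooling: the helper names (shard_name, shard_url, all_shard_urls) are hypothetical, and the zero-based, five-digit zero-padded shard numbering is an assumption inferred from the single example link given in the abstract (qcml-full.tfrecord-00010-of-00011). How indices are rendered for configs with 100000 or more shards is likewise an assumption and should be checked against the bucket listing.

```python
# Sketch: construct QCML shard URLs from a builder config name and shard count.
# Roots, dataset name, and version are taken from the record text above;
# the helper functions themselves are hypothetical illustrations.

GCS_ROOT = "gs://qcml-datasets/tfds"
HTTPS_ROOT = "https://storage.mtls.cloud.google.com/qcml-datasets/tfds"
DATASET = "qcml"
VERSION = "1.0.0"


def shard_name(index: int, num_shards: int) -> str:
    """Return the file name of one shard.

    Assumes zero-based indices padded to at least five digits, as in the
    example link 'qcml-full.tfrecord-00010-of-00011'.
    """
    if not 0 <= index < num_shards:
        raise ValueError(f"shard index {index} outside [0, {num_shards})")
    return f"qcml-full.tfrecord-{index:05d}-of-{num_shards:05d}"


def shard_url(config: str, index: int, num_shards: int,
              root: str = HTTPS_ROOT) -> str:
    """Join root, dataset name, builder config, version, and shard name."""
    return f"{root}/{DATASET}/{config}/{VERSION}/{shard_name(index, num_shards)}"


def all_shard_urls(config: str, num_shards: int,
                   root: str = GCS_ROOT) -> list[str]:
    """List the URLs of every shard of one builder config."""
    return [shard_url(config, i, num_shards, root) for i in range(num_shards)]


if __name__ == "__main__":
    # dft_metadata is listed above with 11 shards.
    print(shard_url("dft_metadata", 10, 11))
    for url in all_shard_urls("dft_energy_example", 2):  # hypothetical config name
        print(url)
```

Passing root=GCS_ROOT yields gs:// paths suitable for command-line tools, while the default HTTPS root mirrors the storage API link format shown in the abstract. For the multi-terabyte configs, generating the list lazily rather than materializing 110000 strings may be preferable.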
