ZENODO
Software . 2025
Data sources: ZENODO
Data sources: Datacite

Materials Dataset: A-Curated-High-Fidelity-Carbide-Materials-Dataset-HF-CCD-and-Pipeline

Authors: Wu, Chi Hsing; Chen, Kai Siang

Abstract

HF-CCD is a high-fidelity curated carbide materials dataset with machine-learning-ready descriptors, featuring a fully reproducible pipeline from raw data fetching to structural cleaning, descriptor generation, and statistical quality control.

This repository provides HF-CCD, a curated high-fidelity carbide materials dataset derived from Materials Project entries, together with a complete end-to-end processing pipeline including:

- Automated bulk data fetching
- Descriptor generation
- Structure and metadata cleaning
- Quality control (QC) with statistical anomaly detection
- Ready-to-use machine-learning feature tables

⚠️ Important: This repository does not redistribute raw CIF files from Materials Project due to redistribution restrictions. The dataset provided is an independent cleaned, derived dataset, suitable for ML/DS research and curated through our pipeline.

Pipeline Overview

The HF-CCD pipeline consists of four main stages:

1. Data Fetching

- Query the Materials Project (MP) API
- Download metadata (formation energy, band gap, density, space group…)
- Store JSON metadata only (no CIF redistribution)

2. Structure & Materials Cleaning

- Remove incomplete entries
- Check for missing physical quantities
- Normalize chemical formula notation
- Remove duplicated structures
- Enforce physical boundary checks

3. Descriptor Generation

Using advanced_descriptors.py, the pipeline computes:

- Atomic-level descriptors
- Bonding descriptors
- Geometric descriptors
- Coordination and packing metrics
- Density-based descriptors

This produces a machine-learning-ready table.

4. Quality Control (QC)

Using plot_data_quality.py:

- Outlier detection via IQR
- Outlier detection via Isolation Forest
- Distribution analyses (boxplots)
- Global dataset quality summary dashboard

All QC plots are saved to PNG/.

HF-CCD Dataset: Data Processing Pipeline Explanation

1. Source Data: cleaned_materials.csv (7 columns)

This file contains the fundamental material attributes downloaded from the Materials Project database. It includes only high-level metadata, without structural descriptors.

Columns (7):

| Column | Description |
| --- | --- |
| id | Materials Project ID |
| family | Chemical family (e.g., carbide, nitride) |
| formula | Reduced chemical formula |
| cif_file | Structure file name (CIF) |
| band_gap | Electronic band gap (eV) |
| formation_energy | Formation energy per atom (eV/atom) |
| density | Density (g/cm³) |

This is the "raw dataset" before structural feature computation.

2. Structure-Based Feature Generation

The script advanced_descriptors.py reads the CIF files and computes local, structural, and bonding descriptors.

A. Local Environment Features (VoronoiNN)

Extracted using pymatgen.analysis.local_env.VoronoiNN().

| Feature | Meaning |
| --- | --- |
| avg_CN | Average coordination number |
| std_CN | Variation of coordination |
| min_CN | Minimum coordination |
| max_CN | Maximum coordination |

B. Bond-Length Features

Using neighbor search with a 4.0 Å cutoff.

| Feature | Meaning |
| --- | --- |
| min_bond | Shortest neighbor distance |
| mean_bond | Average bond length |
| std_bond | Bond length variation |
| max_bond | Longest neighbor distance |

🧩 These features describe atomic packing and bonding rigidity.

C. Structure Geometry Features

Derived from CIF lattice & space group.

| Feature | Description |
| --- | --- |
| volume_per_atom | Volume normalized by number of atoms |
| n_atoms | Number of atoms in the primitive cell |
| n_elements | Number of unique element types |
| lattice_a, lattice_b, lattice_c | Lattice constants |
| lattice_anisotropy | Std / mean of (a, b, c) |
| spacegroup | International space group number |

3. Final Output: advanced_features.csv (20+ columns → 58 features after expansion)

Example:

    python plot_correlation_heatmap.py --input ..\data\advanced_features.csv --output ..\output\figures\correlation_heatmap.png --style all

This is the file used for:

- Correlation heatmap
- Clustering analysis
- Feature grouping
- ML model training
- QC statistics
- Zenodo dataset

Why does it have so many features?

Because each category expands raw structural information into vectorized descriptors, capturing:

- Atomic coordination environments
- Bond-length distributions
- Lattice geometry
- Symmetry
- Stoichiometric richness

These features are intended to improve ML model performance when predicting material properties.

Usage

Fetch Materials Project data

    python scripts/materials_fetcher.py --output data/materials_metadata.json

Clean dataset

    python scripts/clean_carbon.py --input data/materials_metadata.json --output data/hfccd_clean.csv

Generate descriptors

    python scripts/advanced_descriptors.py --input data/hfccd_clean.csv --output data/hfccd_features.csv

Run QC visualization

    python scripts/plot_data_quality.py --input data/hfccd_features.csv --output PNG/hfccd_qc.png --style all

Citation

If you use HF-CCD in academic work, please cite:

Wu, J.-H. (2025). A Curated High-Fidelity Carbide Materials Dataset (HF-CCD) and Pipeline.
https://orcid.org/0009-0001-3396-6835
https://doi.org/10.5281/zenodo.17851432

Legal Notice

This repository does not include, redistribute, or republish raw CIF files or any protected content from Materials Project. Only derived numerical datasets and descriptors are released. Users must supply their own MP API key to fetch raw structures for personal research use.

License

MIT License: free for academic and commercial use.
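
Appendix sketch: cleaning stage

The cleaning stage (stage 2 of the Pipeline Overview) lists four checks: incomplete entries, formula normalization, duplicate removal, and physical boundary checks. The following is a minimal standard-library sketch of how such a filter could look. It is not the code in clean_carbon.py: the numeric bounds in BOUNDS, the duplicate key (formula plus space group), and the function names are all assumptions made for illustration.

```python
import math

# Hypothetical plausibility bounds; the source does not state the real limits.
BOUNDS = {
    "band_gap": (0.0, 20.0),            # eV
    "formation_energy": (-10.0, 10.0),  # eV/atom
    "density": (0.1, 30.0),             # g/cm^3
}


def is_complete(row):
    """True if every bounded quantity is present and is a finite number."""
    for key in BOUNDS:
        value = row.get(key)
        if not isinstance(value, (int, float)) or not math.isfinite(value):
            return False
    return True


def within_bounds(row):
    """True if every bounded quantity lies inside its plausibility range."""
    return all(lo <= row[key] <= hi for key, (lo, hi) in BOUNDS.items())


def clean(rows):
    """Drop incomplete, out-of-range, and duplicated rows; keep first occurrences."""
    seen = set()
    kept = []
    for row in rows:
        if not is_complete(row) or not within_bounds(row):
            continue
        # Assumed duplicate key: whitespace-normalized formula + space group.
        key = (row["formula"].replace(" ", ""), row.get("spacegroup"))
        if key in seen:
            continue
        seen.add(key)
        kept.append(row)
    return kept
```

Rows are plain dictionaries keyed by the column names of cleaned_materials.csv, so the same function can be fed from csv.DictReader after converting the numeric fields to float.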
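
Appendix sketch: geometry and bond-length features

The geometry table defines lattice_anisotropy as std / mean of (a, b, c) and volume_per_atom as volume divided by atom count, and the bond-length table summarizes neighbor distances within a 4.0 Å cutoff. The sketch below shows those formulas on plain numbers. It assumes a population standard deviation (the source does not say which variant is used) and takes precomputed neighbor distances as input instead of reading a CIF, so it does not reproduce the pymatgen-based logic of advanced_descriptors.py. The function names are hypothetical.

```python
import statistics


def geometry_features(a, b, c, volume, n_atoms):
    """Lattice-level features from cell lengths (angstrom) and cell volume (angstrom^3)."""
    lengths = (a, b, c)
    return {
        "volume_per_atom": volume / n_atoms,
        "n_atoms": n_atoms,
        "lattice_a": a,
        "lattice_b": b,
        "lattice_c": c,
        # Assumption: population standard deviation divided by the mean.
        "lattice_anisotropy": statistics.pstdev(lengths) / statistics.fmean(lengths),
    }


def bond_features(distances, cutoff=4.0):
    """Summarize neighbor distances (angstrom) that fall within the cutoff."""
    kept = [d for d in distances if 0.0 < d <= cutoff]
    if not kept:
        return None  # no neighbors inside the cutoff
    return {
        "min_bond": min(kept),
        "mean_bond": statistics.fmean(kept),
        "std_bond": statistics.pstdev(kept),
        "max_bond": max(kept),
    }
```

For a cubic cell with a = b = c the anisotropy is exactly 0.0, which gives a quick sanity check when validating a generated feature table.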
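
Appendix sketch: IQR outlier detection

The QC stage names IQR-based outlier detection as one of its two anomaly checks. A minimal sketch of that rule for a single feature column follows. The multiplier k = 1.5 is the conventional Tukey fence and is an assumption here, as is the quartile interpolation method; the source does not state which settings plot_data_quality.py uses. The Isolation Forest check is not covered by this sketch.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # Quartiles with linear interpolation that includes the min and max.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return [i for i, v in enumerate(values) if v < lower or v > upper]


if __name__ == "__main__":
    # Made-up numbers for illustration only, not taken from HF-CCD.
    column = [1, 2, 3, 4, 5, 100]
    print(iqr_outliers(column))  # [5]
```

Returning indices instead of values makes it straightforward to flag the corresponding rows of a feature table without dropping them.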
