
A large-scale dataset of experimentally validated lipid contact residues derived from experimentally determined structures in the Protein Data Bank.

100% EXPERIMENTAL LABELS - NO COMPUTATIONAL DATABASE DEPENDENCIES

Dataset Statistics (v2.0.0)
Proteins: 4,704
Total residues: 8,055,325
Contact residues: 80,439
Contact rate: 1.00%
Sequence clusters: 813 (30% identity)
Lipid codes recognized: 117

Train/Validation/Test Splits
Train: 2,578 proteins, 4,907,696 residues
Val: 1,051 proteins, 1,403,838 residues
Test: 1,075 proteins, 1,743,791 residues

Key Features
Labels derived 100% from experimentally resolved lipids in PDB structures
4.0 Angstrom all-atom heavy-atom distance cutoff
4,704 proteins across all membrane protein classes
Cluster-aware splits prevent data leakage
Fully reproducible from public PDB data

GitHub: https://github.com/omagebright/MPLID
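As an illustration of the 4.0 Angstrom heavy-atom distance cutoff listed under Key Features, the following is a minimal sketch of how a residue could be labeled as a lipid contact. It is not taken from the MPLID repository: the function name `contact_residues`, the tuple-based atom representation, and the choice of an inclusive comparison (`<=`) are all assumptions made for this example.

```python
import math

CUTOFF = 4.0  # Angstrom; value taken from the dataset description


def is_heavy(element):
    """Treat every element except hydrogen/deuterium as a heavy atom."""
    return element.upper() not in {"H", "D"}


def contact_residues(protein_atoms, lipid_atoms, cutoff=CUTOFF):
    """Return the set of residue ids with a heavy atom within `cutoff` of a lipid heavy atom.

    protein_atoms: iterable of (residue_id, element, (x, y, z))  # hypothetical layout
    lipid_atoms:   iterable of (element, (x, y, z))              # hypothetical layout
    """
    lipid_xyz = [xyz for element, xyz in lipid_atoms if is_heavy(element)]
    contacts = set()
    for residue_id, element, xyz in protein_atoms:
        # Skip hydrogens and residues that are already labeled.
        if not is_heavy(element) or residue_id in contacts:
            continue
        # Inclusive comparison is an assumption; the source only states the cutoff value.
        if any(math.dist(xyz, other) <= cutoff for other in lipid_xyz):
            contacts.add(residue_id)
    return contacts


# Toy coordinates, invented for illustration only.
protein = [
    (("A", 10), "C", (0.0, 0.0, 0.0)),   # 3.5 A from the lipid carbon
    (("A", 10), "H", (3.0, 0.0, 0.0)),   # hydrogen, ignored
    (("A", 11), "N", (10.0, 0.0, 0.0)),  # 6.5 A from the lipid carbon
    (("A", 12), "H", (3.5, 0.0, 0.0)),   # hydrogen only, ignored
]
lipid = [
    ("C", (3.5, 0.0, 0.0)),
    ("H", (9.5, 0.0, 0.0)),              # lipid hydrogen, ignored
]

labels = contact_residues(protein, lipid)
```

With these toy coordinates only residue `("A", 10)` is labeled as a contact. The brute-force pairwise loop is fine for a sketch; for whole structures a spatial index such as a k-d tree would be the usual choice.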
Funded by São Paulo Research Foundation (FAPESP) grants 2023/02691-2 and 2025/23708-6.
PDB, machine learning, lipid interactions, structural biology, membrane proteins, protein structure, training dataset, crystallography
