
This repository compiles the core resources used to construct the MassNet dataset:

1) FASTA sequence files for each species, used for database searching;
2) Standardized database search workflows based on the FragPipe and Sage engines, for unified processing of raw DDA-MS data and high-confidence peptide identification.

The repository also provides the following data resources and supporting tools for downstream AI tasks:

1) Retention time (RT) prediction task: training and validation datasets constructed from FragPipe and Sage results, along with the corresponding RT prediction model outputs;
2) Peptide-spectrum match (PSM) rescoring task: PSM datasets for training, together with evaluation results;
3) Dataset construction tools: complete code and documentation for generating the above task-specific datasets.

For detailed model training procedures and usage instructions, please refer to the official repositories:

DeepLC: https://github.com/CompOmics/DeepLC
DDA-BERT: https://github.com/guomics-lab/DDA-BERT

All resources provided in this repository enable full reproduction of the core experimental and analytical results reported in the manuscript "MassNet: billion-scale AI-ready mass spectrometry corpus enabling scalable deep Learning in proteomics".
