METL Rosetta datasets

Dataset, 23 Apr 2024
Publisher: Zenodo
Funded by: NSF | S2I2: Institute for Resea..., NSF | Partnership to Advance Th..., NSF | Collaborative Research: M... +3 projects

Authors: Gelman, Sam; D'Costa, Sameer; Romero, Philip; Gitter, Anthony

doi: 10.5281/zenodo.10967412, 10.5281/zenodo.14916528, 10.5281/zenodo.10967413, 10.5281/zenodo.14889655

Abstract

This repository contains the biophysical attributes used to pretrain METL-Local and METL-Global models. We provide raw Rosetta data as well as processed Rosetta datasets that have duplicates, outliers, and NaN values removed. Users of these datasets should cite both METL and Rosetta. The repository also contains packaged conda environment files needed to generate new Rosetta simulation data in the OSPool with our Jupyter notebook.

Raw Rosetta data

Raw Rosetta data comes in the form of SQLite databases in the .db format. There are separate databases for each of the local datasets as well as the global dataset. Note the GB1-IgG binding raw data only contains the binding scores, whereas the processed GB1-IgG binding dataset listed below contains both the binding and standard scores. The processed dataset was created by combining the raw GB1-IgG binding data with the raw GB1 standard data.

Processed Rosetta datasets

Each processed Rosetta dataset has its own directory containing the following:

- The dataset in three formats (.tsv, .db, and .h5 files), all containing the same data
- A list of PDB files corresponding to the variants in the dataset (pdb_fns.txt)
- A splits directory containing train, validation, and test splits we used for pretraining
- Standardization parameters computed on the train set (in the splits directory)

Processed Rosetta datasets can be used directly with the main metl GitHub repository to pretrain models. That repository also contains a small example dataset. Our metl-pub GitHub repository has a mapping from the dataset names to these filenames and instructions for reading the files.

conda environment files

clean_pdb_2025-02-13.tar.gz and metl-sim_2025-02-13.tar.gz are packaged conda environment files from the metl-sim GitHub repository.
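Because the raw data and one copy of each processed dataset are SQLite .db files, a downloaded file can be explored before committing to any particular schema. The following is a minimal sketch using only the Python standard library. The helper names (`list_tables`, `list_columns`, `preview_rows`) and the path `example.db` are hypothetical and introduced here for illustration; the actual table and column names are not stated in this record, so the sketch discovers them rather than assuming them. The metl-pub repository remains the reference for how the authors read these files.

```python
import os
import sqlite3
from contextlib import closing


def _connect(db_path):
    # sqlite3.connect() silently creates a missing file, so check first.
    if not os.path.exists(db_path):
        raise FileNotFoundError(db_path)
    return closing(sqlite3.connect(db_path))


def _quote(identifier):
    # Table names cannot be bound as SQL parameters, so quote them instead.
    return '"' + identifier.replace('"', '""') + '"'


def list_tables(db_path):
    """Return the names of all tables in a SQLite database file."""
    with _connect(db_path) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


def list_columns(db_path, table):
    """Return the column names of one table, in declaration order."""
    with _connect(db_path) as conn:
        rows = conn.execute(f"PRAGMA table_info({_quote(table)})").fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk).
    return [row[1] for row in rows]


def preview_rows(db_path, table, limit=5):
    """Return the first `limit` rows of a table as a list of tuples."""
    with _connect(db_path) as conn:
        return conn.execute(
            f"SELECT * FROM {_quote(table)} LIMIT ?", (limit,)
        ).fetchall()


# Hypothetical usage, assuming a downloaded file named example.db:
#   for table in list_tables("example.db"):
#       print(table, list_columns("example.db", table))
#       print(preview_rows("example.db", table, limit=3))
```

Inspecting the schema first keeps the sketch independent of any particular dataset, and limiting the preview avoids loading a large table into memory.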

Related Organizations

University of Wisconsin–Oshkosh, United States
Morgridge Institute for Research, United States
University of Wisconsin–Madison, United States

Impact by BIP!

- Selected citations: 1
- Popularity: Average
- Influence: Average
- Impulse: Average

Funded by

- NSF | S2I2: Institute for Research and Innovation in Software for High Energy Physics (IRIS-HEP)
- NSF | Partnership to Advance Throughput Computing (PATh)
- NSF | Collaborative Research: MFB: Integrating Deep Learning and High-throughput Experimentation to Rapidly Navigate Protein Fitness Landscapes for Non-native Enzyme Catalysis
- NIH | Data-driven analysis of protein structure, function, and regulation