ESM-2 embeddings for TCR-Epitope Binding Affinity Prediction Task

This is the accompanying dataset that was generated by the GitHub project: https://github.com/tonyreina/tdc-tcr-epitope-antibody-binding. In that repository I show how to create a machine learning models for predicting if a T-cell receptor (TCR) and protein epitope will bind to each other. A model that can predict how well a TCR bindings to an epitope can lead to more effective treatments that use immunotherapy. For example, in anti-cancer therapies it is important for the T-cell receptor to bind to the protein marker in the cancer cell so that the T-cell (actually the T-cell's friends in the immune system) can kill the cancer cell. [HuggingFace](https://huggingface.co/facebook/esm2_t36_3B_UR50D) provides a "one-stop shop" to train and deploy AI models. In this case, we use Facebook's open-source [Evolutionary Scale Model (ESM-2)](https://github.com/facebookresearch/esm). These embeddings turn the protein sequences into a vector of numbers that the computer can use in a mathematical model. To load them into Python use the Pandas library: import pandas as pd train_data = pd.read_pickle("train_data.pkl") validation_data = pd.read_pickle("validation_data.pkl") test_data = pd.read_pickle("test_data.pkl") The epitope_aa and the tcr_full columns are the protein (peptide) sequences for the epitope and the T-cell receptor, respectively. The letters correspond to the standard amino acid codes. The epitope_smi column is the SMILES notation for the chemical structure of the epitope. We won't use this information. Instead, the ESM-1b embedder should be sufficient for the input to our binary classification model. The tcr column is the CDR3 hyperloop. It's the part of the TCR that actually binds (assuming it binds) to the epitope. The label column is whether the two proteins bind. 0 = No. 1 = Yes. The tcr_vector and epitope_vector columns are the bio-embeddings of the TCR and epitope sequences generated by the Facebook ESM-1b model. These two vectors can be used to create a machine learning model that predicts whether the combination will produce a successful protein binding. From the TDC website: T-cells are an integral part of the adaptive immune system, whose survival, proliferation, activation and function are all governed by the interaction of their T-cell receptor (TCR) with immunogenic peptides (epitopes). A large repertoire of T-cell receptors with different specificity is needed to provide protection against a wide range of pathogens. This new task aims to predict the binding affinity given a pair of TCR sequence and epitope sequence. Weber et al. Dataset Description: The dataset is from Weber et al. who assemble a large and diverse data from the VDJ database and ImmuneCODE project. It uses human TCR-beta chain sequences. Since this dataset is highly imbalanced, the authors exclude epitopes with less than 15 associated TCR sequences and downsample to a limit of 400 TCRs per epitope. The dataset contains amino acid sequences either for the entire TCR or only for the hypervariable CDR3 loop. Epitopes are available as amino acid sequences. Since Weber et al. proposed to represent the peptides as SMILES strings (which reformulates the problem to protein-ligand binding prediction) the SMILES strings of the epitopes are also included. 50% negative samples were generated by shuffling the pairs, i.e. associating TCR sequences with epitopes they have not been shown to bind. Task Description: Binary classification. Given the epitope (a peptide, either represented as amino acid sequence or as SMILES) and a T-cell receptor (amino acid sequence, either of the full protein complex or only of the hypervariable CDR3 loop), predict whether the epitope binds to the TCR. Dataset Statistics: 47,182 TCR-Epitope pairs between 192 epitopes and 23,139 TCRs. References: Weber, Anna, Jannis Born, and María Rodriguez Martínez. “TITAN: T-cell receptor specificity prediction with bimodal attention networks.” Bioinformatics 37.Supplement_1 (2021): i237-i244. Bagaev, Dmitry V., et al. “VDJdb in 2019: database extension, new analysis infrastructure and a T-cell receptor motif compendium.” Nucleic Acids Research 48.D1 (2020): D1057-D1062. Dines, Jennifer N., et al. “The immunerace study: A prospective multicohort study of immune response action to covid-19 events with the immunecode™ open access database.” medRxiv (2020). Dataset License: CC BY 4.0. Contributed by: Anna Weber and Jannis Born. The Facebook ESM-2 model has the MIT license and was published in: * Zeming Lin et al, Evolutionary-scale prediction of atomic-level protein structure with a language model, Science (2023). DOI: 10.1126/science.ade2574 https://www.science.org/doi/10.1126/science.ade2574 HuggingFace has several versions of the trained model. Checkpoint name Number of layers Number of parameters esm2_t48_15B_UR50D 48 15B esm2_t36_3B_UR50D 36 3B esm2_t33_650M_UR50D 33 650M esm2_t30_150M_UR50D 30 150M esm2_t12_35M_UR50D 12 35M esm2_t6_8M_UR50D 6 8M

Keywords

FOS: Computer and information sciences, T-cell, Bioinformatics, Machine learning, Transformer model

Impact byBIP!

	selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	0
	popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.	Average
	influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	Average
	impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.	Average

Found an issue? Give us feedback

0

Average

Beta

SDGs Suggest

3. Good health

Beta

SDGs:

3. Good health,

Related to Research communities

Corona Virus Disease

Knowmad Institut

Cancer Research