
Project Overview

This repository contains the foundational reference data and processed datasets associated with MetaTCR, a computational framework designed to standardize T-cell Receptor (TCR) repertoires and mitigate batch effects in Adaptive Immune Receptor Repertoire sequencing (AIRR-seq) data. MetaTCR addresses the challenge of non-biological variation by constructing a population-scale "Referenced TCR Space." This allows raw TCR repertoires to be converted into fixed-dimensional feature profiles (meta-vectors), enabling robust cross-study comparison and integration.

The data provided here allows researchers to reproduce the study's benchmarking results, utilize the pre-trained reference space for new data, and explore the batch correction capabilities of the framework.

Dataset Structure and Contents

The dataset is organized into a main directory named data, which contains four primary subdirectories corresponding to different data types: reference databases, metadata, processed matrices (metaTCR intermediate results), and antigen-specific data.

1. data/database/

This folder contains the core reference files and pre-computed embeddings used for the analysis.

TCR_reference_database.full_legnth.txt: A collection of raw TCR clonotypes assembled from CDR3, TRBV, TRBD, and TRBJ segments. These clonotypes represent a merged and deduplicated set of representative TCRs derived from various datasets.

2. data/metadata/

This folder contains clinical and experimental metadata.

datasets_platform_info.csv: A summary file detailing the sequencing platforms and immune repertoire bioinformatics processing pipeline tags for all PBMC datasets.

Cohort-specific CSV files (e.g., Dewitt2015.csv, Emerson2017.csv, etc.): These files contain study-specific clinical variables and sample metadata corresponding to each cohort.

3. data/processed_data/

This folder contains the intermediate results of the metaTCR pipeline, organized into cluster information and feature matrices.

cluster_centroids/: Contains data related to the clustering of TCR sequences.

  1024_primary_centroids.pk: The coordinates of the cluster centroids (k=1024).

  1024_primary_labels.pk: The assigned labels for the primary clustering.

  centroid_mapping_spectral_k96.pk: The mapping file for spectral clustering or dimensionality reduction (k=96).

primary_metatcr_mtx/: Contains the processed metaTCR matrices for each dataset.

  [StudyName].pk (e.g., Emerson2017-HIP.pk, TRACERx.pk, Snyder2017.pk): These Pickle files store the processed metaTCR matrices for each cohort, representing the quantified TCR features across samples.

4. data/tcr_antigen_data/

This folder contains ground-truth data linking TCR sequences to specific antigens.

McPAS-TCR_filt_ept_full_deduplicated.tsv: A filtered and deduplicated version of the McPAS-TCR database, mapping TCRs to their known epitopes and associated pathologies.

antigen_vj_vdjdb_full.tsv: A comprehensive dataset from VDJdb, containing V/J gene usage and antigen specificity information.
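
Inspecting the files (illustrative sketch)

The internal layout of the .pk and .tsv files is not described above, so the following Python sketch only shows one way to take a first look at them. It assumes the archive has been unpacked into a local directory named data. The helper names (load_pickle, describe, read_table_header, inspect_dataset) are hypothetical and are not part of MetaTCR. The sketch reports each pickled object's type and, where available, its shape or length, plus the header row of each TSV file. Unpickling may also require the libraries that were used to create the objects (possibly NumPy or pandas) to be installed. Because pickle can execute code on load, only open files obtained from a source you trust.

```python
import csv
import pickle
from pathlib import Path


def load_pickle(path):
    """Load one .pk file. Only unpickle files from a trusted source."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def describe(obj):
    """Return a short summary that makes no assumption about the object's layout."""
    summary = {"type": type(obj).__name__}
    shape = getattr(obj, "shape", None)  # e.g. NumPy arrays and pandas DataFrames
    if shape is not None:
        summary["shape"] = tuple(shape)
    elif hasattr(obj, "__len__"):
        summary["length"] = len(obj)
    if isinstance(obj, dict):
        summary["keys"] = list(obj)[:5]  # preview of the first few keys only
    return summary


def read_table_header(path, delimiter="\t"):
    """Return the column names from the first row of a delimited text file."""
    with open(path, newline="", encoding="utf-8") as handle:
        return next(csv.reader(handle, delimiter=delimiter))


def inspect_dataset(data_dir="data"):
    """Summarize whichever .pk and .tsv files exist under the documented folders."""
    root = Path(data_dir)
    report = {}
    processed = root / "processed_data"
    if processed.is_dir():
        # Covers cluster_centroids/ and primary_metatcr_mtx/ recursively.
        for path in sorted(processed.rglob("*.pk")):
            report[path.relative_to(root).as_posix()] = describe(load_pickle(path))
    antigen = root / "tcr_antigen_data"
    if antigen.is_dir():
        for path in sorted(antigen.glob("*.tsv")):
            report[path.relative_to(root).as_posix()] = {
                "columns": read_table_header(path)
            }
    return report
```

Calling inspect_dataset("data") returns a dictionary keyed by relative file path, which can be printed to see what each file contains before writing any analysis code against it.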
