Integrative single-cell analysis of cardiogenesis identifies developmental trajectories and non-coding mutations in congenital heart disease (BPNet deep learning DNA sequence models of scATAC-seq data)

Mohamed Ameen; Laksshman Sundaram; Abhimanyu Banerjee; Mengcheng Shen; Soumya Kundu; Surag Nair; Anna Shcherbina; Mingxia Gu; Kitchener D Wilson; Avyay Varadarajan; Nirmal Vadgama; Akshay Balsubramani; Joseph C Wu; Jesse Engreitz; Kyle Farh; Ioannis Karakikes; Kevin C Wang; Thomas Quertermous; William Greenleaf; Anshul Kundaje

ZENODO

Other ORP type . 2022

License: CC BY

Data sources: Datacite

Other research product . Other ORP type . 01 Jul 2022 . Publisher: Zenodo

doi: 10.5281/zenodo.6789180 , 10.5281/zenodo.6789181

Abstract

==========================
Date: 06/28/2022
Authors: Laksshman Sundaram, Anshul Kundaje
Email: laksshman@gmail.com, akundaje@stanford.edu
==========================

This archive contains deep learning models trained to map DNA sequences to base-resolution pseudo-bulk scATAC-seq profiles from several cell types derived from scATAC-seq profiling of fetal human hearts.

The models are associated with the following preprint/publication:

Integrative single-cell analysis of cardiogenesis identifies developmental trajectories and non-coding mutations in congenital heart disease
Mohamed Ameen, Laksshman Sundaram, Abhimanyu Banerjee, Mengcheng Shen, Soumya Kundu, Surag Nair, Anna Shcherbina, Mingxia Gu, Kitchener D. Wilson, Avyay Varadarajan, Nirmal Vadgama, Akshay Balsubramani, Joseph C. Wu, Jesse Engreitz, Kyle Farh, Ioannis Karakikes, Kevin C Wang, Thomas Quertermous, William Greenleaf, Anshul Kundaje
bioRxiv 2022.06.29.498132; doi: https://doi.org/10.1101/2022.06.29.498132

====================================
Directory structure of this archive
====================================

There is one directory for each cell type. There is also a directory for a global pseudobulk model over all cell types. The file Model_to_cellTypeMapping.txt maps directory names to the precise cell type names and definitions from the paper. The tab-delimited table is reproduced below.

ModelName    CellType_mnemonic    CellType_FullName
ecm          eCM                  Early cardiomyocytes
acm          aCM                  Atrial cardiomyocytes
vcm          vCM                  Ventricular cardiomyocytes
oft          OFT                  Outflow tract
fb1          FB1                  Fibroblast-like cells 1
CFP          CFP                  Cardiac fibroblast progenitors
fb2          FB2                  Fibroblast-like cells 2
preCF        preCF                Pre-cardiac fibroblast
CF           CF                   Cardiac fibroblast
preSMC       preSMC               Pre-smooth muscle cells
smc          SMC                  Smooth muscle cells
pc           PC                   Pericytes
epc          EPC                  Epicardial cells
nc           NC                   Neural crest
Endo1_2      Endo1_2              Endocardium/Endocardium like cells
lec          lEC                  Lymphatic endothelial cells
aec          aEC                  Arterial endothelial cells
cap          Cap                  Capillaries
vec          vEC                  Venous endothelial cells
pseudobulk   Pseudobulk model     Pseudobulk model

For each cell type, models were trained on 5 independent folds (numbered 0-4). There are 3 files for each fold:

- <celltype>.<fold>.arch.json : the model architecture in JSON format
- <celltype>.<fold>.weights.data-00000-of-00001 : the actual model weights
- <celltype>.<fold>.weights.index : a Keras/TensorFlow file containing weights metadata

============================================================================================================================
BPNet deep learning models to predict base-resolution, cell-type resolved pseudo-bulk scATAC-seq profiles from DNA sequence
============================================================================================================================

BPNet is a sequence-to-profile convolutional neural network that uses one-hot-encoded DNA sequence (A=[1,0,0,0], C=[0,1,0,0], G=[0,0,1,0], T=[0,0,0,1]) as input to predict single-nucleotide-resolution read count profiles from assays of regulatory activity (Avsec, Weilert, et al. 2021; Trevino et al. 2021). The models take in a sequence context of 2,114 bp around the summit of each ATAC-seq peak and predict cluster-specific scATAC-seq pseudo-bulk Tn5 insertion counts at each base pair for the central 1,000 bp.

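The one-hot encoding described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the encoder used to train the models: the function names `one_hot_encode` and `to_model_input` are hypothetical, and leaving ambiguous bases such as N as all-zero rows is an assumption that the description does not confirm.

```python
import numpy as np

# Channel order follows the encoding stated in the description: A, C, G, T.
BASES = "ACGT"
SEQ_LEN = 2114  # input sequence length stated in the description


def one_hot_encode(seq):
    """Return a (len(seq), 4) float32 array.

    Assumption: any base outside A/C/G/T (for example N) is left all-zero.
    """
    lookup = {base: i for i, base in enumerate(BASES)}
    encoded = np.zeros((len(seq), 4), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        idx = lookup.get(base)
        if idx is not None:
            encoded[pos, idx] = 1.0
    return encoded


def to_model_input(seq):
    """Check the length and add a batch axis, giving shape (1, 2114, 4)."""
    if len(seq) != SEQ_LEN:
        raise ValueError(f"expected {SEQ_LEN} bp, got {len(seq)} bp")
    return one_hot_encode(seq)[np.newaxis, ...]


example = one_hot_encode("ACGTN")
```

In practice the 2,114 bp sequence would be extracted from the GRCh38 reference around a peak summit before encoding.
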
The BPNet model also uses an input Tn5 bias track, which is concatenated to the pre-final layer as explained below. Our BPNet model is a higher-capacity version of the architecture introduced in (Avsec, Weilert, et al. 2021). The model architecture consists of 8 dilated residual convolution layers, with 500 filters in each layer. At each layer, the Keras Cropping1D layer is used to clip the two edges of the sequence so that the inputs match the output of each convolution to which they are concatenated; this naturally trims the 2,114 bp sequence to a final 1,000 bp profile. Each dilated convolutional layer has a kernel width of 21, and the dilation rate is doubled for every convolutional layer, starting at 1.

The model predicts the base-resolution 1,000 bp Tn5 insertion count profile using two complementary outputs: (1) the total Tn5 insertion counts over the 1,000 bp region, and (2) a multinomial probability of Tn5 insertion counts at each position in the 1,000 bp sequence. The predicted (expected) count at a specific position is the product of the predicted total counts and the multinomial probability at that position.

To predict the total counts in the 1,000 bp window, the output from the last dilated convolutional layer is passed through a GlobalAveragePooling1D layer in Keras. We estimate the "Tn5 bias" for the input sequence using the TOBIAS method (Bentsen et al. 2020). This total bias is concatenated with the output of the pooling layer and passed through a Dense layer with 1 neuron to predict total counts.

To predict the per-base logits of the multinomial probability profile output, the output from the last dilated residual convolution is concatenated with the per-base TOBIAS "Tn5 bias" and passed through a final convolution layer with a single kernel and a kernel width of 1.

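The relationship between the two outputs (expected per-base count = predicted total counts × multinomial probability) can be sketched as follows. The function name `expected_counts` is hypothetical, and two things are assumptions rather than facts from the archive: that the profile output is arranged as (batch, 1000) logits, and that the counts head predicts the natural log of the total counts without an offset. If the models were trained on log(1 + counts), the exponentiation step would need adjusting.

```python
import numpy as np


def expected_counts(profile_logits, log_total_counts):
    """Combine profile logits and log total counts into per-base expected counts.

    profile_logits:   array of shape (batch, 1000), assumed layout
    log_total_counts: array of shape (batch,) or (batch, 1)
    """
    logits = np.asarray(profile_logits, dtype=np.float64)
    # Numerically stable softmax over positions gives the multinomial probabilities.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)
    # Assumption: the counts head outputs natural-log total counts.
    totals = np.exp(np.asarray(log_total_counts, dtype=np.float64)).reshape(-1, 1)
    return totals * probs


# Flat logits spread the total evenly over the 1,000 positions.
demo = expected_counts(np.zeros((2, 1000)), np.log([500.0, 2000.0]))
```

Each row of the result sums to the predicted total for that region, so the profile shape and the overall accessibility level can be inspected separately.
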
BPNet uses a composite loss function consisting of a linear combination of a mean squared error (MSE) loss on the log of the total counts and a multinomial negative log-likelihood (MNLL) loss for the profile probability output. We use MSE loss weights of [4.9, 4.3, 18.5, 9.8, 8.9, 4.8, 4.6, 4.9, 12.4, 15.4, 4.3, 6.3, 1.4, 2.6, 7.6, 2.3, 16.3, 7.1, 3.7] for clusters c0–c20 (c15 and c16 combined as one model), and a weight of 1 for the MNLL loss in the linear combination. The MSE loss weight is derived as the median of total counts across all peak regions for each cluster divided by a factor of 10 (Avsec, Weilert, et al. 2021). We used the ADAM optimizer with an early stopping patience of 3 epochs.

A separate BPNet model was trained on pseudobulk scATAC-seq profiles from each scATAC-seq cluster. We used a 5-fold chromosome hold-out cross-validation framework for training, tuning, and test set performance evaluation. The test and validation chromosomes used for each fold are as follows.

Test chromosomes:
fold 0: [chr1]
fold 1: [chr19, chr2]
fold 2: [chr3, chr20]
fold 3: [chr13, chr6, chr22]
fold 4: [chr5, chr16]

Validation chromosomes:
fold 0: [chr10, chr8]
fold 1: [chr1]
fold 2: [chr19, chr2]
fold 3: [chr3, chr20]
fold 4: [chr13, chr6, chr22]

For each fold, the remaining chromosomes that are not in the validation or test set were used for training.

The model's performance was evaluated using a separate metric for each of the two output tasks. For the total counts predicted for the 1,000 bp region, performance is computed as the Spearman correlation between predicted and actual counts. Profile prediction performance is evaluated using the Jensen-Shannon distance, which measures the divergence between two probability distributions; in this case, the observed and predicted base-resolution probability profiles over each 1,000 bp region.

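The profile metric can be illustrated with a small NumPy sketch of the Jensen-Shannon distance between an observed and a predicted profile for one region. This is not the authors' evaluation code: the function name is hypothetical, the base-2 logarithm (which bounds the distance to [0, 1]) is an assumption, and regions with zero observed counts would need to be skipped before calling it.

```python
import numpy as np


def jensen_shannon_distance(observed_counts, predicted_probs):
    """Jensen-Shannon distance between two profiles over the same positions.

    Both inputs are normalised to sum to 1 first, so observed Tn5 counts can be
    passed directly. Assumption: base-2 logarithm, giving values in [0, 1].
    """
    p = np.asarray(observed_counts, dtype=np.float64)
    q = np.asarray(predicted_probs, dtype=np.float64)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        # Terms with a == 0 contribute 0 by convention; b > 0 wherever a > 0.
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    # Guard against tiny negative values from floating-point round-off.
    return float(np.sqrt(max(divergence, 0.0)))


same = jensen_shannon_distance([3, 1, 0, 4], [0.375, 0.125, 0.0, 0.5])
disjoint = jensen_shannon_distance([1, 0], [0, 1])
```

Lower values indicate a closer match between the predicted and observed profile shapes; the per-region values would then be summarised across the held-out test chromosomes.
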
For each cell type, BPNet models were trained, tuned, and evaluated on genomic windows consisting of 1 kb scATAC-seq profiles from (1) signal windows centered at summits of scATAC-seq peaks from the cell type and (2) background windows randomly sampled across the genome such that the number of background windows was 10% of the number of signal windows. The selected signal and background windows were further augmented with up to 10 random jitters (+/- 1000 bp).

==================================
Code and data for training models
==================================

A description of all code for this paper is at https://github.com/kundajelab/Cardiogenesis_Repo.

These models were trained using Keras 2.4 and TensorFlow 2.3.0. The exact code base used to train the models is KerasAC (https://zenodo.org/record/4248179#.X8skj5NKiF0), which uses seqdataloader (https://zenodo.org/record/3771365#.X8skqZNKiF0) as part of the data processing and model training scripts.

The scATAC-seq peak regions for each cell type are at https://github.com/kundajelab/Cardiogenesis_Repo/tree/main/BPNet/peaksets. All coordinates are with respect to the GRCh38 version of the human genome: https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/

The BPNet models use a Tn5 bias model for bias correction. For each of the 5 folds, the bias models are available at https://github.com/kundajelab/Cardiogenesis_Repo/tree/main/BPNet/tobias_weights

==================================
How to use models for prediction
==================================

To load a model for a given cell type:

```python
from tensorflow import keras

# Paths are placeholders; archive files are named <celltype>.<fold>.arch.json / .weights.*
with open("/path/to/models/celltype.arch.json") as f:
    m = keras.models.model_from_json(f.read())
m.load_weights("/path/to/models/celltype.weights")
```

Each model takes as input a 2114 x 4 one-hot encoded DNA sequence and has 2 outputs:

1) the profile logits for the central 1000 bp
2) log counts for the central 1000 bp
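Because each cell type has five folds with three files each, a small helper that builds the file paths from the documented naming pattern may be convenient. The sketch below is illustrative only: `model_files` and `load_fold_model` are hypothetical names, and the layout `<archive_root>/<ModelName>/<ModelName>.<fold>.*` is an assumption based on the statement that there is one directory per cell type. TensorFlow is imported lazily so the path logic can be used without it.

```python
from pathlib import Path

N_FOLDS = 5  # folds are numbered 0-4 in the archive description


def model_files(archive_root, model_name, fold):
    """Return (architecture JSON path, weights prefix) for one cell type and fold.

    Assumption: files sit directly inside a directory named after the model,
    e.g. <archive_root>/vcm/vcm.0.arch.json.
    """
    if fold not in range(N_FOLDS):
        raise ValueError(f"fold must be in 0..{N_FOLDS - 1}, got {fold!r}")
    model_dir = Path(archive_root) / model_name
    stem = f"{model_name}.{fold}"
    # load_weights expects the prefix shared by the .index and .data-* files.
    return model_dir / f"{stem}.arch.json", model_dir / f"{stem}.weights"


def load_fold_model(archive_root, model_name, fold):
    """Rebuild the architecture from JSON and load the fold's weights."""
    from tensorflow import keras  # imported here so path helpers work without TF

    arch_json, weights_prefix = model_files(archive_root, model_name, fold)
    model = keras.models.model_from_json(arch_json.read_text())
    model.load_weights(str(weights_prefix))
    return model


arch_path, weights_path = model_files("/path/to/models", "vcm", 0)
```

For example, `load_fold_model("/path/to/models", "vcm", 0)` would load the fold-0 ventricular cardiomyocyte model, provided the archive is unpacked with the assumed layout.
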

Related Organizations

Illumina (Singapore)
Singapore
California Institute of Technology
United States
Stanford University
United States
Cincinnati Children's Hospital Medical Center
United States

Keywords

BPNet, neural network, chromatin accessibility, regulatory DNA, cardiogenesis, genomics, deep learning model, scATAC-seq, gene regulation, congenital heart disease
