Code and data for "Optimization of regulatory DNA with active learning"

Code and data for the paper “Optimization of regulatory DNA with active learning” by Shen, Kudla and Oyarzún.

- data.zip: all NK landscapes in CSV format.
- code.zip: Python code for reproducing the results of the paper.

1. Code overview

code.zip contains two subfolders, one for the NK landscape and one for the promoter landscape, and one environment file.

- `AL.yml`: The environment for all the code.

AL on NK landscape

2. NK genotype-phenotype landscapes (Figure 1)

- `nk_landscape.ipynb`: Generate the NK0-NK3 landscapes and save them in CSV files as ground-truth landscapes. The NK model is adapted from the NK simulation in [1], available at https://github.com/acmater/NK_Benchmarking/blob/master/utils/nk_utils/NK_landscape.py.
- `nk_local_landscape.ipynb`: Generate the NK1-NK3 local landscapes.
- `nk_tsne.ipynb`: Plot the 2D t-SNE embedding of the genotype space and label the sequences according to their phenotype (Figure 1C).
- `nk_mlp.ipynb`: Train MLP models on the four NK landscapes (Figure 1D).

3. AL on NK genotype-phenotype landscapes (Figure 2)

- `AL_NK_pipeline.ipynb`: The active learning pipeline on the NK landscape. Different conditions, such as AL with random sampling and ALDE, can be set in the pipeline.
- `NK_benchmarking_ho.ipynb`: One-shot model performance on the NK landscapes with hyperparameter optimization, for comparison with AL performance. Three optimization methods for one-shot modelling are implemented: random screening (RS), strong-selection weak-mutation (SSWM) and gradient descent (GD).

AL on promoter landscape

4. AL on promoter landscape (Figure 3)

- `Glu_model.py`, `Ura_model.py`: Code for using the pre-trained promoter landscape. The promoter landscape is derived from the transformer model trained on a large-scale characterization of promoter expression in [2], available at https://github.com/1edv/evolution/.
- `AL_loop.py`: The main script for the active learning pipeline on the promoter landscape.
- `AL_sampling_methods.py`: The selection methods for the active learning pipeline on the promoter landscape.
- `AL_selection.py`: The UCB function for the active learning pipeline on the promoter landscape, adapted from [3].
- `promoter_benchmarking_ho.ipynb`: One-shot model performance on the promoter landscape with hyperparameter optimization, for comparison with AL performance. Three optimization methods for one-shot modelling are implemented: random screening (RS), strong-selection weak-mutation (SSWM) and gradient descent (GD).

5. Biological sampling and motif information (Figure 4)

- `motif_analysis.ipynb`: Motif analysis of the batches sampled by AL (Figure 4C).
- `AL_PFM.py`: Incorporates the motif information calculation into the UCB function.

References

[1] Sandhu et al. "Investigating the determinants of performance in machine learning for protein fitness prediction." Protein Science (2025).
[2] Vaishnav et al. "The evolution, evolvability and engineering of gene regulatory DNA." Nature (2022).
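
The UCB-based selection listed for `AL_selection.py` can be illustrated with a minimal sketch. This is not the code shipped in code.zip: the function names `ucb_scores` and `select_batch`, the parameter `beta`, and the use of the spread across an ensemble of model predictions as the uncertainty term are all assumptions made for illustration. It only shows the general idea of an upper confidence bound, which ranks candidate sequences by predicted mean plus a weighted uncertainty.

```python
from statistics import mean, pstdev


def ucb_scores(ensemble_preds, beta=1.0):
    """Return one UCB score per candidate.

    ensemble_preds: list of lists, one inner list per model (hypothetical
    ensemble), each holding that model's prediction for every candidate.
    beta: assumed exploration weight on the uncertainty term.
    """
    # Transpose so each tuple holds all model predictions for one candidate.
    per_candidate = list(zip(*ensemble_preds))
    # Score = mean prediction + beta * spread across the ensemble.
    return [mean(p) + beta * pstdev(p) for p in per_candidate]


def select_batch(ensemble_preds, batch_size, beta=1.0):
    """Return indices of the batch_size candidates with the highest UCB score."""
    scores = ucb_scores(ensemble_preds, beta)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:batch_size]


# Toy example: two models, three candidate sequences.
preds = [
    [1.0, 2.0, 3.0],  # predictions from model A
    [1.0, 4.0, 3.0],  # predictions from model B
]
batch = select_batch(preds, batch_size=2, beta=1.0)
print(batch)  # [1, 2]
```

In this toy case candidate 1 is ranked first because the two models disagree on it, so its uncertainty bonus lifts it above candidate 2, which has the same mean prediction. Setting `beta=0` reduces the rule to purely greedy selection on the mean.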

Related Organizations

University of Edinburgh
United Kingdom
