
This dataset includes all code and data required to reproduce the results of:

Greg Schuette, Zhuohan Lao, and Bin Zhang. ChromoGen: Diffusion model predicts single-cell chromatin conformations, 16 July 2024, PREPRINT (Version 1) available at Research Square [https://doi.org/10.21203/rs.3.rs-4630850/v1]

File descriptions:

1. chromogen_code.tar.gz contains all code and, as of its upload date, is identical to the corresponding GitHub repo. Note that:
   1. Some or all of the code inside chromogen_code.tar.gz/ChromoGen/recreate_results/train/EPCOT/, chromogen_code.tar.gz/ChromoGen/recreate_results/generate_data/EPCOT, and chromogen_code.tar.gz/ChromoGen/src/model/Embedder was adapted from that provided in the original EPCOT paper, Zhang et al. (2023).
   2. chromogen_code.tar.gz/ChromoGen/recreate_results/create_figures/Figure_4/domain_boundary_support/PostAnalysisTools.py was adopted from Bintu et al. (2018); our only change was translating the code from Python 2 to Python 3.
   3. Several of the Jupyter Notebooks within chromogen_code.tar.gz/ChromoGen/recreate_results/create_figures/ visualize Hi-C and DNase-seq data from Rao et al. (2014) and The ENCODE Project Consortium (2012), respectively, though this dataset excludes the experimental data itself. See chromogen_code.tar.gz/README.md for instructions on obtaining the data.
   4. Dip-C data from Tan et al. (2018) are also visualized throughout these notebooks. This dataset excludes the raw Dip-C data, though it does include a post-processed version of the data (see bullets 4-5).
   5. The files within chromogen_code.tar.gz/ChromoGen/recreate_results/generate_data/conformations/MDHomopolymer were originally used for Schuette et al. (2023), though we first make those scripts available here (the first author of both works created these files).
2. epcot_final.pt contains the fine-tuned EPCOT parameters. Note that the pre-trained parameters, which are not included in this dataset, came from Zhang et al. (2023) and were used as the starting point for our fine-tuning.
3. chromogen.pt contains the complete set of ChromoGen model parameters, including both the relevant fine-tuned EPCOT parameters and all diffusion model parameters.
4. conformations.tar.gz contains all conformations analyzed in the manuscript, including the Dip-C conformations formatted in an HDF5 file, all ChromoGen-inferred conformations, and the MD-generated homopolymer conformations. Descriptively named subdirectories organize the data. Note that:
   1. conformations.tar.gz/conformations/MDHomopolymer/DUMP_FILE.dcd is from Schuette et al. (2023), though it is first made available here.
   2. conformations.tar.gz/conformations/DipC/processed_data.h5 represents our post-processed version of the 3D genome structures predicted by Dip-C in Tan et al. (2018).
5. outside_data.tar.gz contains two subdirectories:
   1. inputs contains our post-processed genome assembly file. Its sole content, hg19.h5, is a post-processed version of the FASTA-formatted hg19 human genome assembly created by Church et al. (2011), which we downloaded from the UCSC genome browser (Kent et al. (2002) and Nassar et al. (2023)). This dataset does NOT include the FASTA file itself.
   2. training_data contains the Dip-C conformations post-processed by our pipeline. This is a duplicate of the file described in bullet 4.2.
6. embeddings.tar.gz contains the sequence embeddings created by our fine-tuned EPCOT model for each region included in the diffusion model's training set. These are only needed during training.

chromogen_code.tar.gz/ChromoGen/README.md and the README.md file on our GitHub repo (identical at the time of this dataset's publication) explain the content of each file in greater detail. They also explain how to use the code to reproduce our results or to make your own structure predictions.
You can download and organize all the files in this dataset as intended by running the following in bash:

    # Download the code and expand the tarball whose contents define the
    # larger file structure of the repository this dataset is archiving.
    wget https://zenodo.org/records/14218666/files/chromogen_code.tar.gz
    tar -xvzf chromogen_code.tar.gz
    rm chromogen_code.tar.gz

    # Enter the top-level directory of the repo, create the subdirectories
    # that'll contain the data, and cd to it
    cd ChromoGen
    mkdir -p recreate_results/downloaded_data/models
    cd recreate_results/downloaded_data

    # Download all the data in the proper locations
    wget https://zenodo.org/records/14218666/files/conformations.tar.gz &
    wget https://zenodo.org/records/14218666/files/embeddings.tar.gz &
    wget https://zenodo.org/records/14218666/files/outside_data.tar.gz &
    cd models
    wget https://zenodo.org/records/14218666/files/chromogen.pt &
    wget https://zenodo.org/records/14218666/files/epcot_final.pt &
    cd ..
    wait

    # Untar the three tarballs
    tar -xvzf conformations.tar.gz &
    tar -xvzf embeddings.tar.gz &
    tar -xvzf outside_data.tar.gz &
    wait

    # Remove the now-unneeded tarballs
    rm conformations.tar.gz embeddings.tar.gz outside_data.tar.gz
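Because the downloads and extractions run in the background, a failed transfer can go unnoticed. The following is a minimal sketch of a sanity check, not part of the ChromoGen repository: `check_downloads` is a hypothetical helper, and it only looks for the four files whose paths are named explicitly in the file descriptions. It assumes conformations.tar.gz expands into a top-level conformations/ directory, as those paths suggest. The layouts of embeddings.tar.gz and outside_data.tar.gz are not spelled out here, so the sketch does not check them.

```shell
#!/usr/bin/env bash
# Sketch only: check_downloads is a hypothetical helper, not shipped with ChromoGen.
# It reports which of the explicitly documented files are missing or empty.
check_downloads() {
    # Directory that holds the downloaded data; defaults to the location
    # used by the download commands when run from the ChromoGen directory.
    local root="${1:-recreate_results/downloaded_data}"
    local missing=0
    local rel
    for rel in \
        models/chromogen.pt \
        models/epcot_final.pt \
        conformations/MDHomopolymer/DUMP_FILE.dcd \
        conformations/DipC/processed_data.h5
    do
        # -s is true only if the file exists and has a size greater than zero.
        if [ -s "$root/$rel" ]; then
            echo "ok       $rel"
        else
            echo "MISSING  $rel"
            missing=$((missing + 1))
        fi
    done
    # Return a non-zero status if anything was missing or empty.
    [ "$missing" -eq 0 ]
}
```

From the ChromoGen directory, `check_downloads recreate_results/downloaded_data` prints one line per file and returns a non-zero status if any of them is absent. A present, non-empty file can still be truncated, so this does not replace comparing checksums when they are available.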
