Creating a 7,000-strain genotype-phenotype dataset of E. coli and antimicrobial resistance phenotypes

Description

This Zenodo repository contains the data and code needed to reproduce the study from https://doi.org/10.57844/arcadia-d2cf-ebe5 and the associated GitHub repository, where the code, pipelines, and analysis are described in more detail. It does not include the input FASTQ files, which are available on SRA, or the intermediary files generated during the variant calling process.

Work summary

In this work, we established a framework for compiling large genotype-phenotype datasets and produced a dataset of more than 7,000 E. coli strains with antimicrobial resistance (AMR) phenotypes. We leveraged the genetic information and AMR phenotype data available for the bacterium Escherichia coli to construct our dataset, and took advantage of existing knowledge about genetic variations and AMR phenotypes to validate our approach and dataset.

We performed variant calling and compiled a genotype-phenotype dataset for these strains. Briefly, variant calling consists of identifying all genetic variations and their associated genotypes in a population compared to a reference genome. This is performed by aligning the sequencing reads of each strain against a reference genome, then identifying polymorphic regions in the population, and finally characterizing the variants and their genotypes at each of these polymorphic regions.

The resulting dataset reveals substantial genetic diversity, with 2.4 million variants identified. By focusing on non-silent variants within genes associated with AMR, we confirmed the dataset's accuracy. We hope this study is a foundational resource for conducting large-scale genotype-phenotype studies that will offer valuable insights for genetics investigations, informing the development of treatments and prevention strategies for AMR.
This resource should be valuable for microbiologists and epidemiologists seeking to understand AMR mechanisms and improve genotype-phenotype predictions in pathogenic E. coli outbreaks. It is also of particular interest to geneticists and evolutionary biologists, providing a dataset to develop strategies for studying genetic interactions and broader applications in phenotype-phenotype predictions and phylogenetic research.

Data organization

Data are organized in the compressed folder, which is divided into two main folders.

The first folder, dataset_generation, includes the code and information necessary to build the genotype dataset and perform the variant calling. It covers major steps such as the generation of the reference pangenome used for variant calling, the variant calling pipeline applied to each of the 7,000 strains, the filtering of false positive variants, and the annotation of the variants.

The second folder, dataset_analysis, includes the code and information used to process and analyze the dataset and generate figures for the pub (https://doi.org/10.57844/arcadia-d2cf-ebe5). It includes the preliminary analysis of AMR phenotypes within the population and the analysis of variants with respect to known AMR phenotypes.

Files description

The following table lists the files, their locations, and a description of each.
| File name | Location | Description |
| --- | --- | --- |
| variant_calling_pipeline | dataset_generation/scripts/ | Snakefile: performs variant calling from raw paired-end sequencing files and generates one vcf.gz file per sample |
| snakemake_ECOR72_annotation | dataset_generation/scripts/ | Snakefile: performs Prokka annotation on input whole-genome FASTQ files |
| ECOR72_and_DP_threshold_analysis.Rmd | dataset_generation/scripts/ | R markdown: analyzes the coverage of loci known to be present or absent in the ECOR population |
| average_coverage_41.csv | dataset_generation/data/dp_threshold/ | Pangenome loci read coverage information for 40 ECOR strains |
| average_coverage_last32.csv | dataset_generation/data/dp_threshold/ | Pangenome loci read coverage information for 32 ECOR strains |
| whole_pan_ecor_presence_absence.csv | dataset_generation/data/dp_threshold/ | Reformatted pangenome loci presence-absence in ECOR strains |
| pangenome_genomes_SRA_GCA.csv | dataset_generation/data/dp_threshold/ | Correspondence table between ECOR72 strain genome names and the SRA accession numbers of the raw sequencing files |
| index_loci_pangenome_good.txt | dataset_generation/data/dp_threshold/ | List of indexed positions in the pangenome |
| list_ecor_txtfiles.txt | dataset_generation/data/dp_threshold/ | List of txt files (containing the DP information per nucleotide) to use; these are the files for each of the 72 ECOR strains |
| ECOR72_SRA_and_assembly_accessions.csv | dataset_generation/data/ | List of genome accession numbers and the SRA accession numbers of the associated sequencing files for the 72 ECOR strains |
| sample_list_SRA.csv | dataset_generation/data/ | List of SRA accession numbers of the E. coli strains used for variant calling |
| gene_presence_absence.csv | dataset_generation/results/pangenome_cds/ | Roary output: presence-absence of the pangenome CDS loci in the ECOR72 strains |
| genes.gff | dataset_generation/results/pangenome_cds/ | Annotation file of the pangenome CDS sequences (Prokka output) |
| pangenome_cds.fa | dataset_generation/results/pangenome_cds/ | Roary output: CDS pangenome sequence file |
| summary_statistics.txt | dataset_generation/results/pangenome_cds/ | Roary output: statistics on the creation of the CDS pangenome |
| roary_output | dataset_generation/results/pangenome_cds/ | Roary output folder |
| IGR_presence_absence.csv | dataset_generation/results/pangenome_igr/ | Piggy output: presence-absence of the pangenome IGR loci in the ECOR72 strains |
| pangenome_igr.fasta | dataset_generation/results/pangenome_igr/ | Piggy output: IGR pangenome sequence file |
| piggy_output | dataset_generation/results/pangenome_igr/ | Piggy output folder |
| whole_pangenome.fasta | dataset_generation/results/pangenome_whole/ | Whole pangenome sequences |
| annot_summary_filtered.html | dataset_generation/results/vcf/ | Summary of snpEff annotations |
| annotated_output.vcf.gz | dataset_generation/results/vcf/ | snpEff-annotated VCF file |
| annotated_output.vcf.gz.csi | dataset_generation/results/vcf/ | Index of the annotated VCF file |
| filtered_output.vcf.gz | dataset_generation/results/vcf/ | Filtered VCF file (low-coverage and low-quality variants removed) |
| filtered_output.vcf.gz.csi | dataset_generation/results/vcf/ | Index of the filtered VCF file |
| output.non_silent.vcf.gz | dataset_generation/results/vcf/ | VCF file containing only the non-silent variants in the pangenome CDS loci |
| merged_output_listN.vcg.gz | dataset_generation/results/vcf/ | Intermediary VCF files, each merging the VCFs of 1000 strains; N runs from 1 to 7 |
| merged_output_all.vcf.gz | dataset_generation/results/vcf/ | Final vcf.gz file merging all VCF files in this study |
| List_N_merging.txt | dataset_generation/data/vcf_merging/ | List of the 1000 vcf.gz files to be merged together; there are 7 lists, numbered from 1 to 7 |
| ecor72_array.txt | dataset_generation/results/ecor72_DP/ | Consolidated DP information per nucleotide for each ECOR strain |
| variants_pos.tsv | dataset_analysis/data/variant_analysis/ | List of all the variants found in the population, identified by their locus and position within the locus |
| allele_freqs.txt | dataset_analysis/data/variant_analysis/ | Variant frequency information |
| variants_non_silent_pos.tsv | dataset_analysis/data/variant_analysis/ | List of all the non-silent variants found in the population, identified by their locus and position within the locus |
| allele_non_silent_freqs.txt | dataset_analysis/data/variant_analysis/ | Non-silent variant frequency information |
| cds_eggNog.tsv | dataset_analysis/data/variant_analysis/ | eggNOG output file of the pangenome annotation |
| COG_functional_categories.csv | dataset_analysis/data/variant_analysis/ | Correspondence between COG functional categories and higher-order annotation |
| BVBRC_genome_May31.csv | dataset_analysis/data/dataset_analysis/ | List of E. coli strains with available genomes, as reported in the BV-BRC database |
| BVBRC_genome_amr_May31.csv | dataset_analysis/data/dataset_analysis/ | E. coli antimicrobial resistance information available in the BV-BRC database |
| antibiotic_class.csv | dataset_analysis/data/dataset_analysis/ | Antibiotic name and antibiotic class information |
| resistance_output.non_silent.vcf.gz | dataset_analysis/data/antimicrobial_resistance_analysis/ | vcf.gz file of the loci expected to be associated with antimicrobial resistance |
| antibiotic_resistance_freq.csv | dataset_analysis/data/antimicrobial_resistance_analysis/ | Frequency information for the non-silent variants in the selected antimicrobial resistance genes |
| SRA_to_genome_name.csv | dataset_analysis/data/antimicrobial_resistance_analysis/ | Correspondence between strain SRA accession number and genome name (as reported in BV-BRC) |
| Dataset_metainfo_AMR_analysis.Rmd | dataset_analysis/scripts/ | R markdown: characterizes the population and analyzes the AMR phenotype distribution |
| Variant_population_analysis.Rmd | dataset_analysis/scripts/ | R markdown: analyzes and investigates the variants identified in the population |
| Antimicrobial_resistance_investigation.Rmd | dataset_analysis/scripts/ | R markdown: conducts the antimicrobial resistance investigation |
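
The table describes a two-stage merge: per-sample vcf.gz files are grouped into lists of 1000 (List_N_merging.txt), each list is merged into an intermediary file, and the intermediary files are merged into merged_output_all.vcf.gz. The Python sketch below illustrates only the batching step. It is not the authors' code: the function names and the example file names are hypothetical, and how the lists were actually built, including the size of the last list, is not stated in this description.

```python
from typing import Dict, Iterator, List, Sequence


def batch_paths(paths: Sequence[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Yield consecutive slices of `paths`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(paths), batch_size):
        yield list(paths[start:start + batch_size])


def merging_lists(paths: Sequence[str], batch_size: int = 1000) -> Dict[str, List[str]]:
    """Map each list file name (List_N_merging.txt, N starting at 1) to its batch."""
    return {
        f"List_{n}_merging.txt": batch
        for n, batch in enumerate(batch_paths(paths, batch_size), start=1)
    }


# Hypothetical per-sample file names, used here for illustration only.
# In practice the accessions would come from sample_list_SRA.csv.
example_paths = [f"SRR{i:07d}.vcf.gz" for i in range(1, 7001)]
example_lists = merging_lists(example_paths)
print(len(example_lists), "lists;",
      len(example_lists["List_1_merging.txt"]), "files in the first")
```

Each resulting list could then be written to disk, one path per line, and passed to whichever VCF merging tool the pipeline uses. Merging in batches keeps the number of files opened at once bounded, which is a common reason for splitting a merge of several thousand samples into stages.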
