
final_fastas.tar.gz

initial_fastas – Initially downloaded sequences, with redundancy removed both within and across the source databases used to compile the dataset.

FASTAS_aligned – Subset of sequences that aligned to at least one reference sequence.

FASTAS_with_essential_signature – Sequences containing the required essential signature.

FASTAS_with_essential_aligned – Sequences containing the essential signature that also passed the alignment filter.

FASTAS_without_extra – Sequences containing the essential signature and no additional non-reference signatures.

FASTAS_without_extra_aligned – Sequences containing the essential signature and no additional non-reference signatures that also passed the alignment filter.

all_code_dataframes.tar.gz

Contains the mapping between the original FASTA headers and the internal IDs used in the FASTA files provided in this repository.
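
To check what each archive contains before unpacking it, a short script along the following lines may help. It is only a sketch: it assumes both archives are ordinary gzip-compressed tar files sitting in the current working directory, and it makes no assumption about the file names or formats inside them. The function name summarize_archive is a hypothetical helper introduced here, not part of the dataset.

```python
import tarfile
from collections import Counter
from pathlib import Path, PurePosixPath


def summarize_archive(archive_path):
    """Count regular files per parent directory inside a .tar.gz archive.

    Hypothetical helper: the internal layout of the archives is not
    documented here, so the function only reports what it finds.
    """
    counts = Counter()
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            if member.isfile():
                # Files stored at the archive root are reported under ".".
                parent = str(PurePosixPath(member.name).parent)
                counts[parent] += 1
    return dict(counts)


if __name__ == "__main__":
    # Archive names as given in the description; skipped if not present.
    for name in ("final_fastas.tar.gz", "all_code_dataframes.tar.gz"):
        if Path(name).exists():
            for folder, n_files in sorted(summarize_archive(name).items()):
                print(f"{name}: {folder} contains {n_files} file(s)")
```

Listing the members first avoids extracting a large archive just to see which of the folders described above it holds.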
