ZENODO
Dataset . 2025
License: CC BY
Data sources: Datacite
ZENODO
Dataset . 2024
License: CC BY
Data sources: Datacite

This Research product is the result of merged Research products in OpenAIRE.


Mash Sketch of RefSeq Bacterial Reference Genomes

Abstract

The mash reference that can be downloaded from the mash documentation is for RefSeq version 70. I do not inherently have a problem with RefSeq version 70, but RefSeq is well past version 200 now. RefSeq updates four times a year, and I needed an easy way to create and distribute a mash sketch file of the representative bacterial/prokaryotic genomes.

This is intended to be a place to hold the mash sketches from https://github.com/erinyoung/update_mash_dist.

The mash sketch file from erinyoung/update_mash_dist requires git lfs to be installed when cloning the repository, which is cumbersome for some users.

The update frequency is intended to mirror that of RefSeq (i.e. 4 times a year), but... is likely to be less frequent than that.

Don't hesitate to submit an issue if this needs to get updated.

I do have some prior zenodo repositories (https://zenodo.org/records/10519852 , https://zenodo.org/records/7887021 , and https://zenodo.org/records/7348463 ) which hold the same mash sketch reference, but with the RefSeq version in the title. I'd rather have one repository that gets updated than create a new repository each time.

This is how the mash reference file was created:

    # Step 1. Download Datasets and Dataformat
    wget https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/v2/linux-amd64/datasets
    wget https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/v2/linux-amd64/dataformat
    chmod +x datasets dataformat

    # Step 2. Download Mash
    wget https://github.com/marbl/Mash/releases/download/v2.3/mash-Linux64-v2.3.tar
    tar -xvf mash-Linux64-v2.3.tar

    # Step 3. Get a list of all the genomes
    # Note: this also changes how some of the names are represented
    datasets summary genome taxon bacteria --reference --as-json-lines | \
      dataformat tsv genome --fields accession,organism-name --elide-header | \
      sed 's/\[//g' | \
      sed 's/\]//g' | \
      sed 's/["'\'']//g' | \
      sed 's/endosymbiont of /endosymbiont_of_/g' > \
      ids.txt

    # Step 4. Download the reference files and sketch them
    # Note: Since this is done in Github Actions (GA), I need to keep everything below 30G.
    # The best way to do this is to download and process each reference file individually,
    # and then combine it into the whole.
    # This obviously does not need to be followed if not under those same limitations.
    # Note: ${version} is assumed to be set beforehand (e.g. to the RefSeq release number).
    while read line
    do
      # Each line of ids.txt holds: accession, genus, species
      id=$(echo $line | awk '{print $1}')
      ge=$(echo $line | awk '{print $2}')
      if [ ! -n "$ge" ] ; then ge="unknown" ; fi
      sp=$(echo $line | awk '{print $3}')
      if [ ! -n "$sp" ] ; then sp="unknown" ; fi

      datasets download genome accession $id
      unzip ncbi_dataset.zip
      cp ncbi_dataset/data/*/*_genomic.fna ${ge}_${sp}_${id}.fasta

      if [ ! -f RefSeqSketches_${version}.msh ]
      then
        # First genome: start the combined sketch
        mash sketch ${ge}_${sp}_${id}.fasta -o RefSeqSketches_${version}
      else
        # Later genomes: sketch individually, then paste onto the combined sketch
        mash sketch ${ge}_${sp}_${id}.fasta -o ${ge}_${sp}_${id}
        mv RefSeqSketches_${version}.msh tmp.msh
        mash paste RefSeqSketches_${version} tmp.msh ${ge}_${sp}_${id}.msh
        rm tmp.msh ${ge}_${sp}_${id}.msh
      fi

      rm ${ge}_${sp}_${id}.fasta
      rm -rf ncbi_dataset/
      rm ncbi_dataset.zip
      rm README.md
      rm md5sum.txt
    done < ids.txt

To compare a sample against the sketch:

    mash dist sample.fasta RefSeqSketches_${version}.msh > mash_results.txt

    # These results are unsorted, so many find it useful to sort them.
    sort -gk3 mash_results.txt > sorted_mash_results.txt

The results should look like the following:

    2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_pyogenes_GCF_900475035.1.fasta 0.0116661 0 643/1000
    2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_dysgalactiae_GCF_016128095.1.fasta 0.0782587 0 107/1000
    2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_canis_GCF_900636575.1.fasta 0.132399 2.34894e-153 32/1000
    2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_agalactiae_GCF_001552035.1.fasta 0.164662 1.32611e-72 16/1000
    2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_castoreus_GCF_000425025.1.fasta 0.174408 2.34302e-58 13/1000
    2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_didelphis_GCF_000380005.1.fasta 0.182269 8.30736e-49 11/1000
    2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_uberis_GCF_900475595.1.fasta 0.186761 5.62934e-44 10/1000
    2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_iniae_GCF_000831485.1.fasta 0.191731 3.33152e-39 9/1000
    2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_ictaluri_GCF_000188015.2.fasta 0.197292 1.75608e-34 8/1000
    2024CK-00429-UT-M03999-240412_contigs.fa Streptococcus_phocae_GCF_001302265.1.fasta 0.203604 2.46548e-30 7/1000
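
Because each result line carries the distance in its third column, the closest references can be pulled out with standard tools. The following is only a minimal sketch, not part of the original workflow: the helper names `top_hits` and `hits_below` are hypothetical, the default of 5 hits and the 0.05 distance cutoff are arbitrary values chosen for illustration, and the demo shortens the sample file name to `sample.fa` while reusing three rows from the example results.

```shell
#!/usr/bin/env bash
# Illustrative helpers for the results format shown above.
# Names, defaults, and cutoffs here are assumptions, not part of mash.
# Expected columns: first input, second input, distance, p-value, matching hashes.

# Print the n closest hits (smallest distance first).
top_hits() {
    results="$1"
    n="${2:-5}"          # assumed default: keep 5 hits
    sort -g -k3,3 "$results" | head -n "$n"
}

# Print hits whose distance is at or below a cutoff.
hits_below() {
    results="$1"
    cutoff="${2:-0.05}"  # arbitrary illustrative cutoff, not a recommendation
    awk -v max="$cutoff" '($3 + 0) <= (max + 0)' "$results"
}

# Demo using three rows from the example results (sample name shortened).
demo=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\n' \
    sample.fa Streptococcus_canis_GCF_900636575.1.fasta 0.132399 2.34894e-153 32/1000 \
    sample.fa Streptococcus_pyogenes_GCF_900475035.1.fasta 0.0116661 0 643/1000 \
    sample.fa Streptococcus_dysgalactiae_GCF_016128095.1.fasta 0.0782587 0 107/1000 \
    > "$demo"

top_hits "$demo" 1      # closest reference only
hits_below "$demo" 0.1  # all hits with distance <= 0.1
rm -f "$demo"
```

Sorting is restricted to the third column so that the p-value column, which uses scientific notation, does not influence the order. Any cutoff used for species calls should be chosen for the organisms and data at hand.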

Keywords

MASH, RefSeq, Sketch, Prokaryotes
