Biodiversity Informa...
ZENODO
Article . 2023
License: CC BY
Data sources: ZENODO
Pensoft
Conference object . 2023
Data sources: Pensoft

Preservation Strategies for Biodiversity Data

Authors: Dmitry Mozzherin; Deborah Paul;

Abstract

We are witnessing a fast proliferation of biodiversity informatics projects. The data accumulated by these initiatives often grows rapidly, even exponentially. Most of these projects start small and do not foresee the data architecture challenges of their future. Organizations may lack the necessary expertise and/or money to strategically address the care and feeding of this expanding data pile. In other cases, individuals with the expertise to address these needs may be present, but lack the power, the status, or possibly the bandwidth to take effective action. Over time, the data may increase in size to such an extent that handling and preserving it becomes an almost insurmountable problem.

The most common technical challenges include migrating data from one physical data storage to another, organizing backups, providing fast disaster recovery, and preparing data to be accessible for posterity. Some sociotechnical and strategic hurdles noted when trying to address data stewardship include funding, data leadership (Stack and Stadolnik 2018) and vision (or lack thereof), and organizational structure and culture.

The biodiversity data collected today will be indispensable for future research, and it is our collective responsibility to preserve it for current and future generations. Some of the most common information loss risk factors are the end of funding, the retirement of a researcher, or the departure of a critical researcher or programmer. Other risk factors, such as hardware malfunction, hurricanes, tornadoes, and severe magnetic storms, can destroy the data carefully collected by large groups of people. The co-location of original data and backups creates a "Library of Alexandria", where a single disaster at this location can lead to permanent data loss and poses an existential threat to the project. Biodiversity data becomes more valuable over time and should survive for several centuries.
However, SSD (solid-state drive) and HDD (hard disk drive) storage solutions have an expiration date of only a few years. We propose the following solutions (Fig. 1) to provide long-term data security:

Technical tactics

  • Use immutable file storage for everything that is not entered very recently. Most of the biodiversity "big data" are files that are written once and never changed again. We suggest separating storage into a read-only part and small read/write sections. Data from the read/write section can be moved to the read-only part often, for example, daily.
  • Use a Copy-On-Write file system, such as ZFS (Zettabyte File System). The ZFS file system is widely used in industry and is known for its robustness and error resistance. It allows efficient incremental backups and much faster data transfer than other systems. Regular incremental backups can work even with slow internet connections. ZFS provides real-time data integrity checks and uses powerful tools for data healing.
  • Split data and its backups into smaller chunks. Dividing backups into cost-effective 2–8 terabyte chunks allows running backups on cheap hardware. Assuming that the data is read-only, such an organization always allows the backup to be split into chunks, with hardware costs dropping from tens of thousands of US dollars to less than two thousand. We recognize that data storage costs drop with time, and larger chunks will be used.
  • Split the data even further to the size of the largest available long-term storage unit (currently an optical M-DISC). The write-once optical M-DISC is analogous to a Sumerian clay tablet. Data written on such discs does not deteriorate for hundreds of years. This option addresses the need for last-resort backups because the storage does not depend on magnetic properties and is impervious to electromagnetic disasters. Optical discs can be easily and cheaply copied and distributed to libraries worldwide. In the future, the discs' data can be transferred to a different long-term storage medium. We also trust these discs can be deciphered by those in the future, just like clay tablets.

Sociotechnical insights

The above example of a comprehensive strategy to preserve data epitomizes "LOCKSS" (lots of copies keep stuff safe) and makes it clear that these copies need to be in multiple media types. Our suggestions here focus on projects that experience data growth pains. Such projects often look to see how others address these data needs. Recently, the Species File Group (SFG) did this exercise to evaluate and address our data growth needs (Mozzherin et al. 2023). We recognize and emphasize here the need for

  • personnel with the knowledge and skills to build, maintain, and evolve robust strategies and infrastructure to make data accessible and preserve it,
  • funding to back the most suitable architectural strategies to do so, and
  • people with expertise in long-term data security to have a seat at the leadership table in our organizations.

We encourage our colleagues to evaluate the status of data leadership at their organizations (Stack and Stadolnik 2018, Kalms 2012). Implementing these suggestions will help ensure the survival of the data and accompanying software for hundreds of years to come.
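
The first technical tactic, moving data from a small read/write section into a read-only part on a regular schedule, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `promote` and `sha256_of`, the directory layout, and the use of file permission bits are all assumptions made for illustration, and permission bits alone do not make storage truly immutable. The sketch records a SHA-256 checksum for each file so that later copies can be verified against it.

```python
import hashlib
import shutil
import stat
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def promote(incoming: Path, archive: Path) -> dict:
    """Move every file from the read/write area into the archive area.

    Hypothetical helper: each file is checksummed, moved, and then has its
    write permission bits cleared. Returns {file name: checksum}.
    """
    archive.mkdir(parents=True, exist_ok=True)
    manifest = {}
    for src in sorted(p for p in incoming.iterdir() if p.is_file()):
        dst = archive / src.name
        if dst.exists():
            # Archived files are written once; never overwrite them.
            raise FileExistsError(f"{dst} is already archived")
        checksum = sha256_of(src)
        shutil.move(str(src), str(dst))
        # Leave read permission only (owner, group, others).
        dst.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
        manifest[src.name] = checksum
    return manifest
```

A job of this kind could run daily, as the abstract suggests, with the returned manifest stored alongside the archive.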
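
The incremental backups mentioned for ZFS are usually built from snapshots: a new snapshot is taken, and only the difference from the previous snapshot is sent to the backup host. The sketch below assembles the corresponding `zfs snapshot`, `zfs send -i`, and `zfs receive` command lines as strings without running them. The dataset names, snapshot names, and host are hypothetical, and the abstract does not state which exact commands or options the authors use.

```python
import shlex


def incremental_backup_commands(dataset, prev_snap, new_snap,
                                remote_host, remote_dataset):
    """Build (but do not run) shell commands for one incremental ZFS backup.

    All arguments are illustrative placeholders. Returns a pair:
    the snapshot command and the send/receive pipeline.
    """
    snapshot = ["zfs", "snapshot", f"{dataset}@{new_snap}"]
    # -i sends only the changes between the previous and the new snapshot.
    send = ["zfs", "send", "-i",
            f"{dataset}@{prev_snap}", f"{dataset}@{new_snap}"]
    receive = ["ssh", remote_host, "zfs", "receive", remote_dataset]
    pipeline = shlex.join(send) + " | " + shlex.join(receive)
    return shlex.join(snapshot), pipeline


if __name__ == "__main__":
    for command in incremental_backup_commands(
        "tank/archive", "2023-08-01", "2023-08-02",
        "backup.example.org", "vault/archive",
    ):
        print(command)
```

Because only the changed blocks travel over the network, this pattern is what lets regular backups work over slow connections.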
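
Splitting read-only data into fixed-size chunks is a bin-packing problem. One simple way to plan the chunks, shown here only as an illustration, is the first-fit decreasing heuristic: sort files from largest to smallest and place each one in the first chunk that still has room. The names `Chunk` and `plan_chunks` and the 4 TB capacity in the example are assumptions; the abstract gives only the 2–8 terabyte range and does not say how the authors assign files to chunks.

```python
from dataclasses import dataclass, field

TB = 10**12  # decimal terabyte, in bytes


@dataclass
class Chunk:
    capacity: int
    files: list = field(default_factory=list)
    used: int = 0

    def fits(self, size: int) -> bool:
        return self.used + size <= self.capacity

    def add(self, name: str, size: int) -> None:
        self.files.append(name)
        self.used += size


def plan_chunks(file_sizes: dict, capacity: int) -> list:
    """Assign files to chunks with the first-fit decreasing heuristic."""
    chunks = []
    # Largest files first; ties are broken by name for a stable plan.
    ordered = sorted(file_sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, size in ordered:
        if size > capacity:
            raise ValueError(f"{name} does not fit in a single chunk")
        target = next((c for c in chunks if c.fits(size)), None)
        if target is None:
            target = Chunk(capacity)
            chunks.append(target)
        target.add(name, size)
    return chunks


if __name__ == "__main__":
    sizes = {"a": 3 * TB, "b": 3 * TB, "c": 2 * TB, "d": 1 * TB, "e": 1 * TB}
    for number, chunk in enumerate(plan_chunks(sizes, 4 * TB), start=1):
        print(number, chunk.files, chunk.used / TB)
```

The same function can plan the finer split for long-term media by passing the capacity of a single disc instead of a multi-terabyte chunk size.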

Keywords

exponential data growth, information loss, data architecture, data backup, data leadership, data stewardship, ZFS
