publication . Other literature type . Article . 2018

Prevention of data duplication for high throughput sequencing repositories

Gabdank, Idan; Chan, Esther T; Davidson, Jean M; Hilton, Jason A; Davis, Carrie A; Baymuradov, Ulugbek K; Narayanan, Aditi; Onate, Kathrina C; Graham, Keenan; Miyasato, Stuart R; ...
  • Published: 01 Feb 2018
  • Publisher: Oxford University Press (OUP)
Abstract
Prevention of unintended duplication is one of the ongoing challenges many databases have to address. For high-throughput sequencing data, the complexity of that challenge increases with the complexity of the definition of a duplicate. In a computational data model, a data object represents a real entity such as a reagent or a biosample. This representation is similar to how a card represents a book in a paper library catalog. Duplicated data objects not only waste storage, they can also mislead users into assuming the model represents more than a single entity. Even if it is clear that two objects represent a single entity, data duplication opens th...
Subjects
free text keywords: General Biochemistry, Genetics and Molecular Biology, General Agricultural and Biological Sciences, Information Systems, DNA sequencing, Data deduplication, Data mining, Computer science, Original Article
Related Organizations
Funded by
NIH| A Data Coordinating Center for ENCODE
Project
  • Funder: National Institutes of Health (NIH)
  • Project Code: 1U24HG009397-01
  • Funding stream: NATIONAL HUMAN GENOME RESEARCH INSTITUTE
NIH| A Data Coordinating Center for ENCODE
Project
  • Funder: National Institutes of Health (NIH)
  • Project Code: 3U41HG006992-04S1
  • Funding stream: NATIONAL HUMAN GENOME RESEARCH INSTITUTE
