Recent improvements in natural language processing (NLP) make it possible to generate metadata programmatically, either by reusing original metadata or by deriving it from the dataset itself. Transfer learning applied to NLP has greatly improved performance and reduced training-data requirements. In this talk, we'll compare machine-generated metadata to human-generated metadata and discuss the characteristics of metadata and data archives that affect their suitability for machine-learning reuse. Whereas human-generated metadata is often populated once, written from the data supplier's perspective, entered by many individuals using different words for the same thing, and limited in length, machine-generated metadata can be updated any number of times, generated from the perspective of any user, constrained to a standardized set of terms that can evolve over time, and be any length required. Machine-learning-generated metadata offers benefits, but it also brings additional needs in terms of version control, process transparency, human-computer interaction, and IT requirements. As a successful example, we'll discuss how a dataset of abstracts with associated human-tagged keywords, drawn from a standardized list of several thousand terms, was used to train a machine-learning model that predicted keyword metadata for open-source code projects on code.nasa.gov. We'll also discuss a less successful example from data.nasa.gov to show how data-archive architecture and the characteristics of the initial metadata strongly control how easily programmatic methods can reuse existing metadata to create additional metadata.
open-data, data catalog, metadata, machine-learning, data.nasa.gov, natural language processing, keywords, machine-generated, code.nasa.gov, nasa
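The keyword-prediction task described above is, at its core, multi-label text classification against a controlled vocabulary. A minimal pure-Python sketch of that idea follows; the training data, function names, and scoring scheme (token-overlap counts rather than the NLP transfer-learning model the talk describes) are all illustrative assumptions, not NASA's actual implementation.

```python
# Hypothetical sketch: predict controlled-vocabulary keywords for a new
# abstract from previously human-tagged abstracts, in the spirit of the
# code.nasa.gov keyword-prediction example. Toy data; not the real model.
from collections import Counter, defaultdict

def tokenize(text):
    # Crude tokenizer: lowercase, strip trailing punctuation.
    return [w.strip(".,").lower() for w in text.split()]

def train(tagged_abstracts):
    """Accumulate per-keyword token counts from (abstract, keywords) pairs."""
    model = defaultdict(Counter)
    for abstract, keywords in tagged_abstracts:
        tokens = tokenize(abstract)
        for kw in keywords:
            model[kw].update(tokens)
    return model

def predict(model, abstract, top_n=2):
    """Rank keywords by token overlap with the new abstract; drop zero scores."""
    tokens = Counter(tokenize(abstract))
    scores = {
        kw: sum(counts[t] * n for t, n in tokens.items())
        for kw, counts in model.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [kw for kw in ranked if scores[kw] > 0][:top_n]

# Toy corpus standing in for the several-thousand-term standardized list.
corpus = [
    ("Orbital mechanics software for satellite trajectory design.", ["orbits", "software"]),
    ("Machine learning model training utilities.", ["machine-learning", "software"]),
    ("Satellite orbit propagation and trajectory analysis.", ["orbits"]),
]
model = train(corpus)
print(predict(model, "A toolkit for satellite trajectory propagation."))
# → ['orbits', 'software']
```

Because the vocabulary is standardized, every predicted term is guaranteed to come from the controlled list, which is one of the advantages over free-text human tagging noted in the abstract.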