Powered by OpenAIRE graph
Found an issue? Give us feedback
image/svg+xml art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos Open Access logo, converted into svg, designed by PLoS. This version with transparent background. http://commons.wikimedia.org/wiki/File:Open_Access_logo_PLoS_white.svg art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos http://www.plos.org/ ZENODOarrow_drop_down
image/svg+xml art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos Open Access logo, converted into svg, designed by PLoS. This version with transparent background. http://commons.wikimedia.org/wiki/File:Open_Access_logo_PLoS_white.svg art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos http://www.plos.org/
ZENODO
Presentation . 2021
License: CC BY
Data sources: Datacite
image/svg+xml art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos Open Access logo, converted into svg, designed by PLoS. This version with transparent background. http://commons.wikimedia.org/wiki/File:Open_Access_logo_PLoS_white.svg art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos http://www.plos.org/
ZENODO
Other literature type . 2021
License: CC BY
Data sources: ZENODO
versions View all 2 versions
addClaim

This Research product is the result of merged Research products in OpenAIRE.

You have already added 0 works in your ORCID record related to the merged Research product.

csv detective solving some of the mysteries of open data

Authors: Anthony Auffret; Geoffrey Aldebert; Pavel Soriano;

csv detective solving some of the mysteries of open data

Abstract

Over the last few years, the emphasis on data quality has evolved from being a nice to have to an absolute necessity. As already stated in past editions of this venue, the quality of open data is essential to *truly fulfill its promise as a channel of empowerment*. In practical terms, data quality for tabular data entails, among other tasks, checking the structural integrity and schematic consistency of its contents. In order to satisfy these requirements, we need to look into the files content and first determine whether our files are indeed CSV files, and secondly, and more importantly, we want to discover the type of data we have in order to properly validate it and thus evaluate and then hopefully improve the quality of our datasets. In this talk, we present our work on the automatic detection of data types within the columns of CSV files. Going beyond classic computer science data types (float, integer, date), we are also interested in detecting more specific in-domain data categories. Specifically, given a CSV file, we look into its columns and determine whether it contains postal codes, enterprise identifiers, geographic coordinates, days of the week, and so on. We will show how applied to data from the French open data platform that we maintain (data.gouv.fr.), these specific types allow users to better control data in order to join or link datasets between them. We see our work as a first step and important stepping stone towards data integrity and validation checks while also facilitating the discoverability and contextualization of open datasets. Our approaches are evaluated by annotating and testing over thousands of columns extracted from more than 15 000 CSV files found in data.gouv.fr.

Keywords

csv, data quality, data types,

Powered by OpenAIRE graph
Found an issue? Give us feedback