
Today, open data platforms host a wide and heterogeneous catalog of datasets. However, these datasets are often neglected in Machine Learning (ML) and related tasks. This happens mainly because few open data catalogs specialize in ML applications, and because it is often unclear whether ML algorithms would be suitable for, and perform well on, such datasets. As a result, many open datasets go unused, although the ML community could leverage them to explain, evaluate, and challenge existing methods on real open data. For instance, these real-world data could be used by professors teaching ML courses, by students taking these courses, and by researchers testing current and novel ML approaches; they could also help promote the intersection of open data, ML, and public policy. In this talk we will show how we are tackling this issue by working on datasets from data.gouv.fr (DGF), the French government open data platform. We aim to answer the question of what makes a dataset suitable and well performing for ML tasks by leveraging open source tools. Our goal is to establish a first, small empirical assessment of the characteristics of a dataset (size, balance of its categorical variables, and so on) that make it a “good fit” for ML algorithms. Specifically, we first manually select an adequate subset of datasets from DGF. Second, we perform statistical profiling on each of these datasets. Third, we automatically train and validate a set of ML algorithms on them, and we cluster the datasets according to their evaluation results. These steps help us better understand the nature of each dataset and thus determine which ones seem suitable for ML applications. Based on these datasets, and inspired by existing resources, we build the first version of a catalog of open datasets for ML. We hope that this platform will be a first stepping stone towards the reuse of open datasets in ML contexts.
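To make the profiling step more concrete, here is a minimal sketch of what such a profile might compute for one dataset. It is an illustration under stated assumptions, not the tooling used in the talk: the function names `balance` and `profile`, the list-of-dicts input format, and the choice of normalized Shannon entropy as the balance measure are all hypothetical.

```python
import math
from collections import Counter


def balance(values):
    """Normalized Shannon entropy of a categorical column.

    Returns 1.0 when all categories are equally frequent and 0.0 when
    only one category is present. (Assumed balance measure.)
    """
    counts = Counter(values)
    if len(counts) < 2:
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))


def profile(rows):
    """Summarize a dataset given as a list of dicts, one dict per record.

    A column is treated as categorical when every row holds a string
    value for it. (Hypothetical heuristic for this sketch.)
    """
    columns = sorted({key for row in rows for key in row})
    categorical = [
        c for c in columns if all(isinstance(row.get(c), str) for row in rows)
    ]
    return {
        "n_rows": len(rows),
        "n_columns": len(columns),
        "balance": {c: balance([row[c] for row in rows]) for c in categorical},
    }
```

Profiles of this kind, computed per dataset, could then be compared against the evaluation results of the trained models to see which characteristics go together with good performance.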
machine learning, open data