publication . Master thesis . 2016

Multi-Hypothesis Parsing of Tabular Data in Comma-Separated Values (CSV) Files

Döhmen, Till;
Open Access English
  • Published: 01 Aug 2016
  • Country: Netherlands
Abstract
Tabular data on the web comes in various formats and shapes. Preparing such data for analysis and integration requires manual steps that go beyond simply parsing it. These steps include configuring the parser correctly, removing meaningless rows, casting data types, and reshaping the table structure. The goal of this thesis is to develop a robust and modular system that automatically transforms messy CSV data sources into a tidy tabular data structure. The highly diverse corpus of CSV files from the UK open data hub serves as the basis for evaluating the system.
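As a rough illustration of the preparation steps named in the abstract (parser configuration, removal of meaningless rows, type casting), the sketch below tries several delimiter hypotheses and keeps the one that yields the most consistent table. It is not the system developed in the thesis: the function names (`parse_with`, `score`, `best_hypothesis`, `cast`), the candidate delimiters, and the scoring heuristic are all assumptions chosen for brevity.

```python
import csv
import io
from collections import Counter


def parse_with(text, delimiter):
    """Parse the text under one hypothesis about the delimiter."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


def score(rows):
    """Assumed heuristic: share of non-empty rows that have the most common width.

    Hypotheses that produce only single-column rows score 0.
    """
    widths = Counter(len(r) for r in rows if r)
    if not widths:
        return 0.0
    width, count = widths.most_common(1)[0]
    if width < 2:
        return 0.0
    return count / sum(widths.values())


def best_hypothesis(text, delimiters=",;\t|"):
    """Parse under every candidate delimiter and keep the best-scoring result."""
    candidates = [(score(parse_with(text, d)), d) for d in delimiters]
    _, best_delim = max(candidates, key=lambda c: c[0])
    return best_delim, parse_with(text, best_delim)


def cast(cell):
    """Cast a cell to int or float where possible, otherwise keep the string."""
    for convert in (int, float):
        try:
            return convert(cell)
        except ValueError:
            pass
    return cell


# Hypothetical messy input: a title row followed by a semicolon-separated table.
sample = "Report 2016\nname;age;height\nAnn;34;1.70\nBob;29;1.82\n"

delim, rows = best_hypothesis(sample)
# Drop rows whose width differs from the dominant one (here, the title row).
width = Counter(len(r) for r in rows).most_common(1)[0][0]
table = [[cast(c) for c in r] for r in rows if len(r) == width]
```

A real system would need many more hypotheses (encoding, quoting, header position, wide versus long layout) and a more careful ranking; the point here is only the pattern of generating alternative parses and scoring them rather than committing to a single parser configuration up front.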
Download from
Repository CWI Amsterdam
Provider: NARCIS
