Compression of Textual Column-Oriented Data

Publication: Article, 01 Jan 2018
Publisher: Central Library of the Slovak Academy of Sciences
Journal: Computing and Informatics, volume 37, pages 405-423 (eissn: 2585-8807)

Authors: Vinicius Fulber-Garcia; Sérgio Luis Sardi Mergen

doi: 10.4149/cai_2018_2_405


Abstract

Column-oriented data are well suited for compression. Because values of the same column are stored contiguously on disk, the information entropy is lower than in the physical data organization of conventional databases. There are many useful lightweight compression techniques targeted at specific data types and domains, such as integers and small lists of distinct values, respectively. However, compression of textual values formed by skewed and high-cardinality words is usually restricted to variations of the LZ compression algorithm. So far, no empirical evaluations have verified how other sophisticated compression methods handle columnar data that store text. In this paper we shed light on this subject by revisiting the concepts behind those algorithms. We also analyse how they behave in terms of compression and speed when dealing with textual columns where values appear in adjacent positions.
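The layout argument in the abstract can be explored with a small experiment. The following is a minimal sketch and not the authors' benchmark: the three-column table, the word and city lists, and the helper names (make_rows, row_layout, column_layout, compressed_sizes) are all hypothetical. It uses only Python standard-library codecs as stand-ins for the algorithm families named in the keywords: zlib and lzma for LZ, bz2 for BWT. The standard library has no PPM codec, so that family is not covered.

```python
import bz2
import lzma
import random
import zlib


def make_rows(n=2000, seed=0):
    """Build a hypothetical three-column table: id, city, free-text note."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    cities = ["Santa Maria", "Porto Alegre", "Curitiba", "Bratislava"]
    words = ["column", "store", "text", "entropy", "page", "value", "disk", "word"]
    rows = []
    for i in range(n):
        note = " ".join(rng.choice(words) for _ in range(5))
        rows.append((str(i), rng.choice(cities), note))
    return rows


def row_layout(rows):
    """Row-oriented bytes: the fields of each record are stored together."""
    return "\n".join("|".join(r) for r in rows).encode("utf-8")


def column_layout(rows):
    """Column-oriented bytes: all values of one column are stored contiguously."""
    return "\n".join("|".join(col) for col in zip(*rows)).encode("utf-8")


def compressed_sizes(data):
    """Compressed size in bytes under three standard-library codecs."""
    return {
        "zlib (LZ77 + Huffman)": len(zlib.compress(data, 9)),
        "bz2 (BWT-based)": len(bz2.compress(data, 9)),
        "lzma (LZ-based)": len(lzma.compress(data)),
    }


if __name__ == "__main__":
    rows = make_rows()
    for name, data in (("row", row_layout(rows)), ("column", column_layout(rows))):
        print(name, len(data), compressed_sizes(data))
```

Both layouts serialize the same values with the same total number of separator bytes, so their uncompressed sizes match and any difference in compressed size comes from the ordering alone. How large that difference is, and which codec benefits most, depends on the data; synthetic text like this says nothing about the skewed, high-cardinality columns the paper evaluates.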

Related Organizations

Universidade Federal de Santa Maria
Brazil

Keywords

PPM, entropy encoding, DSM, LZ, BWT, 68P30, column-oriented databases, Compression, PAX, NSM



Fields of Science

engineering and technology

electrical engineering, electronic engineering, information engineering
