publication . Conference object . Preprint . 2013

LZ-Compressed String Dictionaries

Arz, Julian; Fischer, Johannes;
Open Access
  • Published: 03 May 2013
  • Publisher: IEEE
Abstract
We show how to compress string dictionaries using the Lempel-Ziv (LZ78) data compression algorithm. Our approach is validated experimentally on dictionaries of up to 1.5 GB of uncompressed text. We achieve compression ratios often outperforming the existing alternatives, especially on dictionaries containing many repeated substrings. Our query times remain competitive.
Subjects
ACM Computing Classification System: Data_CODINGANDINFORMATIONTHEORYTheoryofComputation_MISCELLANEOUS
free text keywords: Computer science, Pattern recognition, Data structure, Encoding (memory), Compression ratio, Search engine indexing, Lossless compression, Speech recognition, Substring, Data compression, Theoretical computer science, Artificial intelligence, business.industry, business, Uncompressed video, Computer Science - Data Structures and Algorithms

1. P. Boldi, B. Codenotti, M. Santini, and S. Vigna. Ubicrawler: A scalable fully distributed web crawler. Software: Practice & Experience, 34(8):711{726, 2004. [OpenAIRE]

2. N. R. Brisaboa, R. Canovas, F. Claude, M. A. Mart nez-Prieto, and G. Navarro. Compressed string dictionaries. In Proc. SEA, volume 6630 of LNCS, pages 136{ 147. Springer, 2011. [OpenAIRE]

3. P. Ferragina, R. Grossi, A. Gupta, R. Shah, and J. S. Vitter. On searching compressed string collections cache-obliviously. In Proc. PODS, pages 181{190. ACM, 2008. [OpenAIRE]

4. P. Ferragina and G. Navarro. The Pizza&Chili Corpus. http://pizzachili.dcc. uchile.cl/texts/dna/.

5. P. Ferragina and R. Venturini. The compressed permuterm index. ACM Transactions on Algorithms, 7(1):Article No. 10, 2010. [OpenAIRE]

6. S. Gog. Succinct Data Structures Library. http://simongog.github.io/sdsl/.

7. R. Grossi and G. Ottaviano. Fast compressed tries through path decompositions. In Proc. ALENEX, pages 65{74. SIAM Press, 2012. [OpenAIRE]

8. J.-B. Michel et al. Quantitative analysis of culture using millions of digitized books. Science, 331(6014):176{182, 2011.

9. J. I. Munro and V. Raman. Succinct representation of balanced parentheses and static trees. SIAM J. Comput., 31(3):762{776, 2001.

10. D. Okanohara. Tx: Succinct trie data structure. http://code.google.com/p/ tx-trie/.

11. J. Ziv and A. Lempel. Compression of individual sequences via variable length coding. IEEE Trans. Inform. Theory, 24(5):530{536, 1978. [OpenAIRE]

Powered by OpenAIRE Open Research Graph
Any information missing or wrong?Report an Issue
publication . Conference object . Preprint . 2013

LZ-Compressed String Dictionaries

Arz, Julian; Fischer, Johannes;