A New Method for Short Text Compression

Keywords: Machine Learning, Language Identification, Text Categorization, k-Means, Clustering, Text Compression, Electrical engineering. Electronics. Nuclear engineering

Murat Aslanyürek; Altan Mesut


IEEE Access

Article . 2023 . Peer-reviewed

License: CC BY NC ND

Data sources: Crossref


Article, 01 Jan 2023, Turkey. Publisher: Institute of Electrical and Electronics Engineers (IEEE). Journal: IEEE Access, volume 11, pages 141022-141035 (eISSN: 2169-3536)


doi: 10.1109/access.2023.3340436


Abstract

Short texts cannot be compressed effectively with general-purpose compression methods. Methods developed to compress short texts often use static dictionaries. Achieving a high compression ratio requires a static dictionary suited to the text being compressed, and choosing such a dictionary is an important problem to solve. This study proposes a method called WSDC (Word-based Static Dictionary Compression), which can compress short texts at a high ratio, together with a model that uses iterative clustering to create the static dictionaries used by this method. The k-Means clustering algorithm is run iteratively according to a set of rules, so the number of static dictionaries created can vary. A method called DSWF (Dictionary Selection by Word Frequency) is also presented to determine which of the created dictionaries compresses the source text at the best ratio. Wikipedia article abstracts in six different languages were used as the dataset in the experiments. WSDC is compared with both general-purpose compression methods (Gzip, Bzip2, PPMd, Brotli and Zstd) and methods designed specifically for short texts (shoco, b64pack and smaz). According to the test results, although WSDC is slower than some of the other methods, it achieves the best compression ratios for short texts smaller than 200 bytes, and for short texts smaller than 1000 bytes it outperforms every other method except Zstd.
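The abstract names two ideas that a small example can make concrete: replacing words with codes from a static dictionary, and choosing among several dictionaries by looking at the words in the source text. The Python sketch below is a toy illustration only and is not the published WSDC or DSWF algorithm. The one-byte codes, the escape byte, the coverage-based selection rule, and all function names are assumptions made for this example; dictionary creation by iterative k-Means clustering is not shown.

```python
from collections import Counter

ESCAPE = 255  # hypothetical marker byte: the next word is stored literally


def build_dictionary(corpus, size=255):
    """Return up to `size` most frequent words of a training corpus."""
    if size > ESCAPE:
        raise ValueError("codes 0..254 are the only ones available")
    counts = Counter(word for text in corpus for word in text.split())
    return [word for word, _ in counts.most_common(size)]


def select_dictionary(text, dictionaries):
    """Return the index of the dictionary covering the most words of `text`."""
    words = text.split()

    def coverage(dictionary):
        vocab = set(dictionary)
        return sum(1 for word in words if word in vocab)

    return max(range(len(dictionaries)), key=lambda i: coverage(dictionaries[i]))


def compress(text, dictionary):
    """Encode known words as one byte; escape unknown words as raw UTF-8."""
    index = {word: code for code, word in enumerate(dictionary)}
    out = bytearray()
    for word in text.split(" "):
        if word in index:
            out.append(index[word])
        else:
            raw = word.encode("utf-8")
            if len(raw) > 255:
                raise ValueError("word too long for a one-byte length field")
            out.append(ESCAPE)
            out.append(len(raw))
            out += raw
    return bytes(out)


def decompress(data, dictionary):
    """Invert `compress` using the same dictionary."""
    words = []
    i = 0
    while i < len(data):
        code = data[i]
        if code == ESCAPE:
            length = data[i + 1]
            words.append(data[i + 2:i + 2 + length].decode("utf-8"))
            i += 2 + length
        else:
            words.append(dictionary[code])
            i += 1
    return " ".join(words)


if __name__ == "__main__":
    english = build_dictionary(["the cat sat on the mat", "the dog sat on the log"])
    turkish = build_dictionary(["bir kedi ve bir köpek", "bir ev ve bir bahçe"])
    text = "the cat sat on the sofa"
    chosen = select_dictionary(text, [english, turkish])
    packed = compress(text, [english, turkish][chosen])
    print(chosen, len(text.encode("utf-8")), len(packed))
```

In this sketch a decoder would also need to know which dictionary was chosen, for example through a leading index byte; that detail is left out to keep the example short.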

Country

Turkey

Related Organizations

Trakya University
Turkey


Related research products

WSDC software on GitHub
smaz software on GitHub

