Big Data Full-Text Search Index Minimization Using Text Summarization

Publication: Article, Other literature type, 17 Jun 2021
Publisher: Kaunas University of Technology (KTU)
Journal: Information Technology and Control, volume 50, pages 375-389 (ISSN: 1392-124X, eISSN: 2335-884X)

Authors: Waheed Iqbal; Waqas Ilyas Malik; Faisal Bukhari; Khaled Mellouli; Zubiar Nawaz

doi: 10.5755/j01.itc.50.2.25470, 10.60692/3w81m-1dx45, 10.60692/8733b-40064

Abstract

An efficient full-text search is achieved by indexing the raw data with an additional 20 to 30 percent storage cost. In the context of Big Data, this additional storage space is huge and introduces challenges to entertain full-text search queries with good performance. It also incurs overhead to store, manage, and update the large size index. In this paper, we propose and evaluate a method to minimize the index size to offer full-text search over Big Data using an automatic extractive-based text summarization method. To evaluate the effectiveness of the proposed approach, we used two real-world datasets. We indexed actual and summarized datasets using Apache Lucene and studied average simple overlapping, Spearman’s rho correlation, and average ranking score measures of search results obtained using different search queries. Our experimental evaluation shows that automatic text summarization is an effective method to reduce the index size significantly. We obtained a maximum of 82% reduction in index size with 42% higher relevance of the search results using the proposed solution to minimize the full-text index size.
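
As a rough illustration of two of the measures named in the abstract, the Python sketch below compares the ranked hits returned for one query by a full-text index and by a summary-based index. The function names, the sample document IDs, and the exact definitions used here (overlap as the shared fraction of the full-index hits, Spearman's rho computed over the shared documents only) are assumptions made for illustration; the paper's own formulas may differ.

```python
def simple_overlap(full_hits, summary_hits):
    """Fraction of the full-index hits that the summary index also returns.

    Assumed definition, not taken from the paper.
    """
    full_set = set(full_hits)
    if not full_set:
        return 0.0
    return len(full_set & set(summary_hits)) / len(full_set)


def spearman_rho(full_hits, summary_hits):
    """Spearman's rho between the two rankings, over shared documents only.

    Each list is assumed to hold unique document IDs in rank order, so there
    are no ties and the closed form 1 - 6*sum(d^2) / (n*(n^2 - 1)) applies.
    Returns None when fewer than two documents are shared.
    """
    summary_set = set(summary_hits)
    shared = [doc for doc in full_hits if doc in summary_set]
    n = len(shared)
    if n < 2:
        return None
    # Re-rank the shared documents within each list (0 = best).
    rank_full = {doc: i for i, doc in enumerate(shared)}
    shared_in_summary = [doc for doc in summary_hits if doc in rank_full]
    rank_summary = {doc: i for i, doc in enumerate(shared_in_summary)}
    d_squared = sum((rank_full[doc] - rank_summary[doc]) ** 2 for doc in shared)
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical ranked results for a single query.
full_hits = ["d1", "d2", "d3", "d4"]
summary_hits = ["d2", "d1", "d3", "d5"]
print(simple_overlap(full_hits, summary_hits))  # 0.75
print(spearman_rho(full_hits, summary_hits))    # 0.5
```

Averaging these per-query values over a set of queries would give figures comparable in spirit to the averaged measures the abstract reports.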

Related Organizations

University of the Punjab
Pakistan
Prince Sultan University
Saudi Arabia

Keywords

FOS: Computer and information sciences, Web Data Extraction, FOS: Political science, Trajectory Data Mining and Analysis, Search engine indexing, FOS: Law, Automatic summarization, Web Data Extraction and Crawling Techniques, Big data, Context (archaeology), Artificial Intelligence, Information retrieval, Raw data, Data mining, Political science, Biology, Paleontology, Computer science, Programming language, Automatic Keyword Extraction from Textual Data, Top-k Query Processing, Overhead (engineering), World Wide Web, Data Records Mining, Operating system, Computer Science, Physical Sciences, Signal Processing, Information Retrieval, Relevance (law), Textual Data, Law, Information Systems, Index (typography)

Impact by BIP!

selected citations: 5. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Top 10%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

gold

Fields of Science

engineering and technology

electrical engineering, electronic engineering, information engineering

Related to Research communities

Knowmad Institut