
Efficient full-text search is achieved by indexing the raw data, at an additional storage cost of 20 to 30 percent. In the context of Big Data, this additional storage space is huge and makes it challenging to serve full-text search queries with good performance. It also incurs overhead to store, manage, and update the large index. In this paper, we propose and evaluate a method that uses automatic extractive text summarization to minimize the size of the index needed to offer full-text search over Big Data. To evaluate the effectiveness of the proposed approach, we used two real-world datasets. We indexed the actual and summarized datasets using Apache Lucene and studied the average simple overlap, Spearman's rho correlation, and average ranking score of the search results obtained with different search queries. Our experimental evaluation shows that automatic text summarization is an effective method to reduce the index size significantly. Using the proposed solution, we obtained a maximum reduction of 82% in index size with 42% higher relevance of the search results.
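The comparison described above can be illustrated with a minimal Python sketch. It assumes each query yields two ranked lists of unique document IDs, one from the full-text index and one from the summary index. The function names, the top-k cutoff, and the exact definitions of the measures are illustrative assumptions, not the formulas used in the paper.

```python
def simple_overlap(full_hits, summary_hits, k=10):
    """Fraction of the top-k full-index hits also found in the top-k summary-index hits.

    Assumed definition; the paper may define simple overlap differently.
    """
    top_full = full_hits[:k]
    top_summary = set(summary_hits[:k])
    if not top_full:
        return 0.0
    shared = sum(1 for doc in top_full if doc in top_summary)
    return shared / len(top_full)


def spearman_rho(full_hits, summary_hits):
    """Spearman's rho over the documents returned by both indexes.

    Ranks are list positions, so there are no ties and the
    closed form 1 - 6 * sum(d^2) / (n * (n^2 - 1)) applies.
    Returns None when fewer than two documents are shared.
    """
    in_summary = set(summary_hits)
    common = [doc for doc in full_hits if doc in in_summary]
    n = len(common)
    if n < 2:
        return None
    # Re-rank the shared documents within each result list.
    rank_full = {doc: i for i, doc in enumerate(common)}
    summary_order = [doc for doc in summary_hits if doc in rank_full]
    rank_summary = {doc: i for i, doc in enumerate(summary_order)}
    d_squared = sum((rank_full[d] - rank_summary[d]) ** 2 for d in common)
    return 1 - 6 * d_squared / (n * (n * n - 1))


def index_size_reduction(full_bytes, summary_bytes):
    """Percentage by which the summary index is smaller than the full index."""
    return 100.0 * (full_bytes - summary_bytes) / full_bytes


if __name__ == "__main__":
    # Hypothetical result lists for one query.
    full = ["d1", "d2", "d3", "d4"]
    summary = ["d2", "d1", "d5", "d3"]
    print(simple_overlap(full, summary, k=4))
    print(spearman_rho(full, summary))
    print(index_size_reduction(100, 18))
```

Averaging `simple_overlap` and `spearman_rho` over a set of queries would give per-dataset figures comparable in spirit to the averaged measures mentioned in the abstract.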
FOS: Computer and information sciences, Web Data Extraction, FOS: Political science, Trajectory Data Mining and Analysis, Search engine indexing, FOS: Law, Automatic summarization, Web Data Extraction and Crawling Techniques, Big data, Context (archaeology), Artificial Intelligence, Information retrieval, Raw data, Data mining, Political science, Biology, Paleontology, Computer science, Programming language, Automatic Keyword Extraction from Textual Data, Top-k Query Processing, Overhead (engineering), World Wide Web, Data Records Mining, Operating system, Computer Science, Physical Sciences, Signal Processing, Information Retrieval, Relevance (law), Textual Data, Law, Information Systems, Index (typography)
| selected citations | 5 |
| popularity | Top 10% |
| influence | Average |
| impulse | Top 10% |
