
The growing volume and complexity of data demand more efficient content search and retrieval systems. This paper presents an approach that combines Big Data techniques with advanced Natural Language Processing (NLP) models to improve the accuracy, relevance, and scalability of search. Our aim is a system that uses distributed Big Data infrastructure for data processing and state-of-the-art NLP models for semantic query interpretation. We evaluate the system on three datasets: Common Crawl (web content), Medical Text Mining, and Amazon Product Reviews, and compare it against traditional keyword-based search and against TF-IDF-based and Word2Vec-based approaches. The experimental results show that our system achieves higher precision, recall, F1-score, and Mean Average Precision (MAP) than these baselines while keeping query response time reasonable. Combining Big Data and NLP yielded results that were more relevant and more contextually aware. This work is a step toward better content search across many application domains: it enables more accurate and efficient retrieval and supports a more personalized search experience. The proposed integration of Big Data infrastructure with advanced NLP models enables scalable and semantically rich retrieval, addressing key limitations of existing keyword-centric and shallow semantic search systems.
Big Data, Natural Language Processing, Content Search, Semantic Search, Precision, Information Retrieval
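
For readers unfamiliar with the evaluation metrics named in the abstract, the following is a minimal sketch of how precision, recall, F1-score, and MAP are conventionally computed for ranked retrieval. It is not the authors' evaluation code: the function names, the document identifiers, and the assumption that each ranked list contains no duplicate documents are all illustrative.

```python
from typing import Iterable, Sequence, Set, Tuple


def precision_recall_f1(ranked: Sequence[str], relevant: Set[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 for one query's retrieved list."""
    retrieved = set(ranked)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Average of precision@k over the ranks k where a relevant document appears.

    Assumes `ranked` has no duplicates (an assumption of this sketch).
    Relevant documents that are never retrieved contribute zero.
    """
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Mean of average precision over (ranked list, relevant set) pairs, one per query."""
    scores = [average_precision(ranked, relevant) for ranked, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Hypothetical query results, used only to exercise the functions.
    q1 = (["d1", "d2", "d3", "d4"], {"d1", "d3", "d5"})
    q2 = (["a", "b"], {"b"})
    print(precision_recall_f1(*q1))
    print(mean_average_precision([q1, q2]))
```

Precision, recall, and F1 here ignore ranking order, whereas average precision rewards placing relevant documents early, which is why the two kinds of metric are usually reported together.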
