Speeding Up the Web Crawling Process on a Multi-Core Processor Using Virtualization

Publication: Article, Other literature type
Date: 31 Mar 2013
Publisher: Academy and Industry Research Collaboration Center (AIRCC)
Journal: International Journal on Web Service Computing, volume 4, pages 19-37 (ISSN: 2230-7702, eISSN: 0976-9811)

Authors: Hussein Al-Bahadili; Hamzah Qtishat; Reyadh S. Naoum

doi: 10.5121/ijwsc.2013.4102, 10.5281/zenodo.3571797, 10.5281/zenodo.3571798

Abstract

A Web crawler is an important component of a Web search engine. It demands a large amount of hardware resources (CPU and memory) to crawl data from the rapidly growing and changing Web, and the crawling process must be repeated continually to keep the crawled data up to date. This paper develops and investigates the performance of a new approach to speeding up the crawling process on a multi-core processor through virtualization. In this approach, the multi-core processor is divided into a number of virtual machines (VMs) that run in parallel (concurrently), performing different crawling tasks on different data. The paper presents a description, implementation, and evaluation of a VM-based distributed Web crawler. To estimate the speedup factor achieved by the VM-based crawler over a non-virtualized crawler, extensive crawling experiments were carried out to measure the crawling times for various numbers of documents. Furthermore, the average crawling rate in documents per unit time is computed, and the effect of the number of VMs on the speedup factor is investigated. For example, on an Intel® Core™ i5-2300 CPU @ 2.80 GHz with 8 GB of memory, a speedup factor of ~1.48 is achieved when crawling 70000 documents on 3 and 4 VMs.
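The two metrics named in the abstract can be illustrated with a minimal Python sketch. It assumes the usual definitions: the speedup factor as the ratio of the non-virtualized crawling time to the VM-based crawling time, and the average crawling rate as documents crawled per unit time. The abstract does not spell out these formulas, so they are assumptions here. The function names and the timing values in the usage lines are hypothetical and are not measurements from the paper.

```python
def speedup_factor(t_baseline: float, t_vm: float) -> float:
    """Ratio of the non-virtualized crawling time to the VM-based crawling time.

    Assumed definition: S = T_baseline / T_vm, so S > 1 means the
    VM-based crawler finished faster.
    """
    if t_baseline <= 0 or t_vm <= 0:
        raise ValueError("crawling times must be positive")
    return t_baseline / t_vm


def crawling_rate(num_documents: int, elapsed: float) -> float:
    """Average crawling rate in documents per unit time (assumed: N / T)."""
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return num_documents / elapsed


if __name__ == "__main__":
    # Hypothetical timings in seconds, chosen only to show the arithmetic.
    t_no_vm = 300.0   # time to crawl the documents without virtualization
    t_with_vm = 200.0  # time to crawl the same documents on several VMs
    docs = 1000

    print(f"speedup factor: {speedup_factor(t_no_vm, t_with_vm):.2f}")
    print(f"crawling rate with VMs: {crawling_rate(docs, t_with_vm):.2f} docs/s")
```

Under these definitions, the reported speedup factor of ~1.48 would mean that the non-virtualized crawl took roughly 1.48 times as long as the VM-based crawl for the same 70000 documents.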

Related Organizations

Petra University, Jordan
Middle East University, Jordan

Keywords

distribution methodologies, virtual machines, processor-farm methodology, Web search engine, distributed crawling, multi-core processor, virtualization, Web crawler

Impact by BIP!

- Selected citations: 1. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- Influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.