
handle: 2434/579265, 2434/495566, 11568/790766, 11568/919603, 2158/1149243, 2158/1149276
Although web crawlers have been around for twenty years, there is virtually no freely available, open-source crawling software that guarantees high throughput, overcomes the limits of single-machine systems, and, at the same time, scales linearly with the amount of resources available. This article aims to fill this gap by describing BUbiNG, our next-generation web crawler, built upon the authors’ experience with UbiCrawler [9] and on the last ten years of research on the topic. BUbiNG is an open-source, fully distributed crawler written in Java; a single BUbiNG agent running on sizeable hardware can crawl several thousand pages per second while respecting strict politeness constraints, both host- and IP-based. Unlike existing open-source distributed crawlers that rely on batch techniques (like MapReduce), BUbiNG's job distribution is based on modern high-speed protocols to achieve very high throughput.
web crawling; centrality measures; distributed systems; Social and Information Networks (cs.SI); Information Retrieval (cs.IR); FOS: Computer and information sciences; Computer Networks and Communications; Software
| citations | An alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | 86 |
| popularity | Reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Top 1% |
| influence | Reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Top 10% |
| impulse | Reflects the initial momentum of an article directly after its publication, based on the underlying citation network. | Top 10% |
