Web page download scheduling policies for green web crawling
Date: 2014
Abstract
A web crawler is responsible for discovering new web pages on the Web as well as for refreshing the content of already downloaded pages. During these operations, it can issue a huge number of page download requests to web servers. These requests, in turn, increase the energy consumption of the servers, as hardware resources are used to serve the requested pages, with the side effect of increasing the servers' carbon footprint. In this work, we introduce the problem of green web crawling from a set of remote web servers, where the goal is to reduce the carbon footprint incurred by a large-scale web crawler. We consider a scenario where both the freshness of downloaded pages and the carbon emissions at remote servers need to be taken into account. We present various heuristics for prioritizing page download requests as a means to study the relative importance of different parameters. We conduct experiments on a real data set comprising a large collection of servers that host two billion pages. The results indicate that the carbon footprint generated by a crawler during its external operations can be considerably reduced without compromising the freshness of pages. Our work provides guidelines for large-scale commercial search engine companies that need to comply with greenness regulations. © 2014 FESB, University of Split.
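The abstract describes prioritizing download requests under two competing objectives, page freshness and per-request carbon emissions, but gives no code. As a rough illustration of that kind of scheduling, the Python sketch below scores each request by a weighted trade-off between an estimated staleness value and an estimated carbon cost, then serves requests in descending score order. All names, the linear scoring formula, and the weight `alpha` are illustrative assumptions, not the heuristics from the paper.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    priority: float              # negated score, so heapq pops the best first
    url: str = field(compare=False)

def score(staleness: float, carbon_g_per_req: float, alpha: float = 0.5) -> float:
    """Combine freshness benefit and carbon cost into one priority.

    A higher staleness estimate (time since last download, normalized
    to [0, 1]) raises the priority; a higher estimated per-request
    carbon emission at the hosting server lowers it. The weight `alpha`
    trades the two objectives off (assumed form, not from the paper).
    """
    return alpha * staleness - (1.0 - alpha) * carbon_g_per_req

def schedule(pages):
    """Yield URLs in descending priority order using a min-heap."""
    heap = [Request(-score(s, c), url) for url, s, c in pages]
    heapq.heapify(heap)
    while heap:
        yield heapq.heappop(heap).url

# Hypothetical input: (url, staleness estimate, grams CO2 per request).
pages = [
    ("http://a.example/news", 0.9, 0.2),
    ("http://b.example/blog", 0.4, 0.1),
    ("http://c.example/shop", 0.8, 0.9),
]
print(list(schedule(pages)))  # a, then b, then c: the high-emission page waits
```

In this toy run, the stale but low-emission page is fetched first, while the page hosted on the high-emission server is deferred, which mirrors the trade-off the paper studies between keeping pages fresh and limiting carbon emissions at remote servers.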