Web page download scheduling policies for green web crawling

Hatzi, V.; Cambazoglu, B. B.; Koutsopoulos, I.

dc.creator	Hatzi, V.	en
dc.creator	Cambazoglu, B. B.	en
dc.creator	Koutsopoulos, I.	en
dc.date.accessioned	2015-11-23T10:29:59Z
dc.date.available	2015-11-23T10:29:59Z
dc.date.issued	2014
dc.identifier	10.1109/SOFTCOM.2014.7039136
dc.identifier.isbn	9789532900521
dc.identifier.uri	http://hdl.handle.net/11615/28451
dc.description.abstract	A web crawler is responsible for discovering new web pages on the Web as well as for refreshing the content of already downloaded pages. During these operations, it can issue a huge number of page download requests to the servers in the Web. These requests, in turn, increase the energy consumption of the servers as hardware resources are used when serving the requested pages. This has the side-effect of increasing the carbon footprint of servers. In this work, we introduce the problem of green web crawling from a set of remote web servers, where the goal is to reduce the carbon footprint incurred by a large-scale web crawler. We consider a scenario where both freshness of downloaded pages and carbon emissions at remote servers need to be taken into account. We present various heuristics for prioritizing the page download requests as a means to study the relative importance of different parameters. We conduct experiments on a real data set comprising a large server collection with two billion pages. The results indicate that the carbon footprint generated by a crawler during its external operations can be considerably reduced without compromising the freshness of pages. Our work draws guidelines for the design of large-scale commercial search engine companies, which need to comply with certain greenness regulations. © 2014 FESB, University of Split.	en
dc.source.uri	http://www.scopus.com/inward/record.url?eid=2-s2.0-84933558291&partnerID=40&md5=790d358cddcd7fbac3386c016d1c9b24
dc.subject	Carbon footprint	en
dc.subject	Energy utilization	en
dc.subject	Environmental impact	en
dc.subject	Search engines	en
dc.subject	Websites	en
dc.subject	Carbon emissions	en
dc.subject	Engine companies	en
dc.subject	Hardware resources	en
dc.subject	Real data sets	en
dc.subject	Remote servers	en
dc.subject	Scheduling policies	en
dc.subject	Web crawlers	en
dc.subject	Web Crawling	en
dc.subject	Social networking (online)	en
dc.title	Web page download scheduling policies for green web crawling	en
dc.type	conferenceItem	en

Files in this item

Files	Size	Type	View
There are no files associated with this item.

This item appears in the following collections

Publications in journals, conferences, book chapters, etc. [19735]