Tractable near-optimal policies for crawling

Yossi Azar, Eric Horvitz, Eyal Lubetzky, Yuval Peres, Dafna Shahaf



Significance. We present a tractable algorithm that provides a near-optimal solution to the crawling problem, a fundamental challenge at the heart of web search: Given a large quantity of distributed and dynamic web content, what pages do we choose to update a local cache with the goal of serving up-to-date pages to client requests? Solving this optimization requires identifying the best set of pages to refresh given popularity rates and change rates—an intractable problem in the general case. To overcome this intractability, we show that the optimal randomized strategy can be efficiently determined (in near-linear time) and then use it to produce a deterministic policy that exhibits excellent performance in experiments.

Abstract. The problem of maintaining a local cache of n constantly changing pages arises in multiple mechanisms such as web crawlers and proxy servers. In these, the resources for polling pages for possible updates are typically limited. The goal is to devise a polling and fetching policy that maximizes the utility of served pages that are up to date. Cho and Garcia-Molina [(2003) ACM Trans Database Syst 28:390–426] formulated this as an optimization problem, which can be solved numerically for small values of n, but appears intractable in general. Here, we show that the optimal randomized policy can be found exactly in O(n log n) operations. Moreover, using the optimal probabilities to define in linear time a deterministic schedule yields a tractable policy that in experiments attains 99% of the optimum.
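
Illustrative sketch (not the authors' algorithm). The abstract distinguishes a randomized policy, which polls page i with some probability p_i at each step, from a deterministic schedule derived from those probabilities. The record does not say how the optimal p_i are computed or how the paper builds its schedule, so the Python below simply assumes the probabilities are given and swaps in a generic technique: an "earliest next due time" rule that polls page i roughly once every 1/p_i steps. All function names here are hypothetical.

```python
import heapq
import random
from collections import Counter


def randomized_schedule(probs, steps, seed=0):
    """Randomized policy: at each step poll page i with probability probs[i]."""
    rng = random.Random(seed)
    pages = list(range(len(probs)))
    return rng.choices(pages, weights=probs, k=steps)


def deterministic_schedule(probs, steps):
    """Generic deterministic stand-in (an assumption, not the paper's method).

    Page i is given a due time that advances by 1 / probs[i] after each poll;
    the page with the earliest due time is polled next.
    """
    heap = [(1.0 / p, i) for i, p in enumerate(probs) if p > 0]
    heapq.heapify(heap)
    schedule = []
    for _ in range(steps):
        due, i = heapq.heappop(heap)
        schedule.append(i)
        heapq.heappush(heap, (due + 1.0 / probs[i], i))
    return schedule


def empirical_frequencies(schedule, n_pages):
    """Fraction of polls that each page received in a schedule."""
    counts = Counter(schedule)
    return [counts[i] / len(schedule) for i in range(n_pages)]
```

For example, calling `empirical_frequencies(deterministic_schedule([0.5, 0.25, 0.25], 8), 3)` shows how closely the deterministic polls track the target probabilities, without the step-to-step variance of `randomized_schedule`. Whether such a rule comes close to the 99% figure reported in the abstract is not something this sketch can establish.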


Proceedings of the National Academy of Sciences, Vol. 115, No. 32. (07 August 2018), pp. 8099-8103, https://doi.org/10.1073/pnas.1801519115 
Key: INRMM:14623862
