Structural analysis and classification of search interfaces for the deep web

A large portion of Web content cannot be crawled by general-purpose search engines because it is generated only after a valid submission to a search interface. Accessing such content therefore requires locating and identifying search interfaces. Many of the approaches proposed to automate this task rely on manually defined rules for identifying query interfaces. In this paper, we propose a rule induction approach that automatically constructs a set of rules by searching the most promising subspace of all possible rules with a brute-force method and information-theoretic criteria. To specify the features used in the rules, we first perform a descriptive analysis of Yahoo L11, a specialized dataset containing complex interfaces that, to the best of our knowledge, has not been used in previous work. We perform a series of evaluations and present the rules constructed by running the algorithm on a random sample of the Yahoo L11 dataset and on another dataset used in similar work. The resulting rules yield high classification accuracy in predicting the functionality of new, previously unseen forms, and because humans can easily interpret them, they can be ported to any application as-is. © 2018 The British Computer Society. All rights reserved.
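
As a rough illustration of what an information-theoretic criterion for rule selection can look like, the sketch below scores boolean form features by information gain and picks the best single-feature rule. It is not the algorithm from the paper: the feature names (has_text_input, has_password_input, uses_get), the toy forms, and the labels are all hypothetical, and a real rule learner would search over conjunctions of features rather than single tests.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Reduction in label entropy obtained by splitting on a boolean feature."""
    covered = [y for x, y in zip(rows, labels) if x[feature]]
    uncovered = [y for x, y in zip(rows, labels) if not x[feature]]
    total = len(labels)
    remainder = (len(covered) / total) * entropy(covered) \
        + (len(uncovered) / total) * entropy(uncovered)
    return entropy(labels) - remainder


def best_single_feature_rule(rows, labels):
    """Exhaustively score every feature and return the most informative one."""
    features = sorted(rows[0])
    return max(features, key=lambda f: information_gain(rows, labels, f))


# Hypothetical toy data: each dict describes one HTML form.
FORMS = [
    {"has_text_input": True, "has_password_input": False, "uses_get": True},
    {"has_text_input": True, "has_password_input": True, "uses_get": False},
    {"has_text_input": True, "has_password_input": False, "uses_get": True},
    {"has_text_input": False, "has_password_input": False, "uses_get": False},
]
LABELS = ["search", "other", "search", "other"]

if __name__ == "__main__":
    print(best_single_feature_rule(FORMS, LABELS))
```

On this toy data the uses_get feature separates the two classes perfectly, so it receives the highest gain. A rule of this form stays human-readable, which is the property the abstract highlights.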

URI

http://hdl.handle.net/11615/74972

Collections

Publications in journals, conferences, book chapters, etc. [19735]