Monday, March 25, 2013

Explicit web search result diversification


A couple of weeks ago I successfully defended my PhD thesis at the School of Computing Science of the University of Glasgow. The thesis, entitled “Explicit web search result diversification”, was unconditionally approved with no corrections by the examination board.

The thesis tackles the problem of ambiguity in web search queries. Given the enormous size of the Web, misunderstanding the information need underlying an ambiguous query can misguide the search engine, ultimately leading the user to abandon the originally submitted query. To overcome this problem, a sensible approach is to diversify the documents retrieved for the user's query, which increases the likelihood that at least one of these documents will satisfy the user's actual information need.

In the thesis, we argue that an ambiguous query should be seen as representing not one, but multiple information needs. Based upon this premise, we propose xQuAD (Explicit Query Aspect Diversification), a novel probabilistic framework for search result diversification. The xQuAD framework models several dimensions of the search result diversification problem in a principled yet practical manner. To this end, the framework represents the possible information needs underlying a query as a set of keyword-based sub-queries. Moreover, xQuAD accounts for the overall coverage of each retrieved document with respect to the identified sub-queries, so as to rank highly diverse documents first. In addition, it accounts for how well each sub-query is covered by the other retrieved documents, so as to promote novelty, and hence penalise redundancy, in the ranking. The framework also models the importance of each of the identified sub-queries, so as to appropriately cater for the interests of the user population when diversifying the retrieved documents. Finally, since not all queries are equally ambiguous, the xQuAD framework caters for the ambiguity level of different queries, so as to appropriately trade off relevance for diversity on a per-query basis.
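To make these dimensions concrete, here is a minimal sketch of a greedy re-ranker in the spirit of xQuAD. It is not the reference implementation: the function name, the dictionary-based inputs, and the toy probabilities are assumptions made for illustration. It assumes that relevance P(d|q), sub-query importance P(s|q), and coverage P(d|s) have already been estimated, and that a single parameter `lam` controls the trade-off between relevance and diversity.

```python
from typing import Dict, List


def xquad_rerank(
    relevance: Dict[str, float],             # P(d|q): relevance of each document to the query
    importance: Dict[str, float],            # P(s|q): importance of each sub-query
    coverage: Dict[str, Dict[str, float]],   # P(d|s): coverage[s][d], how well d covers s
    lam: float = 0.5,                        # trade-off: 0 = pure relevance, 1 = pure diversity
    k: int = 10,
) -> List[str]:
    """Greedily pick up to k documents, mixing relevance with diversity.

    All argument names are hypothetical; they are not taken from Terrier
    or from any other implementation.
    """
    selected: List[str] = []
    candidates = set(relevance)
    # For each sub-query s, the probability that no selected document
    # covers it yet: the product of (1 - P(d_j|s)) over selected d_j.
    not_covered = {s: 1.0 for s in importance}

    def score(d: str) -> float:
        # Coverage weighted by importance and by novelty (not_covered).
        diversity = sum(
            importance[s] * coverage[s].get(d, 0.0) * not_covered[s]
            for s in importance
        )
        return (1.0 - lam) * relevance[d] + lam * diversity

    while candidates and len(selected) < k:
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
        # Sub-queries covered by the chosen document become less novel.
        for s in importance:
            not_covered[s] *= 1.0 - coverage[s].get(best, 0.0)
    return selected


# Toy example: d1 and d2 cover the same sub-query, d3 covers another.
relevance = {"d1": 0.9, "d2": 0.8, "d3": 0.5}
importance = {"s1": 0.5, "s2": 0.5}
coverage = {"s1": {"d1": 0.9, "d2": 0.9}, "s2": {"d3": 0.9}}

print(xquad_rerank(relevance, importance, coverage, lam=0.5, k=3))
# → ['d1', 'd3', 'd2']
```

In this toy example, d2 is more relevant than d3 but redundant with d1, so d3 is promoted to the second position. With `lam=0.0`, the ranking falls back to pure relevance order.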

The xQuAD framework is general and can be used to instantiate several diversification models, including the most prominent models described in the literature. In particular, within xQuAD, each of the aforementioned dimensions of the search result diversification problem can be tackled in a variety of ways. In this thesis, as additional contributions besides the xQuAD framework, we introduce novel machine learning approaches for addressing each of these dimensions. These include a learning to rank approach for identifying effective sub-queries as query suggestions mined from a query log, an intent-aware approach for choosing the ranking models most likely to be effective for estimating the coverage and novelty of multiple documents with respect to a sub-query, and a selective approach for automatically predicting how much to diversify the documents retrieved for each individual query. In addition, we perform the first empirical analysis of the role of novelty as a diversification strategy for web search.

As demonstrated throughout the thesis, the principles underlying the xQuAD framework are general, sound, and effective. In particular, to validate the contributions of this thesis, we thoroughly assess the effectiveness of xQuAD under the standard experimentation paradigm provided by the diversity task of the TREC 2009, 2010, and 2011 Web tracks. In this investigation, xQuAD attains consistent and significant improvements over the most effective diversification approaches in the literature, across a range of experimental conditions comprising multiple input rankings, multiple sub-query generation and coverage estimation mechanisms, and queries with multiple levels of ambiguity.

These investigations led to the publication of 12 peer-reviewed research papers and 5 evaluation forum reports directly related to the thesis. Moreover, the thesis opened up directions for other researchers, who deployed and extended the xQuAD framework for different applications, and inspired a series of workshops on Diversity in Document Retrieval as well as a research track at the internationally renowned NTCIR forum. From a practical perspective, xQuAD has been subjected to scrutiny from the research community as a regular contender in both TREC and NTCIR. As the winning entry in all editions of the diversity task of the TREC Web track (best cat. B submission in TREC 2009 and TREC 2010; best overall submission in TREC 2011 and TREC 2012), we believe that the xQuAD framework has secured its place in the state of the art.

The thesis is now available online at http://theses.gla.ac.uk/4106/. In addition, a reference implementation of the xQuAD framework will feature in the next major release of the open-source Terrier Information Retrieval Platform.
