Monday, June 25, 2012

Efficiency, Effectiveness, Medical Search, Dataset Development and Crowdsourcing at SIGIR 2012

The TerrierTeam will be well represented at SIGIR 2012 with a full paper, four posters, a demonstration and a workshop, covering a wide range of disciplines within the field of information retrieval. For those of you interested in Web search efficiency, we have a number of contributions to look for. Our full paper Learning to Predict Response Times for Online Query Scheduling defines the new area of query efficiency prediction. In particular, it postulates that not every query takes the same time to complete, particularly where efficient dynamic pruning strategies such as WAND are used to reduce retrieval latency. In our paper, we show and explain why queries with similar properties (e.g. posting list lengths) can have markedly different response times. We use these explanations to propose a learned approach for query efficiency prediction that can accurately predict the response time of a query before it is executed. Furthermore, we show that using query efficiency prediction can markedly increase the efficiency of query routing within a search engine that uses multiple replicated indices. Relatedly, our poster Scheduling Queries Across Replicas builds upon our work on query efficiency prediction to show how a replicated and distributed search engine can be improved by the application of response time predictions. In particular, the response time predictions are used to estimate the workload of each replica of each index shard. Each newly arrived query can then be routed, for each index shard, to the replica that will be ready to process the query earliest.
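To make the routing idea concrete, here is a minimal sketch in Python. It is not the implementation from the paper or poster: the function name `route_query`, the per-shard list of predicted times, and the representation of each replica's workload as a single number of queued milliseconds are all assumptions made for illustration. The sketch also ignores that queues drain as time passes.

```python
def route_query(predicted_ms, backlog_ms):
    """Pick one replica per index shard for a newly arrived query.

    predicted_ms: predicted response time of the query on each shard
                  (hypothetical output of a query efficiency predictor).
    backlog_ms:   backlog_ms[s][r] is the total predicted time of the
                  queries already queued at replica r of shard s.
    Returns the chosen replica index for each shard and updates the
    backlog in place.
    """
    assignment = []
    for shard, replicas in enumerate(backlog_ms):
        # The replica with the smallest queued workload is the one
        # that will be ready to start this query earliest.
        chosen = min(range(len(replicas)), key=lambda r: replicas[r])
        # Account for the new query so later arrivals see the load.
        replicas[chosen] += predicted_ms[shard]
        assignment.append(chosen)
    return assignment


if __name__ == "__main__":
    # Two shards, two replicas each (illustrative numbers only).
    backlog = [[30.0, 10.0], [5.0, 20.0]]
    print(route_query([8.0, 12.0], backlog))
    print(backlog)
```

The design choice worth noting is that the scheduler never needs to observe how long a query actually took before routing the next one: the predicted times alone keep the workload estimates up to date.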

At SIGIR this year we also present recent work examining both efficiency and effectiveness. Dynamic pruning strategies, such as WAND, can increase efficiency by omitting the scoring of documents that can be guaranteed not to make the top-K retrieved set - a feature known as safeness. Broder et al. showed how WAND could be made more efficient by relaxing the safeness guarantee, with little impact on the top-ranked documents. Through experiments on the TREC ClueWeb09 corpus with 33 query-dependent and query-independent features, our poster Effect of Dynamic Pruning Safety on Learning to Rank Effectiveness shows that relaxing safeness to aid efficiency can have an unexpectedly large impact on retrieval effectiveness when combined with modern learning to rank models, in contrast to the earlier work by Broder et al. In particular, we show that the inherent bias of unsafe WAND towards documents with lower docids can markedly impact the effectiveness of learned models.
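The difference between safe and relaxed pruning can be illustrated with a toy sketch in Python. This is not WAND itself: it omits posting lists and pivot selection, and only mimics the threshold test, assuming each document arrives in docid order with a known score upper bound. The function name `pruned_top_k`, the `relax` parameter and the example numbers are all hypothetical. With `relax=1.0` a document is skipped only if its upper bound cannot beat the current K-th best score, which is safe; with `relax` above 1.0 more documents are skipped, and documents that entered the heap early (those with lower docids) are harder to displace.

```python
import heapq


def pruned_top_k(docs, k, relax=1.0):
    """Toy threshold pruning over documents in ascending docid order.

    docs:  list of (docid, upper_bound, score), with upper_bound >= score.
    relax: 1.0 keeps the top-k safe; values above 1.0 prune more
           aggressively and may drop true top-k documents.
    Returns (ranked docids, number of documents fully scored).
    """
    heap = []  # min-heap of (score, docid) holding the current top-k
    scored = 0
    for docid, upper_bound, score in docs:
        if len(heap) == k:
            threshold = heap[0][0]  # current k-th best score
            if upper_bound <= relax * threshold:
                continue  # skipped without computing its score
        scored += 1
        if len(heap) < k:
            heapq.heappush(heap, (score, docid))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, docid))
    ranked = sorted(heap, key=lambda entry: (-entry[0], entry[1]))
    return [docid for _, docid in ranked], scored


if __name__ == "__main__":
    docs = [(1, 5.0, 3.0), (2, 6.0, 4.0), (3, 4.5, 4.4), (4, 9.0, 8.0)]
    print(pruned_top_k(docs, k=2, relax=1.0))
    print(pruned_top_k(docs, k=2, relax=1.6))
```

In this made-up example the relaxed run scores one fewer document, but returns document 2 in place of the higher-scoring document 3, which is the kind of low-docid bias described above.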

Those interested in the medical search domain, in particular participants in the TREC Medical track, will be interested in our paper entitled Exploiting Term Dependence while Handling Negation in Medical Search. We show that it is important to handle negation in medical records - in particular, when searching for cohorts (groups of patients) with specific symptoms, our approach ensures that patients known not to have exhibited particular symptoms are not retrieved. Our results demonstrate that appropriate negation handling can increase retrieval effectiveness, particularly when the dependence between negated terms is considered using a term dependence model from the Divergence From Randomness framework.

Our poster On Building a Reusable Twitter Corpus tackles an important issue raised during the creation of the Tweets11 dataset as part of the TREC Microblog track, namely how reusable Tweets11 is, given the dynamics of Twitter. Our poster shows that corpus degradation due to deleted tweets does not affect the ranking of systems that participated in the TREC 2011 Microblog track. Meanwhile, we are also demonstrating the first release of a new extension to our Terrier IR platform, namely CrowdTerrier, an infrastructure addition to Terrier that enables relevance assessments to be created in a fast semi-automatic manner using crowdsourcing. CrowdTerrier will be made available for download soon.

Finally, together with a group representing six open source IR systems, we are involved in the organisation of a SIGIR'12 workshop on Open Source Information Retrieval. The workshop aims to provide a forum for users and authors of open source IR tools to get together, to discuss their joint future, and to work together to build OpenSearchLab, an open source, live and functioning online web search engine for research purposes.

Tuesday, June 19, 2012

A SMART way to Search your City

TerrierTeam is currently expanding its outreach into social and sensor-based search systems as part of the ongoing SMART EU-funded project (FP7 287583). SMART aims to develop an open source search framework for multimedia data stemming from the physical world and social streams such as Twitter. The end-goal is to be able to answer location- and time-sensitive queries such as “where can I go to listen to live music in the city centre tonight?” or “where are my friends hanging out in the city?” by augmenting social media signals with live city sensor information.

Our role in the SMART project is to develop fast and effective real-time search from the flood of information provided by social and city sensor streams on top of our open-source Terrier information retrieval platform. Indeed, a real-time Twitter search demo illustrating incremental and distributed indexing and real-time retrieval in Terrier is now available. Try it at:

SMART has seen wide-ranging national, European and international coverage in online and print news media over the last week. Indeed, we are tracking over 100 articles and counting! Some sample articles can be found below:
A more detailed list of recent press coverage can be found at