Wednesday, December 24, 2008

Terrier 2.2 released, with support for Hadoop Map Reduce indexing

I am pleased to announce that Terrier 2.2 was released just before Christmas. While I have chosen to increase only the minor version number for this release, it is a substantial update, consisting of new support for Hadoop, a Hadoop Map Reduce indexing system, and various minor improvements and bug fixes. (I reserve major version number bumps for index format changes.)

Our Map Reduce distributed indexing strategy builds upon the single-pass indexing strategy first released in Terrier 2.0. When deployed on a Hadoop cluster, Terrier can index large collections of data in a distributed fashion, splitting the indexing process across multiple Map and Reduce tasks, which can run on different nodes in the cluster.

In particular, the input data files for the collection are split across many Map tasks. Each Map task indexes its allocated data files using a normal Collection implementation. Posting lists are built in memory, in compressed form. Each time memory is exhausted, these miniature posting lists are emitted from the Map task.
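To make the Map side concrete, here is a minimal Python sketch of the idea. It is not Terrier's code (Terrier is written in Java): the function name map_task, the record layout (term, map_id, flush_no, postings) and the max_postings_in_memory budget are all assumptions made for illustration, and the compression step is left out entirely.

```python
from collections import Counter, defaultdict


def map_task(map_id, documents, max_postings_in_memory):
    """Index (doc_id, text) pairs and return the records one Map task would emit.

    Hypothetical record layout: (term, map_id, flush_no, postings), where
    postings is a list of (doc_id, term_frequency) pairs in document order.
    A simple posting count stands in for "memory is exhausted".
    """
    emitted = []
    postings = defaultdict(list)  # term -> [(doc_id, tf), ...]
    held = 0                      # postings currently held in memory
    flush_no = 0                  # how many times this task has flushed

    def flush():
        # Emit one miniature posting list per term seen since the last flush.
        for term in sorted(postings):
            emitted.append((term, map_id, flush_no, postings[term]))

    for doc_id, text in documents:
        counts = Counter(text.lower().split())
        for term, tf in counts.items():
            postings[term].append((doc_id, tf))
        held += len(counts)
        if held >= max_postings_in_memory:
            flush()
            postings = defaultdict(list)
            held = 0
            flush_no += 1

    if postings:
        flush()  # emit whatever is left at the end of the input split
    return emitted
```

Calling map_task(7, [(0, "a b a"), (1, "b c"), (2, "a")], 3) flushes once after the second document and once more at the end of the input, so the term "a" is emitted twice, with flush numbers 0 and 1.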

The Reduce task is responsible for aggregating the posting lists for the various terms. First, the Reduce input keys are sorted by term, and the values are sorted by source Map task, to ensure that the posting lists for a given term are processed in the correct order. Then, for each term, the temporary posting lists (the reduce input values) are merged into the final compressed inverted index.
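The Reduce-side merge can be sketched in the same spirit. Again this is an illustration, not Terrier's implementation: the record layout (term, map_id, flush_no, postings) is hypothetical, the sort that Hadoop would perform during the shuffle is done here with an explicit sorted call, and I assume that Map task ids follow document order, so that concatenating the partial lists keeps document ids ascending. Compression of the final lists is omitted.

```python
from itertools import groupby


def reduce_task(records):
    """Merge partial posting lists into one posting list per term.

    records: iterable of (term, map_id, flush_no, postings) tuples, in any
    order. Sorting by (term, map_id, flush_no) mimics the ordering that the
    framework would otherwise guarantee before the reducer sees its input.
    """
    ordered = sorted(records, key=lambda r: (r[0], r[1], r[2]))
    index = {}
    for term, group in groupby(ordered, key=lambda r: r[0]):
        merged = []
        for _term, _map_id, _flush_no, postings in group:
            merged.extend(postings)  # partial lists arrive in document order
        index[term] = merged
    return index
```

Because the partial lists for a term arrive already ordered by Map task and flush, the merge is a plain concatenation; no per-posting comparison is needed.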

The indices created using the Map Reduce indexer are standard Terrier indices. Moreover, by controlling the number of Reduce tasks, the final index can be partitioned into separate indices, in the local inverted file layout (document partitioning). With a different partitioning scheme, global inverted file layout (term partitioning) would also be possible.
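The difference between the two layouts comes down to how Map output is routed to Reduce tasks. The sketch below is an assumption about how such routing could look, not a description of Terrier's partitioner: both function names and their parameters are hypothetical. Document partitioning sends contiguous ranges of Map tasks to the same reducer, so each reducer builds a complete index over a subset of the documents; term partitioning routes by term, so each reducer holds the full posting lists for a subset of the vocabulary.

```python
import zlib


def document_partition(term, map_id, num_maps, num_reducers):
    """Local inverted file layout: route by the Map task that saw the document.

    Contiguous ranges of Map task ids go to the same reducer, so every
    reducer indexes its own slice of the collection. The term is ignored.
    """
    return map_id * num_reducers // num_maps


def term_partition(term, map_id, num_maps, num_reducers):
    """Global inverted file layout: route by term, regardless of Map task.

    A CRC32 checksum is used as a stable hash (an arbitrary choice for this
    sketch), so a given term always lands on the same reducer.
    """
    return zlib.crc32(term.encode("utf-8")) % num_reducers
```

With 4 Map tasks and 2 Reduce tasks, document_partition sends Map tasks 0 and 1 to reducer 0 and Map tasks 2 and 3 to reducer 1, whatever the term.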

You can see the detailed list of changes for Terrier 2.2 in the documentation.
