REVIEW - Lucene in Action


Lucene in Action


Otis Gospodnetić, Erik Hatcher



Manning Publications (2005)




Derek Jones


February 2007



Lucene is an open source search engine written in Java (it requires that the pages to index and search have already been collected, e.g., by a web crawler such as Nutch). To be exact, Lucene is a search engine library containing classes and methods that programmers can call to create their own customized search engine.

This book is essentially a 'how-to' guide for creating a search engine using Lucene. I found it to be very readable and the extensive code snippets were on the whole useful (i.e., not just padding).

The discussion starts with how to index web pages, the various tradeoffs involved, and how Lucene's options can be used to tune an index to have the desired characteristics. This is followed by a very interesting discussion of how to parse search queries, dealing with issues such as what constitutes a token and possible ways of handling different forms of the same root word (e.g., past/future tense, singular vs. plural). Subsequent chapters deal with more advanced topics, including extending the search engine, performance testing, and parsing common document formats. The book ends with a discussion of various applications, written by people involved in implementing them. The thickness of the book is kept down by not duplicating the online documentation with a detailed listing of the API.
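
To give a flavour of the token and root-word issues mentioned above, here is a minimal sketch in plain Java. It is not code from the book and it does not use the Lucene API; the class name TinyIndex, its methods and the deliberately crude suffix-stripping rules are all hypothetical, chosen only to show why "crawling", "crawled" and "crawls" might be made to match the same index entry.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical toy inverted index; NOT Lucene, just an illustration.
class TinyIndex {
    // Maps each stemmed term to the ids of the documents that contain it.
    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    // One possible answer to "what constitutes a token": lowercase runs
    // of letters and digits, with everything else treated as a separator.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Crude suffix stripping, standing in for a real stemming algorithm.
    // It gets many English words wrong; that is part of the tradeoff.
    static String stem(String token) {
        if (token.length() > 4 && token.endsWith("ing")) {
            return token.substring(0, token.length() - 3);
        }
        if (token.length() > 3 && token.endsWith("ed")) {
            return token.substring(0, token.length() - 2);
        }
        if (token.length() > 3 && token.endsWith("s") && !token.endsWith("ss")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    // Index a document: tokenize, stem, and record the document id per term.
    void add(int docId, String text) {
        for (String token : tokenize(text)) {
            postings.computeIfAbsent(stem(token), k -> new TreeSet<>()).add(docId);
        }
    }

    // Look up a single query word after applying the same normalization
    // that was used at indexing time.
    Set<Integer> search(String word) {
        String term = stem(word.toLowerCase(Locale.ROOT));
        return postings.getOrDefault(term, Collections.emptySet());
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.add(0, "Crawling web pages");
        index.add(1, "The crawler crawled two pages.");
        System.out.println(index.search("crawls"));
        System.out.println(index.search("crawler"));
    }
}
```

The point of the sketch is that the query and the documents must pass through the same tokenizing and stemming steps, otherwise a search for "crawls" would never find a page containing "crawled". A real engine has to make these choices far more carefully than the three suffix rules used here.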

If you are building a search engine using Lucene, this book is a must-have. Even if you don't plan to build your own search engine, this book provides a fascinating discussion of the nuts-and-bolts issues involved in creating one.

Book cover image courtesy of Open Library.