
Wednesday, July 9, 2008

Tackling Lucene.Net - Part I

I was recently looking for a way to search our entire Team Foundation Server source control repository, first by evaluating third-party applications and, failing that, by building one in-house.

Extensive searching revealed only a couple of leads. The first was a product from Koders (which no longer seems to be available after their buy-out). It looked very promising until I realized that the search client licensing required the software to "call home" to Koders every time it is used, which I was not at all comfortable with (protecting our intellectual property is of paramount concern).

On the other end of the spectrum was CS2, an open-source pet project someone had started as an experiment. Unfortunately it never became more than an experiment, but examining its use of Lucene.Net opened the door for me to start writing an in-house TFS indexing & search application. It was somewhat shocking, though, that there were no other options for something I would have expected to be in demand. (For the record, a full-text index on the SQL database is not possible because TFS compresses all source files before storing them as BLOBs.)

Finding basic examples of using Lucene.Net is easy enough. CS2 alone was immensely helpful in that regard, and it was sufficient to let me throw together a very basic indexer, a web service API for searching, and a Visual Studio add-in as a front-end, and to put it all out the door for general use.

The indexer pulls down the latest source from TFS every morning, and adds, re-indexes, or removes documents (of specified file types) as needed. The VS add-in calls a web service to perform searches, which returns the TFS path of each file containing "hits," along with an HTML-based preview of the "hits" within each file. Currently the only data being indexed is the full text contents of text-based files. Double-clicking a result takes you to its location in the Source Control Explorer. Very basic.
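
The daily add / re-index / remove decision described above can be sketched independently of Lucene.Net. The following is only an illustration in Python, not the actual indexer: the function name `plan_index_updates`, the `IndexPlan` type, the extension list, and the idea of comparing a per-file version stamp (for example a changeset number) are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath

# Hypothetical list of file types; the real set is whatever the indexer is configured with.
INDEXED_EXTENSIONS = {".cs", ".sql", ".xml", ".config"}


@dataclass
class IndexPlan:
    to_add: list = field(default_factory=list)
    to_reindex: list = field(default_factory=list)
    to_remove: list = field(default_factory=list)


def plan_index_updates(indexed, current, extensions=INDEXED_EXTENSIONS):
    """Compare what is already indexed with the latest source tree.

    Both arguments map a TFS path to a version stamp (assumed here to be
    something like a changeset number or a content hash).
    """
    # Only files of the specified types are candidates for indexing.
    wanted = {
        path: version
        for path, version in current.items()
        if PurePosixPath(path).suffix.lower() in extensions
    }

    plan = IndexPlan()
    for path, version in sorted(wanted.items()):
        if path not in indexed:
            plan.to_add.append(path)        # new file: add a document
        elif indexed[path] != version:
            plan.to_reindex.append(path)    # changed file: delete, then re-add
    # Anything indexed earlier that is no longer wanted gets removed.
    plan.to_remove = sorted(p for p in indexed if p not in wanted)
    return plan
```

With a plan like this in hand, the indexer only has to touch the documents that actually changed since the previous morning, rather than rebuilding the whole index.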


Now that I've somewhat sated the appetite for searching the repository, I've been looking at ways to enhance the experience. My first goal is to include position data - i.e., line numbers - as part of the search results. Sounds pretty simple, sure, until you try to make sense of the Lucene.Net API documentation: there are major, major gaps in the documentation for some key interfaces and classes. Searching Google to try to fill in those gaps was less than helpful, too. It seems that either no one has ever tackled anything more than the most basic functionality of Lucene.Net, or else they have just never thought to document it.

I will change that somewhat in my next post, when I show an example of how to include position data in an index, and how to retrieve that data along with the search results.
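
Separately from how Lucene.Net stores position data, the final step of turning a hit's character offset into a line number is simple enough to sketch. This is a rough Python illustration only; it assumes the character offset of each hit is already available from somewhere, and the helper names `line_start_offsets` and `offset_to_line` are made up for the example.

```python
from bisect import bisect_right


def line_start_offsets(text):
    """Return the character offset at which each line of `text` begins."""
    starts = [0]
    for i, ch in enumerate(text):
        if ch == "\n":
            starts.append(i + 1)
    return starts


def offset_to_line(starts, offset):
    """Map a 0-based character offset to a 1-based line number."""
    # The line containing `offset` is the last line that starts at or before it.
    return bisect_right(starts, offset)


source = "using System;\nclass Foo\n{\n}\n"
starts = line_start_offsets(source)
hit_line = offset_to_line(starts, source.index("Foo"))  # "Foo" is on line 2
```

The line-start table is computed once per file, so each hit costs only a binary search.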

1 comment:

Matt Graney said...

Apologies for the blatant product plug, but based on what you describe, you may be interested in checking out Krugle. It has a similar high level promise to that which you described for Koders, with an option to turn off the "phone home" capability. Krugle already does the sort of things you're trying to build on your own, and a whole lot more (including code parsing to enable, say, searches for function definitions vs. function calls).
Disclosure: I work for Krugle, your blog popped up on my Google alerts.