Searching A Field With Digits In Zend Framework’s Lucene Component

Recently I ran into a bug in one of our applications using Zend_Search_Lucene: the same document was showing up multiple times in search results. In fact, many different documents were each showing up more than once. I tracked it down to the routine that updates indexed documents. Zend_Search_Lucene can’t update an indexed document in place; instead, you delete the indexed document and then insert a new, updated one. To delete a document you must first search for it by a previously indexed field and then, once it is found, delete it using its internal document identifier. The problem seemed to be that documents were not being found, and therefore not deleted, when they were updated, so another copy of the same document accumulated on each update.
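The delete-then-insert pattern described above can be sketched roughly as follows. This is a minimal illustration, not the routine from our application: the function name replaceDocument is hypothetical, and it assumes $index is an open Zend_Search_Lucene index, $query is a query string or query object that identifies the old document, and $document is the new Zend_Search_Lucene_Document.

```php
<?php
// Hypothetical helper: replace an indexed document by deleting every hit
// for $query and then adding the new document.
// Assumes $index is a Zend_Search_Lucene index (or any object that offers
// the same find/delete/addDocument/commit methods).
function replaceDocument($index, $query, $document)
{
    // find() returns an array of hits; each hit exposes the internal
    // document identifier as its public "id" property.
    $hits = $index->find($query);

    foreach ($hits as $hit) {
        // Deletion works on the internal identifier, not on a field value.
        $index->delete($hit->id);
    }

    // Insert the updated version and flush the changes to the index.
    $index->addDocument($document);
    $index->commit();

    // Number of old copies removed; 0 on an update means the lookup
    // failed and a duplicate has just been created.
    return count($hits);
}
```

Returning the number of deleted hits is a design choice made for this sketch: logging it on every update would have made the failed lookups visible much sooner.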

The field I was indexing, and subsequently using to find and delete documents, was a 40-character SHA-1 hash. While tracking down the bug, I discovered that only documents whose SHA-1 hash began with a digit were being duplicated (in other words, they were not being found when I tried to delete them), while documents whose hash began with a letter were not (in other words, they were being found and deleted).

A Stack Overflow post on searching numbers with Zend_Search_Lucene had the information I needed to fix the bug. First, I changed the hash field from a text field to a keyword field, which prevented it from being tokenized (this, of course, required me to delete the existing index and re-index every document). Second, when searching on the hash field, I replaced the default Text analyzer with the TextNum analyzer. These two changes seem to have done the trick: I haven’t seen any duplicate search results after running several index updates.
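Here is a rough sketch of those two changes. It assumes the Zend Framework 1 classes are already loadable (for example through Zend_Loader_Autoloader). The helper names buildDocument and findByHash, and the 'body' field, are made up for illustration; only the 'hash' field corresponds to what is described above.

```php
<?php
// Assumes the Zend Framework 1 classes are already loaded or autoloadable.

// Hypothetical helper: build a document whose hash is a keyword field.
function buildDocument($hash, $body)
{
    $doc = new Zend_Search_Lucene_Document();

    // Keyword fields are stored and indexed as a single term,
    // without being tokenized.
    $doc->addField(Zend_Search_Lucene_Field::Keyword('hash', $hash));

    // 'body' is an illustrative field name, not from the real application.
    $doc->addField(Zend_Search_Lucene_Field::UnStored('body', $body));

    return $doc;
}

// Hypothetical helper: look up documents by their SHA-1 hash.
function findByHash($index, $hash)
{
    // TextNum treats digits as part of a token, whereas the Text
    // analyzer only keeps letters. Note that this changes the default
    // analyzer globally, not just for this one query.
    Zend_Search_Lucene_Analysis_Analyzer::setDefault(
        new Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum()
    );

    return $index->find('hash:' . $hash);
}
```

Because setDefault() affects every later query and every later indexing call in the same request, an application that relies on a different analyzer elsewhere would probably want to restore the previous one after the lookup.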

This entry was written by Bradley Holt, posted on April 17, 2010 at 5:42 pm, filed under Uncategorized and tagged Lucene, PHP, Zend Framework.