Indexing site contents using Lucene .NET

17 Mar 2018

Most discussions about adding a search feature to an application end up at a familiar name: Google Search Appliance (GSA). I wouldn't argue against that choice, as GSA is an excellent general-purpose search product, built by the search giant itself, and flexible and scalable enough to meet most business needs. It offers a wide range of features such as translations, auto-suggestions, synonyms, recognition of content and entity types, intelligent learning that evolves to better match user search preferences and behaviour, an administration console, security, and more.

All of those benefits come at a price, though, based on the volume of documents/data you wish to index and keep in Google's data centers, the number of instances serving requests, ongoing server maintenance, future upgrades, and so on. Even considering all the costs involved, it remains a reliable solution for enterprises that are willing to pay and ready to adopt a platform-as-a-service model.

Rather than arguing against GSA, this article covers an alternative solution based on Apache Lucene.Net, a full-text search engine framework. It is not a complete search solution or product like GSA; it is a code library with APIs that let you add search capabilities to an application you develop.

Apache Lucene.Net is

  • an open-source library
  • released under the Apache 2.0 license, so it can be used commercially with few restrictions
  • originally written in Java and later ported to .NET (actively developed and maintained)
  • compatible with .NET Framework 3.5+ and .NET Standard
  • distributed as multiple smaller NuGet packages that you can compose based on your application's requirements.

Why Lucene?

  • It offers complete programmatic control over what data to analyze, index and store, or to analyze without storing.
  • It doesn't crawl automatically; you feed it the data, so you decide what appears in the results and what doesn't.
  • It is cost-effective.
  • It can be set up entirely within a company's own premises to index secure data and files, so the data never has to leave the building.
  • You can implement various flavours of search functionality, such as product or catalogue search, file search, auto-complete suggestions and site search.
  • You can search against your own custom tags, categories or marker data (facets); a minimal sketch follows this list.
  • Lucene is used internally by many software giants and products, such as
    • Microsoft, on its Azure cloud platform,
    • Sitecore, Sitefinity, Umbraco and Orchard, in their CMS platforms,
    • Stack Overflow,
    • Autodesk,
    • Powersearch,
    • Oxford Analytics,
    • RavenDB, Lucandra.net and so on.
  • You can power a crawler bot with Lucene's search capability to collect, analyse and monitor a site's content or keywords for SEO purposes.
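
To illustrate the facets point above, here is a minimal sketch of faceted indexing and counting. It assumes the separate Lucene.Net.Facet NuGet package; the index/taxonomy paths and field values are hypothetical, and this is not part of the helper class shown later in the article:

using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Facet;
using Lucene.Net.Facet.Taxonomy;
using Lucene.Net.Facet.Taxonomy.Directory;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

using (var indexDir = FSDirectory.Open(@"C:\search\index"))      // hypothetical paths
using (var taxoDir = FSDirectory.Open(@"C:\search\taxonomy"))
{
    var config = new FacetsConfig();

    // 1. Index a document with a "Category" facet alongside its regular fields.
    //    The taxonomy writer records the facet hierarchy in a side index.
    using (var writer = new IndexWriter(indexDir,
        new IndexWriterConfig(LuceneVersion.LUCENE_48, new StandardAnalyzer(LuceneVersion.LUCENE_48))))
    using (var taxoWriter = new DirectoryTaxonomyWriter(taxoDir))
    {
        var doc = new Document
        {
            new TextField("Title", "Indexing site contents using Lucene .NET", Field.Store.YES),
            new FacetField("Category", "Search")
        };
        writer.AddDocument(config.Build(taxoWriter, doc));
        writer.Commit();
        taxoWriter.Commit();
    }

    // 2. Count matching documents per category across the whole index.
    using (var reader = DirectoryReader.Open(indexDir))
    using (var taxoReader = new DirectoryTaxonomyReader(taxoDir))
    {
        var collector = new FacetsCollector();
        FacetsCollector.Search(new IndexSearcher(reader), new MatchAllDocsQuery(), 10, collector);
        Facets facets = new FastTaxonomyFacetCounts(taxoReader, config, collector);
        FacetResult categories = facets.GetTopChildren(10, "Category");
    }
}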

In this article, I will cover Lucene's basic indexing and searching capabilities and provide a helper class to get you started with integrating the Lucene search library.

NuGet packages required for basic indexing and querying of indexed data
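
Going by the using directives in the snippet below, these three NuGet packages should cover basic indexing and querying (Lucene.NET 4.8 was still in prerelease at the time of writing, hence the -Pre flag in the Package Manager Console):

Install-Package Lucene.Net -Pre
Install-Package Lucene.Net.Analysis.Common -Pre
Install-Package Lucene.Net.QueryParser -Pre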

The following code snippet comes straight from the code base of this website; it provides the basic site search feature that you see in this site's top navigation.

using Com.Davidsekar.Search.Model;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers.Classic;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;
using System;
using System.Collections.Generic;
using System.Linq;

namespace Com.Davidsekar.Search
{
    // Helper that wraps Lucene.NET's indexing and searching APIs for site content.
    public class SearchCore
    {
        // Index field names; the array order is relied on by the mappings below.
        private string[] _searchFields = { "Id", "Title", "Keywords", "Location", "Content", "Description" };
        private string _sIndexPath;

        public SearchCore(string folderPath)
        {
            _sIndexPath = folderPath;
        }

        // Convenience overload to index a single item.
        public bool GenerateSearchIndex(SearchItem item)
        {
            List<SearchItem> items = new List<SearchItem>
            {
                item
            };
            return GenerateSearchIndex(items.AsQueryable(), false);
        }

        // Adds or updates the given items in the index; pass bClearAll = true to rebuild from scratch.
        public bool GenerateSearchIndex(IQueryable<SearchItem> items, bool bClearAll)
        {
            using (IndexWriter indexWriter = GetIndexWriter(bClearAll))
            {
                foreach (var item in items)
                {
                    var doc = new Document
                    {
                        // Id must be stored (Store.YES) so it can be read back when mapping
                        // results; with Store.NO, doc.Get("Id") would return null.
                        new StringField(_searchFields[0], item.Id.ToString(), Field.Store.YES),
                        new StringField(_searchFields[1], item.Title, Field.Store.YES),
                        // Keywords and Content are searchable but not stored, keeping the index small.
                        new TextField(_searchFields[2], item.Keywords, Field.Store.NO),
                        new TextField(_searchFields[3], item.MediaLocation, Field.Store.YES),
                        new TextField(_searchFields[4], item.Content, Field.Store.NO),
                        new TextField(_searchFields[5], item.Description, Field.Store.YES)
                    };

                    // Upsert: deletes any existing document whose Id term matches, then adds the new one.
                    indexWriter.UpdateDocument(new Term(_searchFields[0], item.Id.ToString()), doc);
                }

                indexWriter.Commit();
            }
            return true;
        }

        // Opens an IndexWriter over the on-disk index, creating the folder on first use.
        public IndexWriter GetIndexWriter(bool recreate = false)
        {
            if (!System.IO.Directory.Exists(_sIndexPath))
                System.IO.Directory.CreateDirectory(_sIndexPath);

            var dir = FSDirectory.Open(_sIndexPath);
            var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);
            var config = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer)
            {
                OpenMode = recreate ? OpenMode.CREATE : OpenMode.CREATE_OR_APPEND
            };
            return new IndexWriter(dir, config);
        }

        // Runs the query against the index and returns one page of results.
        public SearchResult SearchIndex(string searchQuery, int skip, int limit, string searchField = "")
        {
            // Guard against null input and wildcard-only queries, which would match everything.
            if (string.IsNullOrEmpty(searchQuery) ||
                string.IsNullOrEmpty(searchQuery.Replace("*", "").Replace("?", "")))
                return new SearchResult() { TotalCount = 0 };

            Directory dir = FSDirectory.Open(_sIndexPath);

            using (IndexReader reader = DirectoryReader.Open(dir))
            {
                var searcher = new IndexSearcher(reader);
                // Lucene must fetch the top (skip + limit) hits; the page is then sliced in memory below.
                var hits_limit = skip + limit;
                using (var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48))
                {
                    // Parse against one named field if requested, otherwise across all indexed fields.
                    QueryParser parser;
                    if (!string.IsNullOrEmpty(searchField))
                        parser = new QueryParser(LuceneVersion.LUCENE_48, searchField, analyzer);
                    else
                        parser = new MultiFieldQueryParser(LuceneVersion.LUCENE_48, _searchFields, analyzer);

                    var searchResult = new SearchResult();
                    var query = ParseQuery(searchQuery, parser);
                    var topdocs = searcher.Search(query, null, hits_limit, Sort.RELEVANCE);
                    searchResult.TotalCount = topdocs.TotalHits;

                    var selectedHits = topdocs.ScoreDocs.Skip(skip).Take(limit).ToList();
                    searchResult.Items = _mapLuceneToDataList(selectedHits, searcher).ToList();
                    return searchResult;
                }
            }
        }

        // Maps a stored Lucene document back to the data model. Only fields saved with
        // Field.Store.YES can be read back; Keywords and Content were indexed with
        // Field.Store.NO, so they cannot be retrieved here.
        private SearchItem _mapLuceneDocumentToData(Document doc)
        {
            return new SearchItem
            {
                Id = Convert.ToInt32(doc.Get(_searchFields[0])),
                Title = doc.Get(_searchFields[1]),
                Description = doc.Get(_searchFields[5]),
                MediaLocation = doc.Get(_searchFields[3])
            };
        }

        private IEnumerable<SearchItem> _mapLuceneToDataList(IEnumerable<Document> hits)
        {
            return hits.Select(_mapLuceneDocumentToData);
        }

        private IEnumerable<SearchItem> _mapLuceneToDataList(IEnumerable<ScoreDoc> hits,
            IndexSearcher searcher)
        {
            return hits.Select(hit => _mapLuceneDocumentToData(searcher.Doc(hit.Doc)));
        }

        // Tries to parse the raw query; on invalid syntax, escapes it and searches for the literal text.
        private Query ParseQuery(string searchQuery, QueryParser parser)
        {
            Query query;
            try
            {
                query = parser.Parse(searchQuery.Trim());
            }
            catch (ParseException)
            {
                query = parser.Parse(QueryParser.Escape(searchQuery.Trim()));
            }
            return query;
        }
    }
}

The data model used to pass the search result data to other layers:

using System.Collections.Generic;

namespace Com.Davidsekar.Search.Model
{
    public class SearchItem
    {
        public int Id { get; set; }
        public string Title { get; set; }
        public string Content { get; set; }
        public string MediaLocation { get; set; }
        public string Keywords { get; set; }
        public string Description { get; set; }
    }

    public class SearchResult
    {
        public List<SearchItem> Items { get; set; }
        public int TotalCount { get; set; }
    }
}
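
Putting it all together, here is a minimal usage sketch; the index path and the sample field values below are hypothetical:

using System;
using Com.Davidsekar.Search;
using Com.Davidsekar.Search.Model;

// Point the helper at a writable folder on the server (hypothetical path).
var search = new SearchCore(@"C:\site\App_Data\SearchIndex");

// Index (or re-index) a page; the helper's UpdateDocument call makes this an upsert keyed on Id.
search.GenerateSearchIndex(new SearchItem
{
    Id = 1,
    Title = "Indexing site contents using Lucene .NET",
    Keywords = "lucene, search, indexing",
    MediaLocation = "/blog/lucene-net-site-search",
    Content = "Full article text goes here...",
    Description = "How to add a basic site search with Lucene.NET."
});

// Query across all fields and take the first page of ten results.
SearchResult result = search.SearchIndex("lucene", skip: 0, limit: 10);
Console.WriteLine($"{result.TotalCount} hit(s)");
foreach (SearchItem item in result.Items)
    Console.WriteLine($"{item.Title} -> {item.MediaLocation}");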