Best Alternatives to Lucene

Lucene is the godfather of site search. However, it has long since been surpassed by two Lucene derivatives, Elasticsearch and Solr. Newer non-Lucene search technology is here too. In this article, we'll walk through why companies may still choose Lucene and share some newer search alternatives.

Introduction

To borrow an analogy from Greek mythology, Apache Lucene is the Cronus of search: it has spawned one of the most dominant rises in open source software history. Its two main offspring, Elasticsearch and Solr, have outcompeted Lucene itself and dominated the search landscape for years.

Lucene isn’t entirely irrelevant — it’s still used everywhere for home-grown search, enterprise search, recommendation engine projects, and many other popular products — but increasingly people are leaving Lucene for newer search technologies that promise to be faster, easier to work with, and significantly more powerful.

[Figure: Google Trends for Elasticsearch]

Lucene history

The Lucene open source software project was first released in 1999. It joined the Apache Software Foundation's Jakarta project in 2001 and became a top-level Apache project in 2005.

Lucene builds an inverted index of your data for full text information retrieval — essentially it indexes your data by keyword — and provides libraries for features such as typo tolerance, sorting, ranking, and much more.
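To make the idea of an inverted index concrete, here is a minimal sketch in Python. It is not Lucene's implementation: the tokenizer, the sample documents, and the function names are all assumptions made for illustration, and a real analyzer would also handle punctuation, stemming, and typos.

```python
from collections import defaultdict

def tokenize(text):
    # Lowercase and split on whitespace (a deliberately naive analyzer).
    return text.lower().split()

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, query):
    """Return IDs of documents that contain every query term (AND semantics)."""
    postings = [set(index.get(term, [])) for term in tokenize(query)]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

# Hypothetical sample data.
docs = {
    1: "red running shoes",
    2: "blue running jacket",
    3: "red rain jacket",
}
index = build_inverted_index(docs)
print(search(index, "red jacket"))
```

Because the index is keyed by term, a query only touches the posting lists for its own words instead of scanning every document.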

With its exceptional documentation and large community, Lucene remains a search workhorse. Lucene’s core libraries offer just about everything you need to build a search application. What it lacks can be found in the many public libraries that fill in missing functionality such as crawling. If you’re looking for a free, open source library and SDK to build search, Lucene is not a bad choice. Enterprises may like that it’s written in Java, too, for tying into legacy projects.

But if you need something more full-featured and modern, there are other options. In fact, it was the lack of some basic search features that allowed two newer projects, Solr and Elasticsearch, to flourish. Both of these Lucene derivatives are essentially full-featured wrappers that hide Lucene’s libraries behind more powerful and easier-to-implement APIs. They also allow Lucene to scale to very large datasets as a distributed system (something Lucene lacks).

Solr is an Apache open source project as well, and Elasticsearch was released under the Apache 2.0 license until 2021. Both benefit from a large community of developers. (However, due to recent licensing changes, it is questionable whether Elasticsearch remains open source.)

Limitations of Lucene

We built Sajari because we felt Lucene couldn’t deliver on our key goals, which were based around fully real-time data updates and complex machine-learning-based retrieval. Other new search competitors have entered the market, too, further pushing Lucene to the side. Lucene has three key deficiencies that will become more apparent in time:

  • Performance
  • Machine learning
  • Time to value

Performance

Whenever a record changes in your database, Lucene will store the new value, but it still hangs onto the old value.

This slows down queries, because Lucene needs to run the search and then check for changed records on the way out. If you make a small change to an item, it stores a flag to say that the old record is deleted and creates an entirely new record in memory. Periodically the memory buffer fills up, the differences are reconciled, and all the files with any changes are merged and rewritten to disk.
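The update behavior described above can be sketched as a toy append-only store. This is an illustration of the general pattern only, not Lucene's actual data structures: the class name, the method names, and the in-memory lists standing in for files are all assumptions.

```python
class SegmentedIndex:
    """Toy append-only store: an update writes a new copy and flags the old one."""

    def __init__(self):
        self.segments = [[]]   # each segment holds (doc_id, version, record) tuples
        self.deleted = set()   # (doc_id, version) pairs flagged as deleted
        self.latest = {}       # doc_id -> current version number

    def upsert(self, doc_id, record):
        old = self.latest.get(doc_id)
        if old is not None:
            # Flag the old copy as deleted instead of rewriting it in place.
            self.deleted.add((doc_id, old))
        version = 0 if old is None else old + 1
        self.latest[doc_id] = version
        self.segments[-1].append((doc_id, version, record))

    def flush(self):
        """Start a new segment, as if the in-memory buffer had filled up."""
        self.segments.append([])

    def search(self, predicate):
        # Queries still visit stale copies and filter them out on the way out.
        return [rec for seg in self.segments for (d, v, rec) in seg
                if (d, v) not in self.deleted and predicate(rec)]

    def merge(self):
        """Rewrite all segments into one, dropping the flagged copies."""
        live = [(d, v, rec) for seg in self.segments for (d, v, rec) in seg
                if (d, v) not in self.deleted]
        self.segments = [live]
        self.deleted.clear()

    def stored_copies(self):
        return sum(len(seg) for seg in self.segments)

# Hypothetical usage: one price change leaves two stored copies until a merge.
idx = SegmentedIndex()
idx.upsert("sku-1", {"price": 10})
idx.flush()
idx.upsert("sku-1", {"price": 12})
print(idx.stored_copies(), idx.search(lambda r: True))
idx.merge()
print(idx.stored_copies())
```

In this sketch the cost of an update is deferred: writes are cheap, but every query pays for the stale copies until a merge rewrites the data.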

This is fine for projects such as log analysis because logs don’t change (it’s one of the reasons that Elasticsearch is focused on this use case with its ELK stack) but less so for use cases like e-commerce search with data that changes frequently.

Machine learning

The days of relying on simple inverted index ranking algorithms are gone. To improve relevancy and results, language embeddings (vectors) and machine learning should be core to every search project. AI search technology provides a faster, automated feedback loop to improve search result ordering to maximize business goals such as purchases or engagement.

Older technologies like Lucene that rely entirely on index matching typically need an army of people writing business rules to make up for an endless set of deficiencies.

Search with machine learning outperforms basic keyword search. ML-powered re-ranking (sometimes called learning to rank) offers a secondary ranking stage that tries to fix the deficiencies of naive keyword ranking.
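A two-stage ranking flow can be sketched roughly as follows. Everything here is an assumption for illustration: the keyword scorer is a simple term-overlap count, and the click_rate dictionary stands in for the output of a trained model, which this sketch does not include.

```python
def keyword_score(query, text):
    """First stage: count how many query terms appear in the document text."""
    terms = set(query.lower().split())
    words = set(text.lower().split())
    return len(terms & words)

def rerank(query, docs, click_rate, top_k=10, weight=2.0):
    """Second stage: reorder the top keyword matches using a learned signal.

    click_rate is a hypothetical stand-in for a model's predicted engagement.
    """
    scored = [(keyword_score(query, text), doc_id) for doc_id, text in docs.items()]
    # Keep only documents that matched at least one term, best matches first.
    candidates = sorted((s for s in scored if s[0] > 0), reverse=True)[:top_k]
    blended = [(score + weight * click_rate.get(doc_id, 0.0), doc_id)
               for score, doc_id in candidates]
    return [doc_id for _, doc_id in sorted(blended, reverse=True)]

# Hypothetical catalog and engagement data.
docs = {
    "a": "running shoes for trail",
    "b": "running shoes for road",
    "c": "trail shoes waterproof",
}
click_rate = {"a": 0.1, "b": 0.2, "c": 0.9}
print(rerank("running shoes", docs, click_rate))
```

With these sample numbers, document "c" matches fewer query terms than the others but is promoted by the engagement signal, which is the kind of correction a keyword-only ranker cannot make by itself.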

Time to value

Lucene can get you where you need to go, but with hidden costs. It’s free to use, but not cost-free. Building a v0.1 Lucene search application from start to finish might not take long, but optimizing it to achieve high relevance, fast speeds, and better results, and making it production-ready for large scale projects requires a serious investment.

The faster path for adding site search is via a more modern search project that provides the basic Lucene-like search capabilities but also comes production-ready in terms of scale, machine learning, data pipelines and much more.

This is why projects like the top 5 search products presented below have been able to attract customers away from using Lucene. They help customers achieve value faster at less cost.

Top Apache Lucene Alternatives

Search.io

Search.io is a user-friendly site search engine built from the ground up for developers. It’s an entirely new hosted service built on proprietary technology.

Search.io has the combined power of a full-text search engine and a database. It uses real-time indexes, its own data layout and flow, and its own binary encoding methodology to provide lightning-fast results without taking a performance hit like Lucene-based solutions.

It offers tremendous flexibility and ease of configuration built on top of a cloud-native architecture for elastic scale. Machine learning (more specifically, reinforcement learning) is built into the core product for continuous improvement of search performance. Because it’s fully-hosted and battle-tested with billions of queries, you can spend more time working on your core business without having to manage search scale.

Search.io features include:

  • Instant indexing with full-text crawler, including document (PDF, DOCX) search
  • Easy to add advanced search capabilities via simple YAML-based configuration
  • Machine learning included for always-on improvement
  • Fully hosted and performant to thousands of queries per second
  • Search UI generation and UI component libraries to build search experiences

Best use cases:

Additionally, Search.io has taken a different approach to configuration and extensibility, moving configuration from config.xml files to a core, built-in feature called pipelines. Pipelines are YAML-based scripts that define a series of steps which are executed sequentially when indexing a record (record pipeline) or performing a query (query pipeline). With pipelines, you can configure the search algorithm to improve search relevance or even A/B test different algorithms to determine which one provides the best search experience.
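The sequential-step idea behind pipelines can be sketched in a few lines of Python. This is not Search.io's pipeline syntax or API (pipelines there are defined in YAML): the step names, the synonym table, and the query dictionary shape are all hypothetical, chosen only to show steps running in order.

```python
def lowercase(query):
    query["text"] = query["text"].lower()
    return query

def expand_synonyms(query):
    synonyms = {"sneakers": "shoes"}  # hypothetical synonym table
    query["text"] = " ".join(synonyms.get(w, w) for w in query["text"].split())
    return query

def boost_in_stock(query):
    # Record a ranking hint for a later stage (hypothetical field name).
    query.setdefault("boosts", []).append("in_stock")
    return query

def run_pipeline(steps, query):
    """Apply each step in order, passing the result of one step to the next."""
    for step in steps:
        query = step(query)
    return query

# A query pipeline is just an ordered list of steps.
query_pipeline = [lowercase, expand_synonyms, boost_in_stock]
result = run_pipeline(query_pipeline, {"text": "Red Sneakers"})
print(result)
```

Because a pipeline is an ordered list, comparing two algorithms amounts to running two different lists of steps against the same traffic.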

Core features such as crawling, autocomplete, schema configuration, document indexing, synonyms, filters, and faceted search are all baked in. In addition, Search.io offers a REST-like API for connecting to business data; SDKs for Node, PHP, and Go; and React and JavaScript libraries for complete front-end customization.

Sign up for a free 14-day trial of Search.io.

Elasticsearch

Elasticsearch is an API product built on a Lucene core. Elasticsearch is a specialized search engine that has built a massive community around logging analytics projects with its popular ELK stack, which was open source up until 2021.

Elasticsearch offers great flexibility and scale for different use cases. For best results, it requires teams of specialist engineers who have the time, resources, and capabilities to eke out higher performance or develop custom features. It’s ideal for projects that generate massive amounts of immutable data like log analysis (this is where Lucene-based search shines as log data does not change) and SIEM use-cases.

Elasticsearch features include:

  • Instant indexing and full-text search, including document (PDF, DOCX) search
  • Scalability and resilience for high-volume use cases
  • Popular community and support

Best use cases:

  • Logging and log analytics (along with Logstash)
  • Full-text search
  • Scraping and combining with public datasets
  • Metrics (along with Kibana)

Elasticsearch is available either as a free download or fully hosted through Elastic or other providers (including AWS, more on that below), so there are many options for getting a project started.

Elastic acquired Swiftype in 2017. Swiftype is built on top of Elasticsearch (it’s a wrapper around Elasticsearch which is a wrapper around Lucene!) and better supports full-text search use cases.

Solr

The Apache Solr project, based on Apache Lucene, was originally created by CNET to provide full-text search across the company’s massive media database. Since 2004, Solr has seen many iterations and improvements and has built an enormous community of contributors. It’s a powerful and scalable search engine written in Java with a full complement of libraries for C#, PHP, Python and other languages, and offers HTTP REST-like APIs with support for both XML and JSON.

For larger use cases with a team of dedicated search engineers, Solr is a solid search software solution. But for businesses that want to allocate engineering resources differently, Solr’s strengths — scalability, configurability, extensibility — can also be liabilities.

Even small projects can require days of engineering to get up and running in staging environments (let alone production use cases). Today, alternative search solutions built on cloud-native architecture can offer the same degree of configurability and scale in much less time.

Algolia

Like Search.io, Algolia is a new search engine built from the ground up. Originally, Algolia was developed for mobile search use cases, but has since been extended to more traditional search projects. Algolia can boast about its retrieval speed; it’s milliseconds faster than the competition. Those few milliseconds won’t matter for most use cases, but if speed is important, Algolia is worth a look. As a fully-hosted product, Algolia also eliminates the need for cluster management.

Algolia features include:

  • Typo tolerant full-text search
  • Simple and easy to understand tie-breaking relevance algorithm
  • Global language support
  • Very fast information retrieval

Best use cases:

  • App search
  • Mobile search

Algolia is popular because of how simple and easy it is to get started. It’s a great general-purpose search engine. But it has its critics too, particularly around pricing and the complexity of managing custom rules and configurations. For example, anytime Algolia re-indexes the database (such as for A/B testing), it counts against the monthly search query quota. Features such as machine learning are add-ons that also cost more. Its ranking algorithm is a simple tie-breaking algorithm, which is easier to understand but also less flexible and powerful than other solutions on the market.
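To illustrate what a tie-breaking ranking algorithm means in general, here is a small sketch. The three criteria and the sample records are assumptions for illustration and are not Algolia's actual criteria list: results are sorted on the first criterion, and each later criterion is consulted only to separate records that tie on all earlier ones.

```python
def tie_break_rank(records):
    """Sort by criteria in priority order; later criteria only break ties."""
    return sorted(
        records,
        key=lambda r: (
            r["typos"],           # fewer typos first
            -r["matched_words"],  # then more matched words
            -r["popularity"],     # then higher popularity
        ),
    )

# Hypothetical records.
records = [
    {"id": "a", "typos": 0, "matched_words": 2, "popularity": 10},
    {"id": "b", "typos": 0, "matched_words": 2, "popularity": 50},
    {"id": "c", "typos": 1, "matched_words": 2, "popularity": 900},
]
print([r["id"] for r in tie_break_rank(records)])
```

Note that record "c" ranks last despite its much higher popularity, because popularity is never weighed against typos. That strict ordering is what makes this approach easy to reason about, and also what limits its flexibility.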

Cloud search providers: Azure, AWS, and GCP

The major cloud providers offer their own search services, some of which are based on Lucene derivatives. These include Amazon Elasticsearch Service (since renamed Amazon OpenSearch Service), which is built on a fork of the Elasticsearch open source core; Azure Cognitive Search, which uses Lucene analyzers and query syntax; and Amazon CloudSearch, which is built on top of Solr. Google has built its own search service (Cloud Search) from scratch.

Cloud providers offer both private and public hosted search solutions. If your app is hosted in one of these providers, then it might be worth considering them for your search service as well. Co-locating your search service with your app makes a lot of sense for reducing latency.

The pros and cons of each cloud service provider and software vary a lot. But they have some similarities:

  • Great support and availability of APIs for embedding search into almost any site or application
  • Built to scale, but still require a good deal of hand-holding
  • Each solution requires expertise and overhead for managing the search instance
  • Best suited if you want to co-locate search with your site or app

Best use cases:

  • Enterprise search
  • Custom full-text search applications
  • Monitoring and analysis

For developers who want to stay with a purely open source search engine and build new search applications from scratch, there are also open source projects such as Sphinx, Typesense, and Meilisearch. For most search applications, the top 5 solutions above are the best bet for seeing immediate results.
