Inside the Data Gym: The Science Behind Optimizing Search Results Automatically

Over the last few months, we’ve quietly been rolling out automatic search pipeline optimization. We call it the Data Gym. The Gym “works out” how to maximize on-site conversions such as signups, purchases, revenue, margins, or whatever outcomes our customers are aiming for.

We’re still in the early stages of Data Gym, but the results so far have been very encouraging. For example, thanks to pipeline improvements, one of our customers is already on target to see a $6.2 million increase in annual ROI, and we're just getting started!

What is it?

Data Gym is an automated framework for generating optimized pipelines that drive higher conversions for our e-commerce customers. Pipelines, introduced in 2020, define a series of steps that sequentially produce a search outcome. For example, a pipeline might include transformations or boosts for certain kinds of data. By changing the order or weight of a pipeline step, you can change search results. 
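To make the idea of weighted pipeline steps concrete, here is a minimal sketch. It is not the Search.io pipeline API: the `Step` class, the two scoring functions, and the additive scoring model are all hypothetical, chosen only to show how changing a step's weight changes the result order.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical types: a record is a plain dict, and a step scores a (query, record) pair.
ScoreFn = Callable[[str, dict], float]

@dataclass
class Step:
    name: str      # illustrative step name, not a real Search.io step
    weight: float  # the tunable parameter an optimizer would adjust
    score: ScoreFn

def title_match(query: str, record: dict) -> float:
    """Return 1.0 if any query term appears in the title, else 0.0."""
    title = record["title"].lower()
    return 1.0 if any(term in title for term in query.lower().split()) else 0.0

def in_stock_boost(query: str, record: dict) -> float:
    """Return 1.0 for records that are in stock, else 0.0."""
    return 1.0 if record.get("in_stock") else 0.0

def run_pipeline(steps, query, records):
    """Run the steps in order, add up weighted scores, and sort best first."""
    totals = [0.0] * len(records)
    for step in steps:
        for i, record in enumerate(records):
            totals[i] += step.weight * step.score(query, record)
    order = sorted(range(len(records)), key=lambda i: -totals[i])
    return [records[i]["id"] for i in order]

records = [
    {"id": "a", "title": "Red running shoes", "in_stock": False},
    {"id": "b", "title": "Blue sandals", "in_stock": True},
]
text_heavy = [Step("title_match", 2.0, title_match), Step("in_stock", 1.0, in_stock_boost)]
stock_heavy = [Step("title_match", 1.0, title_match), Step("in_stock", 2.0, in_stock_boost)]

print(run_pipeline(text_heavy, "running shoes", records))   # ['a', 'b']
print(run_pipeline(stock_heavy, "running shoes", records))  # ['b', 'a']
```

Because this toy model simply sums the weighted scores, only the weights matter here. In a real pipeline, where steps can transform data for later steps, the order of the steps would matter as well.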

Figure: Search.io pipelines allow customers to add features such as data transformations or make changes to the search algorithm.

While customers can A/B test pipelines to determine which ones drive the best results, the Data Gym offers optimized results even faster and with less iteration. In fact, it allows new Enterprise customers with limited performance data available to have a very competitive pipeline as soon as they sign up. 

Pipeline features are optimized to maximize the similarity between the ranking the pipeline produces and an optimal ranking derived from historical data based on Beta scores. Similarity is measured with the rank-biased overlap (RBO) metric. In other words, by leveraging historical data, the Data Gym runs a virtual playthrough within an e-commerce simulation to determine the best outcomes for customers.
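The comparison described above can be sketched as follows. Two things here are assumptions rather than details from this post: "Beta scores" is read as the posterior mean of a Beta distribution over each result's conversion rate, and the RBO is a truncated version rescaled so that identical lists score exactly 1.0 (the full metric also extrapolates beyond the evaluated depth). The function names and the sample counts are made up for illustration.

```python
def beta_mean(conversions: int, impressions: int) -> float:
    """Posterior mean of a Beta(1, 1) prior updated with observed counts (assumed scoring)."""
    return (conversions + 1) / (impressions + 2)

def rbo(ranking_a, ranking_b, p=0.9):
    """Truncated rank-biased overlap, rescaled so identical lists score 1.0.

    Assumes neither ranking contains duplicates. Smaller p puts more
    weight on agreement at the top of the lists.
    """
    depth = min(len(ranking_a), len(ranking_b))
    if depth == 0:
        return 0.0
    seen_a, seen_b = set(), set()
    weighted_agreement = 0.0
    total_weight = 0.0
    for d in range(1, depth + 1):
        seen_a.add(ranking_a[d - 1])
        seen_b.add(ranking_b[d - 1])
        agreement = len(seen_a & seen_b) / d  # overlap of the two top-d prefixes
        weight = p ** (d - 1)
        weighted_agreement += weight * agreement
        total_weight += weight
    return weighted_agreement / total_weight

# Hypothetical history for one query: item -> (conversions, impressions).
history = {"a": (30, 100), "b": (5, 100), "c": (12, 100)}
target = sorted(history, key=lambda item: -beta_mean(*history[item]))

print(target)                        # ['a', 'c', 'b']
print(rbo(target, ["a", "c", "b"]))  # 1.0
print(rbo(target, ["b", "c", "a"]))  # lower, because the top results disagree
```

An optimizer would then search for the pipeline parameters whose output ranking yields the highest RBO against `target`, averaged over many queries.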

Figure: Data Gym enforces the expected ranking by optimizing the pipeline parameters, guided by a signal derived from the historical performance of the queries.

How it works

The Data Gym builds on a black-box evolutionary optimization technique called the cross-entropy method. The distributions of the best pipeline parameters are estimated during the optimization process as well. These distributions provide invaluable insight into how much confidence we can place in the optimized pipeline features.
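For readers unfamiliar with the cross-entropy method, the following is a generic textbook-style sketch, not the Data Gym's implementation. It assumes independent Gaussian distributions per parameter and uses a made-up `toy_metric` in place of a real ranking metric. The loop samples candidate parameters, keeps the best ones, and refits the distribution to them.

```python
import random
import statistics

def cross_entropy_method(objective, n_params, iterations=30,
                         population=50, n_elite=10, seed=0):
    """Maximize a black-box objective; return per-parameter means and std devs."""
    rng = random.Random(seed)
    means = [0.0] * n_params
    stds = [5.0] * n_params  # start wide so early samples explore broadly
    for _ in range(iterations):
        # 1. Sample candidate parameter vectors from the current distribution.
        samples = [[rng.gauss(m, s) for m, s in zip(means, stds)]
                   for _ in range(population)]
        # 2. Keep the elite candidates, i.e. those with the highest objective.
        samples.sort(key=objective, reverse=True)
        elite = samples[:n_elite]
        # 3. Refit the distribution to the elite set.
        for j in range(n_params):
            column = [candidate[j] for candidate in elite]
            means[j] = statistics.fmean(column)
            stds[j] = statistics.pstdev(column)
    return means, stds

def toy_metric(weights):
    """Stand-in for a ranking metric; its maximum is at weights = (2.0, -1.0)."""
    return -((weights[0] - 2.0) ** 2 + (weights[1] + 1.0) ** 2)

means, stds = cross_entropy_method(toy_metric, n_params=2)
print([round(m, 2) for m in means])
print(stds)
```

The returned standard deviations are the by-product the paragraph above refers to: a parameter whose distribution stays wide is one the objective is not very sensitive to, while a narrow distribution signals a confident estimate.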

Figure: cross-entropy optimization.

The Data Gym works through the pipeline in sequence: it looks at each element in turn and searches for the points at which the value is highest and lowest. This sequential, element-wise optimization is a conservative method for determining how to optimize the pipeline. Each feature score in the pipeline is individually optimized.
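A minimal sketch of sequential, element-wise optimization follows, assuming a simple grid scan per element. The grid, the `toy_metric` objective, and the function name are hypothetical. Each parameter is scanned while the others are held fixed, and both the best and the worst grid points are recorded.

```python
def elementwise_search(objective, initial, grid):
    """Optimize one parameter at a time, holding the others fixed."""
    params = list(initial)
    report = []
    for i in range(len(params)):
        scored = []
        for value in grid:
            trial = params.copy()
            trial[i] = value
            scored.append((objective(trial), value))
        best_score, best_value = max(scored)
        worst_score, worst_value = min(scored)
        params[i] = best_value  # lock in the best value before moving on
        report.append({"index": i, "best": best_value, "worst": worst_value})
    return params, report

def toy_metric(weights):
    """Stand-in for a ranking metric; its maximum is at weights = (2.0, -1.0)."""
    return -((weights[0] - 2.0) ** 2 + (weights[1] + 1.0) ** 2)

grid = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
best, report = elementwise_search(toy_metric, [0.0, 0.0], grid)
print(best)    # [2.0, -1.0]
print(report)
```

The approach is conservative in the sense that it changes only one element at a time, so each change can be attributed to a single feature score. The trade-off is that it can miss improvements that require moving several parameters together.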

Figure: search optimization histogram.

A gradient-based optimizer is also used to find local minima. All filters search for their optimal feature scores simultaneously. The Gym uses various other exploration techniques, too. For example, the distributions of feature scores against the best metric values are recorded so that the best scores can be aggregated over the episodes. All this is to ensure we’re simulating the optimal pipeline configuration.
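As a contrast with element-wise search, here is a small sketch of a gradient-based update that moves all parameters at once. It assumes a smooth loss and central finite differences. The post does not say how the Data Gym obtains gradients, so `toy_loss`, the step size, and the function names are illustrative only.

```python
def numerical_gradient(loss, params, eps=1e-6):
    """Estimate the gradient of the loss with central finite differences."""
    grad = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += eps
        down[i] -= eps
        grad.append((loss(up) - loss(down)) / (2 * eps))
    return grad

def gradient_descent(loss, initial, learning_rate=0.1, steps=200):
    """Update every parameter simultaneously along the negative gradient."""
    params = list(initial)
    for _ in range(steps):
        grad = numerical_gradient(loss, params)
        params = [p - learning_rate * g for p, g in zip(params, grad)]
    return params

def toy_loss(weights):
    """Smooth stand-in loss with a single minimum at weights = (2.0, -1.0)."""
    return (weights[0] - 2.0) ** 2 + (weights[1] + 1.0) ** 2

solution = gradient_descent(toy_loss, [0.0, 0.0])
print([round(w, 3) for w in solution])  # [2.0, -1.0]
```

Ranking metrics are generally not smooth in the pipeline parameters, so a gradient-based method like this only finds a local minimum of whatever smooth surrogate it is given. That is one reason to pair it with a sampling-based method such as the cross-entropy method.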

Figure: search validation and backtesting.

Backtesting and validation

Backtesting is a way to determine what would have happened if the pipeline had been configured differently at a point in the past. If the pipeline had X configuration, the outcome would have been Y. For an e-commerce customer, we might use conversion rate as our benchmark. The question the Data Gym is asking is whether the conversion rate would have been better or worse if the pipeline had been optimized differently. 
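The "if X, then Y" question can be sketched as a replay over logged sessions. This is a simplified proxy rather than the Data Gym's method: it assumes a session would have converted if the purchased item appears in the top `top_k` results under the candidate configuration. The session format and both ranking functions are invented for the example.

```python
def backtest_conversion_rate(sessions, rank_fn, top_k=1):
    """Share of past sessions whose purchased item would rank in the top k."""
    if not sessions:
        return 0.0
    hits = 0
    for session in sessions:
        ranking = rank_fn(session["query"], session["candidates"])
        if session["purchased"] in ranking[:top_k]:
            hits += 1
    return hits / len(sessions)

# Hypothetical historical log.
sessions = [
    {"query": "shoes", "candidates": ["a", "b", "c", "d"], "purchased": "a"},
    {"query": "shoes", "candidates": ["a", "b", "c", "d"], "purchased": "d"},
    {"query": "hats", "candidates": ["e", "f", "g"], "purchased": "e"},
    {"query": "hats", "candidates": ["e", "f", "g"], "purchased": "e"},
]

def config_x(query, candidates):
    """Candidate configuration X: alphabetical order (illustrative only)."""
    return sorted(candidates)

def config_y(query, candidates):
    """Candidate configuration Y: reverse alphabetical order (illustrative only)."""
    return sorted(candidates, reverse=True)

print(backtest_conversion_rate(sessions, config_x))  # 0.75
print(backtest_conversion_rate(sessions, config_y))  # 0.25
```

A real backtest would also need to account for the fact that past purchases were influenced by the ranking shown at the time, which this proxy ignores.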

The Data Gym looks at thousands of potential outcomes to determine which one is most likely to benefit the customer. But there’s another step in the process: validating the results.

Validation is critical in the optimization process. The Data Gym optimizes against the performance data while regularizing the learning using queries that are not present in the training dataset. To generalize well, it is essential to optimize for the knowns and regularize for the unknowns. Data Gym uses complex, domain-driven algorithms to make this possible.
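One simple way to express "optimize for knowns, regularize for unknowns" is to penalize a configuration for the gap between its score on training queries and its score on held-out queries. The sketch below is an assumption about the general idea, not the Data Gym's domain-driven algorithm. The penalty form, the score table, and the configuration names are all hypothetical.

```python
import statistics

def pick_config(candidates, score, train_queries, holdout_queries, penalty=1.0):
    """Pick the config with the best training score after a generalization penalty."""
    def regularized(config):
        train = score(config, train_queries)
        holdout = score(config, holdout_queries)
        # Penalize only when the config does worse on unseen queries.
        return train - penalty * max(0.0, train - holdout)
    return max(candidates, key=regularized)

# Hypothetical per-query metric values for two candidate configurations.
table = {
    "overfit":  {"q1": 0.9, "q2": 0.9, "q3": 0.5},
    "balanced": {"q1": 0.8, "q2": 0.8, "q3": 0.75},
}

def score(config, queries):
    """Mean metric value of a config over the given queries."""
    return statistics.fmean(table[config][q] for q in queries)

train, holdout = ["q1", "q2"], ["q3"]
print(pick_config(table, score, train, holdout))               # balanced
print(pick_config(table, score, train, holdout, penalty=0.0))  # overfit
```

With the penalty switched off, the configuration that merely fits the training queries wins. With it on, the one that holds up on unseen queries is preferred.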

In addition, the Data Gym continuously learns from historical performance data over time so that it follows changing sales trends. In other words, after launching the new, optimized pipelines, we continue to monitor performance to ensure the results are what we expect.

Although we are still working on adding new features to the Data Gym, such as a customer front-end, novel feature discovery, and multi-processing via system commands, its current capabilities have already brought our customers an immediate return on investment beyond their expectations.
