Over the last few months, we’ve been quietly introducing automatic search pipeline optimization. We call it the Data Gym. The Gym “works out” how to maximize on-site conversions such as signups, purchases, revenue, margins, or whatever outcomes our customers are aiming for.
We’re still in the early stages of Data Gym, but the results so far have been very encouraging. For example, thanks to pipeline improvements, one of our customers is already on track to see a $6.2 million increase in annual ROI, and we’re just getting started!
What is it?
Data Gym is an automated framework for generating optimized pipelines that drive higher conversions for our e-commerce customers. Pipelines, introduced in 2020, define a series of steps that sequentially produce a search outcome. For example, a pipeline might include transformations or boosts for certain kinds of data. By changing the order or weight of a pipeline step, you can change search results.
While customers can A/B test pipelines to determine which ones drive the best results, the Data Gym offers optimized results even faster and with less iteration. In fact, it allows new Enterprise customers with limited performance data available to have a very competitive pipeline as soon as they sign up.
Pipeline features are optimized so that the ranking a pipeline produces is as similar as possible to an optimal ranking derived from historical data based on Beta scores. The similarity between the two rankings is measured with the rank-biased overlap (RBO) metric. In other words, by leveraging historical data, the Data Gym runs a virtual play-through in an e-commerce simulation to determine the best outcomes for our customers.
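To make the ranking comparison concrete, here is a minimal Python sketch of rank-biased overlap in its simplest truncated form. It is an illustration only, not the Data Gym's implementation: the function name, the persistence parameter p, and the choice to skip extrapolation beyond the evaluated depth are all assumptions.

```python
def rbo_truncated(ranking_a, ranking_b, p=0.9, depth=None):
    """Truncated rank-biased overlap between two rankings (illustrative sketch).

    p controls how strongly the top of the ranking dominates: a smaller p
    puts more weight on the first few positions. Items are assumed to be
    unique within each ranking.
    """
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    max_depth = min(len(ranking_a), len(ranking_b))
    depth = max_depth if depth is None else min(depth, max_depth)

    score = 0.0
    for d in range(1, depth + 1):
        # Agreement at depth d: the share of items common to both top-d prefixes.
        overlap = len(set(ranking_a[:d]) & set(ranking_b[:d]))
        score += p ** (d - 1) * overlap / d
    return (1 - p) * score
```

Because the sum stops at a finite depth, two identical rankings of length k score 1 - p**k rather than 1. What matters for comparing candidate pipelines is the relative score: a ranking that disagrees with the ideal one near the top scores lower than one that disagrees further down.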
How it works
The Data Gym builds on a black-box evolutionary optimization technique called the cross-entropy method. The distributions of the best pipeline parameters are estimated during the optimization process as well. These distributions provide valuable insight into how much confidence we can place in the optimized pipeline features.
The Data Gym works through the pipeline in sequence: it looks at each element in the pipeline and searches for the points at which the value is highest and lowest. This sequential, element-wise optimization is a conservative way to optimize the pipeline, since each feature score is optimized individually.
A gradient-based optimizer is also used to find local minima. All filters search for the optimal feature scores simultaneously. The Gym uses various other exploration techniques, too. For example, it provides the distributions of feature scores against the best metrics, so that the best scores can be aggregated over the episodes. All of this is to ensure we’re simulating the optimal pipeline configuration.
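The general shape of the cross-entropy method can be sketched in a few lines of Python. This is a generic textbook-style version, not the Data Gym's code: the function name, the independent Gaussian per parameter, and the default population and elite sizes are all assumptions made for illustration.

```python
import random
import statistics


def cross_entropy_search(score_fn, n_params, iterations=30,
                         population=50, elite_frac=0.2, seed=0):
    """Maximize score_fn over n_params real-valued weights (illustrative sketch)."""
    rng = random.Random(seed)
    means = [0.0] * n_params   # assumed starting guess for every weight
    stds = [1.0] * n_params    # assumed starting uncertainty
    n_elite = max(2, int(population * elite_frac))

    for _ in range(iterations):
        # Sample candidate weight vectors from the current distributions.
        candidates = [
            [rng.gauss(m, s) for m, s in zip(means, stds)]
            for _ in range(population)
        ]
        # Keep the best-scoring candidates (the "elites").
        candidates.sort(key=score_fn, reverse=True)
        elites = candidates[:n_elite]
        # Refit each parameter's distribution to the elites.
        for i in range(n_params):
            column = [c[i] for c in elites]
            means[i] = statistics.fmean(column)
            stds[i] = statistics.pstdev(column) + 1e-6  # floor keeps sampling valid
    return means, stds
```

The returned standard deviations are the per-parameter distributions mentioned above in miniature: a parameter whose spread stays wide is one the score is not very sensitive to, while a tight spread suggests higher confidence in the chosen value.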
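A sequential, element-wise search of this kind can be sketched as a simple coordinate search. The code below is an assumption-laden illustration rather than the Gym's actual procedure: the function name, the fixed grid of candidate values, and the number of sweeps are hypothetical, and it tracks only the highest-scoring value for each element.

```python
def coordinate_search(score_fn, weights, grid, sweeps=2):
    """Optimize one weight at a time, holding the others fixed (illustrative sketch)."""
    best = list(weights)
    best_score = score_fn(best)
    for _ in range(sweeps):
        for i in range(len(best)):
            for value in grid:
                # Change only element i and keep every other element as it is.
                trial = best[:i] + [value] + best[i + 1:]
                trial_score = score_fn(trial)
                if trial_score > best_score:
                    best, best_score = trial, trial_score
    return best, best_score
```

The approach is conservative in the sense that every accepted change is a single-element move that strictly improves the score, so the configuration never gets worse from one step to the next.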
Backtesting and validation
Backtesting is a way to determine what would have happened if the pipeline had been configured differently at a point in the past. If the pipeline had X configuration, the outcome would have been Y. For an e-commerce customer, we might use conversion rate as our benchmark. The question the Data Gym is asking is whether the conversion rate would have been better or worse if the pipeline had been optimized differently.
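As a rough illustration of the idea, the Python sketch below replays a log of past searches under a candidate configuration and reports how often the item that was actually purchased would have appeared at the top. Everything here is hypothetical: the log layout, the field names (candidates, features, purchased), and the weighted-sum scoring are assumptions chosen for the example, not the Data Gym's data model.

```python
def rank_with_weights(candidates, weights):
    """Order item ids by a weighted sum of their feature values."""
    def score(item):
        return sum(weights.get(name, 0.0) * value
                   for name, value in item["features"].items())
    return [item["id"] for item in sorted(candidates, key=score, reverse=True)]


def backtest(log, weights, top_k=1):
    """Share of logged searches whose purchased item would rank in the top_k."""
    if not log:
        raise ValueError("log must contain at least one entry")
    hits = 0
    for entry in log:
        ranking = rank_with_weights(entry["candidates"], weights)
        if entry["purchased"] in ranking[:top_k]:
            hits += 1
    return hits / len(log)
```

Running backtest over the same log with two different weight dictionaries gives a like-for-like comparison of configuration X against configuration Y without exposing real shoppers to either one.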
The Data Gym looks at thousands of potential outcomes to determine which one is most likely to benefit the customer. But there’s another step in the process: validating the results.
Validation is critical in the optimization process. The Data Gym optimizes against the performance data while regularizing the learning with queries that are not present in the training dataset. To generalize well, it is essential to optimize for the knowns and regularize for the unknowns. The Data Gym has complex, domain-driven algorithms to make this possible.
In addition, the Data Gym continuously learns from historical performance data over time, so that it follows changing sales trends. In other words, after launching the new, optimized pipelines, we continue to monitor performance to ensure results are what we would expect.
Although we are still working on adding new features to the Data Gym, such as a customer front-end, novel feature discovery, and multi-processing via system commands, its current capabilities have already brought our customers an immediate return on investment beyond their expectations.
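One simple way to keep some queries out of the training data is a deterministic split on the query text, sketched below in Python. This is a generic hold-out technique shown under stated assumptions, not a description of the Gym's domain-driven algorithms: the function names, the query field, and the 20% hold-out fraction are hypothetical.

```python
import hashlib


def is_validation_query(query, holdout_fraction=0.2):
    """Deterministically decide whether a query belongs to the held-out set."""
    digest = hashlib.sha256(query.encode("utf-8")).digest()
    # The first byte of the hash maps the query to a stable value in [0, 1).
    return digest[0] / 256 < holdout_fraction


def split_log(log, holdout_fraction=0.2):
    """Split log entries so that every distinct query lands on exactly one side."""
    train, validation = [], []
    for entry in log:
        if is_validation_query(entry["query"], holdout_fraction):
            validation.append(entry)
        else:
            train.append(entry)
    return train, validation
```

Splitting by query rather than by individual log row means repeated searches for the same query never appear on both sides, so a score measured on the validation side reflects queries the optimizer has not seen.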