At Search.io we’ve had great success using reinforcement learning to continuously and automatically improve our search result rankings. We can even benchmark algorithm improvement over time across different datasets, languages and completely different search ranking problems.
Machine learning is increasingly part of the search ranking algorithm in all search technology. However, due to the immutability of most search technology indexes reinforcement learning typically incurs massive performance overheads due to the frequent scoring updates, so to date more focus has been spent on learn-to-rank for these technologies.
We use both learn-to-rank and reinforcement learning techniques, this article is focused more on the latter.
Learn to rank
Learn-to-rank systems take a “gold standard” set of human labelled (or feedback based, eg. click data) query-result pairs combined with generated features to create machine learning models to improve the ranking of search results.
In the past, search ranking has been mostly human configured, and is thus immediately limited by one person’s bias, and their ability to visualise the dataset as a whole. As a result, these rankings typically work well for a portion of queries and fail hopelessly for others. Fixing the failures without causing problems elsewhere is very hard. A good analogy would be pulling a bunch of levers, the issue at top of mind improves, but it’s not immediately clear how this impacts everything else. Search is not a simple problem and humans can’t balance this problem across thousands or even millions of different queries.
Configuring standard search is like pulling levers.
Learn-to-rank is designed to partially take this away from human control and instead let machine learning do the optimisation at an individual query level. Typically this means computing the results with a static human configured approach (this is fast and efficient) and then re-ranking the top X results with a machine learning model (not so efficient).
This approach is pretty smart and has been adopted by some of the major open source search technologies to produce great outcomes. But there are downsides, a) it still requires a lot of engineering work to get up and running and b) the query-result pair index scores are fixed or at best seldomly updated. This means the algorithm inputs remain static, a step forward, but using reinforcement it’s possible to do even better and improve results at a much faster rate.
The basic idea of reinforcement learning is quite simple, use feedback to reinforce (hence strengthen) positive outcomes. Instead of making large changes infrequently, reinforcement learning makes frequent incremental changes. There are many upsides to this, such as continuously improving results and faster surfacing of other potential results. Poorly performing results also tend to fall away quickly through rolling experimentation.
In our case high performing (clicks, sales, signups, etc) search results for a given query indicate a positive outcome, which is fed back into the query-result intersection score in the search index. This score is a type of confidence interval that balances the uncertainty of the sample size with observed performance. Confidence intervals are used in all sorts of places, for example in Reddit’s article ranking.
Wilson confidence intervals are described nicely by Evan Miller, where he shows how lots of rating algorithms make the mistake of using the simple calculation of average ratings instead, which is highly unreliable for smaller samples. Directly quoted from Evan:
Given the ratings I have, there is a 95% chance that the “real” fraction of positive ratings is at least what?
In the search world a positive rating can mean different things, a search result was clicked, or led to a later event such as a sale, etc. We mostly focus on clicks as there is significantly more available data (faster to get to higher confidence), but we also use later events if there is sufficient data.
When using clicks to determine ratings you need to correct for position bias, short clicks (dissatisfaction with the clicked result) and various other factors. The positive ratings then roughly correlate to results clicked more frequently and negative ratings to those less frequently clicked. The confidence interval helps to correct for the sample size by calculating a probability distribution.
So at this point we have a confidence interval distribution for each query-result pair. We then do something similar to a Bayesian bandit algorithm, which randomly selects a score from our confidence interval distribution for each query-result pair. When the sample number is low, the interval is broad, but as the sample number increases, the confidence interval tightens and the best results begin to consistently score higher.
Chris Stucchio’s article on bandit algorithms illustrates sharpening of the probability distribution with increase sample size.
Measuring ranking success
In order to determine success we need to be able to measure how well we are ranking across all queries. To do this we currently use normalised Discounted Cumulative Gain (nDCG). This is explained nicely by Hugh Williams in a post on measuring search relevance.
In a nutshell nDCG describes how optimally a set of results are ranked against the best possible ordering (IDCG). The Ideal Discounted Cumulative Gain (IDCG) provides a benchmark score for the optimal result ordering and the DCG calculation gives a score for an actual result set. The ratio of DCG / IDCG then gives us nDCG.
This isn’t perfect as our IDCG calculation is determining the ideal ordering using the same performance data as the reinforcement, but keep in mind it would be impossible to get human scores for all our queries (millions) and also our reinforcement learning score is only one of many ranking algorithm inputs, so in practice this is OK. The more data we collect the more result experimentation occurs and the higher the IDCG value climbs, hence the ranking also needs to improve to keep nDCG in step.
Monitoring ranking performance
In practice nDCG is continuously climbing for most of our search indexes as the reinforcement learning optimises the index scores, but occasionally it drops. To monitor this (and our overall performance) we use BigQuery to periodically process the most recent search and click data (terabytes) to calculate ranking performance.
Typical queries process many gigabytes of data for each day included in the calculation, and this query takes only a few seconds to process. The output looks as follows:
nDCG this week and change from last week by collection.
Aside from direct ranking model changes and other experiments, reinforcement learning improves our average nDCG by ~1% per week, which is pretty impressive.
This type of ranking performance analysis can also easily be broken down by all the other data we collect. e.g. by company, language, location, user-agent, feature flag, ranking experiment, custom user variables, etc.
We can’t even read all the languages we support, but we know exactly how well their ranking is performing…
Aside from the ranking efficiency, IDCG also allows us to see which individual queries have no good possible results. From a customer perspective this helps to show content gaps in user demand. As we expand our reporting capabilities this year our customers will get even more granular access to these and more analytics regarding content and customer intent and we will continue to drive alerting and optimisation of the underlying technology.
Reinforcement learning is an amazing technique to assist the self-learning ranking of search results. It undoubtedly helps to automate the optimisation of our search technology, bringing it closer to our goal of ranking information with human intelligence, but millions of times faster.
The down sides of reinforcement learning are more subtle, such as managing expectations at day zero with no historical performance data. It’s not easy to tell a potential customer their performance will improve noticeably once they start sending production queries.
We also need better ways to explain probabilistic optimisation to businesses, as conceptually people still tend to focus on individual searches and the particular result they expected to top rank.
It’s difficult to tell people the result set they are looking at is not “the result set”, but actually just one probabilistic sample projection based on past historical performance.
To make it even more complicated, boosting trees and neural networks make the ranking even harder to explain!
Although self-learning indexes are rare today, they will become the standard in years to come as machine learning further influences data and the underlying storage structures. Recently learned index structures have been the topic of much discussion, we see this as just the beginning of huge changes to the way data is stored and processed in both databases, search indexes and hybrid approaches.