
The Power of Concept Search for Your Site

For many years, search engines have relied predominantly on keywords. It hasn't worked very well! The good news is that this era is almost over and much brighter days lie ahead. A major transformation of search is underway, driven by representing text mathematically: concepts are taking over, and that's great news for everyone.

In this article, I’ll explain a bit about what concept search is and how the AI technology around it is changing. It’s helpful to first understand the limitations of traditional keyword-based models. 

Background on search

Over 80% of all data is free text, commonly called unstructured data (as opposed to structured data like age, weight, price, addresses, etc.). To find things in unstructured information, search engines have been the tool of choice. The main methodology behind this has been the tokenization of keywords, which turns words into common lookup keys that are then used to build indexes.

To the end user, this essentially replicates the index at the back of a physical book: each token (a single word or phrase) is linked to the pages on which it occurs, so readers can quickly turn to the right page and find what they were looking for, just as search terms lead to matching documents.
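
To make the book-index analogy concrete, here is a minimal sketch of tokenization and an inverted index in Python. The documents and tokenizer are simplified assumptions for illustration, not any particular engine's implementation:

```python
import re
from collections import defaultdict

# Hypothetical catalog of documents (id -> text).
documents = {
    1: "USB-C charging cable for laptops",
    2: "Mini fridge that keeps drinks cold",
    3: "Crewneck cotton t-shirt",
}

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

# Build the inverted index: token -> set of document ids containing it.
inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    for token in tokenize(text):
        inverted_index[token].add(doc_id)

print(inverted_index["cold"])    # {2}
print(inverted_index["usb"])     # {1}  ("usb-c" was split into "usb" and "c")
print(inverted_index["tshirt"])  # set(): the index only knows "t" and "shirt"
```

The last line already hints at the keyword problems discussed next: a perfectly reasonable query token simply has no entry in the index.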

As organizations moved online, enterprise search became a key requirement for information management. As data and information assets exploded in volume, the importance of search only grew. Yet without enriching documents with intelligent metadata, auto-classification, taxonomy management and other methods of adding structure, relevance has typically been poor. The result is that people at work cannot find relevant documents, and this is a big problem.

Why are keywords problematic for search?

Keywords are hard for search engines. You have synonymy (multiple words with shared meaning), polysemy (words with multiple meanings), sequence (order is sometimes important but not always), abbreviations, asymmetry (query words not expected to appear in target results), and more.

In general, keyword search implies you already know the answer to what you’re looking for and how it will be explicitly described. For example:

  • You search for “crewneck” and you don’t find “t-shirts”.
  • “usbc” vs “usb-c” or “usb c”. Some variations have many results and some show no results.
  • “dress shirt” and “shirt dress” return the same results even though the meaning is very different.

In a traditional sense, the goal of search was to take a query and find occurrences of it in a set of items, much like the index at the back of a book. This assumes a symmetrical relationship between the query and the result text, i.e. you search with the answer, not the question. Symmetry assumes you already know the answer.

The context of keywords is typically not useful enough to determine the searcher’s intent. Take the simple example of “bank”. When someone types this, they could mean:

  • Financial institution
  • Side of a river
  • Basketball shot
  • An aeroplane turning

The above is a good example of polysemy. This can also be extended to asymmetry. For example, if someone searches for “plane turning”, this may not return a result that says “plane banked”, yet the meaning is similar. “Plane” itself is also an example of polysemy and an abbreviation of “aeroplane”!

Compound term processing combines terms into groups that have their own meaning, distinct from the individual terms. One example is “new jersey”, which has a totally different meaning from “new” AND “jersey” as individual terms. In practice, keyword search usually handles fully compounded queries well, as it typically requires all terms to match and scores exact sequences higher than documents that merely contain all the individual terms. However, it struggles with partially compounded terms, “bank” being a great example: it will match every contextual occurrence of “bank” because there is no way to determine which context is correct.

Note: the above also assumes queries are treated as AND (all terms must match). In practice, some keyword search uses OR, which matches any of the query terms and is thus far more likely to return contextually irrelevant results. Some search technology also uses a hybrid approach that treats some terms as AND and others as OR, which can be done in a smart or a naive way. Boolean search gives the searcher control over how things are matched by allowing query syntax such as quotes and “AND”, “OR” and “NOT” operators. This can be useful but is generally beyond comprehension for the average person searching.
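
Continuing the inverted-index sketch above, here is roughly how AND and OR semantics differ. This is again a simplified illustration, reusing the tokenize function and inverted_index defined earlier:

```python
def search_and(index, query):
    """AND semantics: every query term must appear in the item."""
    postings = [index.get(t, set()) for t in tokenize(query)]
    return set.intersection(*postings) if postings else set()

def search_or(index, query):
    """OR semantics: any query term may appear in the item."""
    return set().union(*(index.get(t, set()) for t in tokenize(query)))

print(search_and(inverted_index, "cold drinks"))  # {2}: both terms appear in doc 2
print(search_and(inverted_index, "cold beer"))    # set(): "beer" never appears, so AND fails
print(search_or(inverted_index, "cold beer"))     # {2}: "cold" alone is enough under OR
```

The same query that fails under AND matches under OR, which is exactly why OR tends to return more results of lower contextual quality.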

The ways that keywords fail when searching are endless. People have spent their lives writing rules, dictionaries, synonym libraries and more. There are even “AI Search” engines that suggest synonyms you should add so they can try to keep up the facade, but this is a losing game. 

How concept search works

Keywords (and their associated tokens) are essentially binary with respect to search: a particular word either exists in an item or it does not. Concept search is based on vectors. The mathematics of vectors allows closeness to be measured, so the relationship between pieces of text is no longer binary but rather a distribution of similarity.

[Figure: How vector search measures closeness.]

Text is represented as vectors, and texts with close conceptual meaning share very similar vectors. Typically the vector orientation is used rather than the magnitude, so the angle between two vectors becomes a measure of similarity. This is called cosine similarity, and it would be very familiar to anyone who has done high school math! The only difference is that vectors representing text use hundreds of dimensions, so they are harder to visualise than the two-dimensional figure above.
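
For readers who want to see the math, here is cosine similarity in a few lines of Python, with toy three-dimensional vectors standing in for the hundreds of dimensions real embeddings use:

```python
import numpy as np

def cosine_similarity(a, b):
    """Angle-based similarity between two vectors, ignoring their magnitude."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors: the first two point in nearly the same direction, the third does not.
print(cosine_similarity([0.9, 0.1, 0.0], [0.8, 0.2, 0.1]))  # close to 1.0
print(cosine_similarity([0.9, 0.1, 0.0], [0.0, 0.1, 0.9]))  # close to 0.0
```

Values near 1.0 mean the vectors, and therefore the texts they represent, are conceptually close.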

Text to maths

How is text turned into vectors? Neural networks look at word sequences and build vector-based models that can convert text into vectors, called embeddings. There are many examples of these, with more appearing all the time. For example, AirBnB uses embeddings to help power their similar listings feature.
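
As one illustration, the open-source sentence-transformers library can produce such embeddings. The library and model name below are example choices for the sketch, not the specific models behind any product mentioned in this article:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# An example general-purpose embedding model (384 dimensions).
model = SentenceTransformer("all-MiniLM-L6-v2")

texts = [
    "something to keep my beer cold",
    "mini fridge with adjustable thermostat",
    "crewneck cotton t-shirt",
]
embeddings = model.encode(texts)

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Conceptually related texts end up closer together in the vector space.
print(cos(embeddings[0], embeddings[1]))  # beer query vs. fridge: relatively high
print(cos(embeddings[0], embeddings[2]))  # beer query vs. t-shirt: relatively low
```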

Using concepts in search

Concepts are great, but they can also blur the query’s meaning, so keywords are still useful. Thus, state-of-the-art search is built on what is called “hybrid retrieval”, a combination of keyword-based and concept-based search.

Here are some of the ways we designed hybrid retrieval with Neuralsearch, our new AI engine. 

  • Sparse retrieval is based on keywords. This is analogous to the index at the back of a book, where you look up a word and see which pages it appears on. There are some language tweaks like stemming, lemmatisation, stop words and synonyms, but for the most part the query either is or isn’t in the target results.
  • Dense retrieval is based on vectors. Text is converted into math (vectors or hashes) and proximity is used to infer relatedness. This solves many of the issues caused by the exactness of keyword search, but it is also expensive to do. For sparse retrieval you simply look up the list of items matching each keyword (typically a small number of items). For dense retrieval no single number in the vector/hash tells you whether an item matches, so you need to scan a lot of information. To date this has made it relatively cost-prohibitive and slow (though hashes are changing this).
  • Hybrid retrieval combines dense and sparse retrieval. Keyword matches find the exact hits where symmetry is suitable (typically head query terms), while dense retrieval fills the long-tail gaps and handles all the keyword problems described above. Dense retrieval removes the need for synonyms and most rules, understands questions (asymmetry) and much more; one way of blending the two signals is sketched below.
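
Here is a generic sketch of how a sparse score and a dense score can be blended per item. The simple weighted sum and the term-overlap score are illustrative assumptions, not Neuralsearch’s actual fusion method:

```python
import numpy as np

def hybrid_scores(query_terms, query_vector, items, alpha=0.5):
    """Score each item with a weighted blend of a sparse and a dense signal.

    items: list of dicts with 'tokens' (set of keywords) and 'vector' (embedding).
    """
    qv = np.asarray(query_vector, dtype=float)
    scores = []
    for item in items:
        # Sparse signal: fraction of query terms that literally appear in the item.
        sparse = sum(t in item["tokens"] for t in query_terms) / len(query_terms)
        # Dense signal: cosine similarity between query and item embeddings.
        iv = np.asarray(item["vector"], dtype=float)
        dense = float(np.dot(qv, iv) / (np.linalg.norm(qv) * np.linalg.norm(iv)))
        scores.append(alpha * sparse + (1 - alpha) * dense)
    return scores

# Hypothetical items: a mini fridge and a t-shirt, with toy 3-d embeddings.
items = [
    {"tokens": {"mini", "fridge", "cold"}, "vector": [0.9, 0.1, 0.0]},
    {"tokens": {"crewneck", "t", "shirt"}, "vector": [0.0, 0.2, 0.9]},
]
print(hybrid_scores(["beer", "cold"], [0.8, 0.2, 0.1], items))  # fridge wins
```

Ranking by the blended score lets exact keyword hits dominate where they exist, while the dense score fills in when the query and the result share no literal terms.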

The performance problems of dense retrieval are now mostly resolved by vector-specific databases (Milvus, Pinecone, Faiss, Weaviate, etc.) or newer and faster representations such as neural hashes. The main issue with vector-specific databases is the need to maintain two standalone systems. True hybrid needs both retrieval techniques in one system, and this is the new frontier of search.
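
As a small illustration of that standalone-system trade-off, here is a sketch using Faiss, one of the vector libraries mentioned above. The embedding dimension and random data are placeholders; note that this only covers the dense side, while keyword retrieval still has to live in a separate system:

```python
import faiss
import numpy as np

dim = 384                                    # must match your embedding model
vectors = np.random.rand(10_000, dim).astype("float32")
faiss.normalize_L2(vectors)                  # normalise so inner product equals cosine similarity

index = faiss.IndexFlatIP(dim)               # exact inner-product index; ANN indexes also exist
index.add(vectors)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
distances, ids = index.search(query, 5)      # top-5 nearest items
print(ids[0], distances[0])
```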

The result is that we can now offer search that is just as fast as keywords-only (and often faster) and more accurate. One of my favorite examples is running a query on a Best Buy dataset for the phrase “something to keep my beer cold.” If someone walked into your store and asked for “something to keep my beer cold,” you would know exactly what they mean. A keyword-only search engine would have a rough time. However, a hybrid retrieval engine, like Neuralsearch, is able to understand the concepts and deliver incredible results in 0.001043 seconds!

[Figure: An example of concept search.]


Our demo site doesn’t contain any additional metadata. The terms “cold” and “beer” don’t appear on any of the records on the site, but Neuralsearch understands the concepts!

Search.io is an industry leader in concept search and has pioneered new methods of concept searching based on neural hashes. To see for yourself, sign up for a free 14-day trial or contact us for a demo so we can show you what’s possible with your site!
