How does Solr work?
I have previously discussed the benefits of Solr's super-fast search platform, and why it makes life easier for businesses with thousands, or millions, of documents that need to be searched. I will now answer the question: how does Solr work?
As previously explained, Solr is not a database but a document storage and retrieval engine that enhances existing databases. Every piece of data in Solr is a document: a record, typically drawn from a database, with at least one field that is defined in schema.xml and assigned a particular field type.
Solr performs text analysis on both indexed content and search queries in order to match variations of a word, recognise synonyms, remove common function words, known as stop words ("a", "the", "of", etc.), and score each result based on how well it matches the query. This ensures the best results are returned first, and that customers do not have to scroll through countless irrelevant results to find the content they want to see. Solr accomplishes this by using an index that maps each term to the documents containing it, rather than mapping each document to its content as in a traditional database model. This inverted index is, at its heart, how search engines work.
The inverted index
To demonstrate how powerful the inverted index is, I will present some practical examples of issues a user can experience when searching on any website.
For example, a user looking through an extensive book database for a book on how to get better at running might search for “How to become a better runner.” With a conventional database, an exact-match query would look something like this:
SELECT * FROM Books
WHERE Name = 'becoming a better runner';
Here, the book will not be found unless the query matches its title exactly. The query below does find the book, but it also returns every book with “becoming”, “a”, “better”, or “runner” anywhere in its title, and the pattern '%a%' matches any title that merely contains the letter “a”. This approach returns so many results that it becomes difficult for a user to find the appropriate book.
SELECT * FROM Books
WHERE Name LIKE '%becoming%'
OR Name LIKE '%a%'
OR Name LIKE '%better%'
OR Name LIKE '%runner%';
A traditional database representation of multiple documents would contain a document ID, mapped to one or more content fields containing all the words/terms in that document. An exact match would retrieve the correct document. An inverted index reverses this model and maps each word/term in the index to all of the documents in which it appears.
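The difference between the two models can be sketched in a few lines of Python. This is only an illustration of the idea, not how Solr or Lucene store their index internally; the sample titles and the helper names build_inverted_index and search_all_terms are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample data in the "traditional" shape: document ID -> content.
books = {
    1: "Becoming a Better Runner",
    2: "A Better Kitchen",
    3: "The Runner",
}

def build_inverted_index(docs):
    """Reverse the model: map each lowercased term to the IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search_all_terms(index, query):
    """Return the IDs of documents that contain every term in the query."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()

index = build_inverted_index(books)
print(sorted(index["better"]))                           # documents containing "better"
print(sorted(search_all_terms(index, "better runner")))  # documents containing both terms
```

Looking up a term is a single dictionary access, regardless of how many documents exist, which is why this structure suits full-text search better than scanning every title with LIKE.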
Solr’s inverted index has some additional features that bring the user the most relevant results. For example, Solr distinguishes between words, and understands linguistic variations (such as “becoming” vs “become”). Additionally, Solr understands synonyms, so related terms will show up in the results. Stop words, such as “a” or “the”, can be excluded from both the query and the index. And, of course, the search results are ordered by their relevance to the search term.
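A rough idea of what such an analysis step does can be given in Python. This is a toy sketch under loud assumptions: the stop-word list, the synonym map, and the crude suffix-stripping stem function are all made up for illustration, and Solr's real analyzers (configured per field type) are considerably more sophisticated.

```python
import re

# All three of these are hypothetical, illustrative settings.
STOP_WORDS = {"a", "the", "of", "to", "how"}
SYNONYMS = {"jogger": "runner"}
SUFFIXES = ("ing", "es", "s", "e")

def stem(term):
    """Very naive stemmer: strip one known suffix if at least 3 letters remain."""
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term

def analyze(text):
    """Lowercase, tokenize, drop stop words, map synonyms, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = [SYNONYMS.get(t, t) for t in tokens]
    return [stem(t) for t in tokens]

print(analyze("How to become a better runner"))
print(analyze("Becoming a Better Runner"))
```

Because the same analysis is applied to the indexed title and to the user's query, “become” and “becoming” end up as the same term, so the query can match the title even though the wording differs.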
In order to create a great search experience, finding documents that match the search term is critical. A majority of customers are not willing to wade through page after page of search results to find what they’re looking for. In fact, only 10% of customers are willing to go beyond the first page of any given search on most websites, and only 1% are willing to navigate to the third page. So, not only is it crucial that relevant documents are retrieved, but also that these documents are displayed in a relevant order.
To assign a relevance score, Solr uses a scoring formula whose components can be influenced at several points:
score(q,d) = Σ (t in q) [ tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) ] · coord(q,d) · queryNorm(q)
where:
t = term; d = document; q = query; f = field
tf(t in d) = sqrt(numTermOccurrencesInDocument)
idf(t) = 1 + log(numDocs / (docFreq + 1))
coord(q,d) = numTermsInDocumentFromQuery / numTermsInQuery
queryNorm(q) = 1 / sqrt(sumOfSquaredWeights)
sumOfSquaredWeights = q.getBoost()² · Σ (t in q) [ idf(t) · t.getBoost() ]²
norm(t,d) = d.getBoost() · lengthNorm(f) · f.getBoost()
Term frequency (tf) is a measure of how often a particular term appears in a matching document, and indicates how “well” a document matches the term.
Inverse document frequency (idf) is a measure of how “rare” a search term is, and is calculated by finding the document frequency (how many documents the search term appears in), and calculating its inverse.
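To make these two factors concrete, here is a small Python sketch. It assumes the square-root damping of the raw term count and the natural logarithm used by Lucene's classic TF-IDF similarity; the corpus size and document frequencies are invented numbers.

```python
import math

def tf(num_term_occurrences_in_document):
    """Term frequency, damped with a square root (an assumption taken from
    Lucene's classic similarity) so repeated occurrences give diminishing returns."""
    return math.sqrt(num_term_occurrences_in_document)

def idf(num_docs, doc_freq):
    """Inverse document frequency: the rarer the term, the higher the value."""
    return 1 + math.log(num_docs / (doc_freq + 1))

# Hypothetical corpus of 1,000 documents.
rare = idf(num_docs=1000, doc_freq=9)      # term appears in only 9 documents
common = idf(num_docs=1000, doc_freq=999)  # term appears in almost every document

print(tf(1), tf(4))
print(round(rare, 3), round(common, 3))
```

A term that appears in nearly every document contributes little to the score, while a rare term such as a distinctive word from a title contributes much more.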
If you already have domain knowledge about your content, you may know that certain fields or terms are more (or less) important than others. In that case, you can supply boosts at either indexing time or query time to ensure the weights of those fields or terms are adjusted accordingly.
The field normalisation factor (field norm) is a combination of factors describing the importance of a particular field on a per-document basis.
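The sketch below shows how boosts and field length could combine into the field norm, following norm(t,d) = d.getBoost() • lengthNorm(f) • f.getBoost(). The definition of lengthNorm as one over the square root of the number of terms in the field is an assumption borrowed from Lucene's classic similarity, and the boost values and field lengths are hypothetical.

```python
import math

def length_norm(num_terms_in_field):
    """Assumed definition: shorter fields get a higher norm, so a match in a
    four-word title counts for more than a match in a long body."""
    return 1.0 / math.sqrt(num_terms_in_field)

def field_norm(doc_boost, field_boost, num_terms_in_field):
    """norm(t,d) = document boost * length norm * field boost."""
    return doc_boost * length_norm(num_terms_in_field) * field_boost

short_title = field_norm(doc_boost=1.0, field_boost=1.0, num_terms_in_field=4)
long_body = field_norm(doc_boost=1.0, field_boost=1.0, num_terms_in_field=400)
boosted_title = field_norm(doc_boost=1.0, field_boost=2.0, num_terms_in_field=4)

print(short_title, long_body, boosted_title)
```

Doubling the field boost doubles the norm, which is how domain knowledge about important fields feeds into the final score.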
So, this is how Solr works at a glance. The inverted index, which maps each word/term in the index to all of the documents in which it appears, combined with a scoring algorithm that ranks results by relevance, makes Solr a powerful search platform.