How does Solr work?
In my previous post I explained why Solr is needed, covered the benefits of this super-fast search platform, and showed why it makes life easier when you have thousands or millions of documents that need to be searched. In this second part I will zoom in on how Solr works.
First, a recap: Solr is not a database, but a document storage and retrieval engine that enhances the one you already have. Every piece of data in Solr is a document. A document can be anything you wish to retrieve from a database. It has at least one field, defined in the schema.xml with a particular field type. So, one can say that a document is a collection of fields that map to field types defined in a schema.
Solr performs text analysis on certain content and on search queries to determine similar words, understand and match on synonyms, remove unimportant words like “a,” “the,” and “of,” and score each result based upon how well it matches the incoming query to ensure that the best results are returned first and that your customers do not have to page through countless less-relevant results to find the content they were expecting. Solr accomplishes all of this by using an index that maps content to documents instead of mapping documents to content as in a traditional database model. This inverted index is at the heart of how search engines work.
The inverted index
To show how powerful this inverted index is, I would like to walk through some practical search problems a user may encounter. Let’s say the user is searching a huge book database and wants to become a better runner, so they search: How to become a better runner
With a conventional database the query looks like this:
SELECT * FROM Books
WHERE Name = 'becoming a better runner';
Here, the book will not be found because the query doesn’t match the title exactly.
SELECT * FROM Books
WHERE Name LIKE '%becoming%'
OR Name LIKE '%a%'
OR Name LIKE '%better%'
OR Name LIKE '%runner%';
In the situation above, the book will be found – along with every book that has “a”, “better” or “runner” in the title. No doubt you can imagine that there will be many results and that it will be hard to find the appropriate book among them!
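For contrast, the equivalent lookup in Solr is a single search request over HTTP. A minimal sketch of building such a request follows; the host, port, and collection name ("books") are assumptions for illustration, so adjust them to your own installation.

```python
from urllib.parse import urlencode

# Build a Solr /select request. Unlike the SQL LIKE queries above,
# the q parameter is analyzed and matched per term, not compared
# literally against the whole title.
params = {
    "q": "become a better runner",
    "df": "name",    # default field to search in
    "rows": 10,      # page size
}
url = "http://localhost:8983/solr/books/select?" + urlencode(params)
```

Sending a GET request to this URL returns the matching documents, ranked by relevance, as we will see below.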
A traditional database representation of multiple documents would contain a document’s ID mapped to one or more content fields containing all of the words/terms in that document. An exact match would retrieve the correct document. An inverted index inverts this model and maps each word/term in the index to all the documents in which it appears.
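This inversion can be sketched in a few lines of Python. The book titles and IDs here are made up for illustration:

```python
# A minimal inverted index: map each term to the set of
# document IDs in which it appears.
books = {
    1: "becoming a better runner",
    2: "the complete runner",
    3: "cooking for beginners",
}

inverted_index = {}
for doc_id, title in books.items():
    for term in title.split():
        inverted_index.setdefault(term, set()).add(doc_id)

# Looking up a term is now a single dictionary access,
# rather than a scan over every row of a table.
query = "better runner".split()
or_matches = set().union(*(inverted_index.get(t, set()) for t in query))
and_matches = set.intersection(*(inverted_index.get(t, set()) for t in query))
```

A multi-term query simply unions (OR) or intersects (AND) the sets of document IDs for each term, which is why lookups stay fast even across millions of documents.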
Next to the inverted index, Solr has some additional functionalities to bring the user the most relevant results. For example, Solr distinguishes between words - it understands linguistic variations, such as “becoming” versus “become”. Additionally, it understands synonyms, so related terms will show up in the results as well. Unimportant words such as “a”, “an” or “the“ can be excluded from the query and index. Yet the powerful ordering in the results based on relevancy is retained.
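The idea behind this text analysis can be illustrated with a toy analysis chain. The stop-word list and the crude suffix-stripping “stemmer” below are my own simplifications, not Solr’s actual filters (Solr uses configurable tokenizers and filters, such as a Porter stemmer and a synonym filter):

```python
# Toy analysis chain: lowercase, drop stop words, strip common suffixes.
STOP_WORDS = {"a", "an", "the", "of", "to", "how"}

def analyze(text):
    terms = []
    for token in text.lower().split():
        if token in STOP_WORDS:
            continue  # unimportant words are excluded entirely
        # Naive stemming so that variations like "become" and
        # "becoming" reduce to the same term.
        for suffix in ("ing", "ed", "er", "e", "s"):
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                token = token[: -len(suffix)]
                break
        terms.append(token)
    return terms
```

Because the same chain runs over both the indexed titles and the incoming query, “How to become a better runner” and “becoming a better runner” reduce to the same terms and therefore match.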
Finding matching documents is a critical step in creating a great search experience. Most customers aren’t willing to wade through page after page of search results to find the documents they’re seeking. In our general experience, only 10% of customers are willing to go beyond the first page of any given search on most websites, and only 1% are willing to navigate to the third page. So, it is very important that not only the relevant documents are retrieved, but also that these documents are displayed in order of relevancy.
To compute this relevance score, Solr uses an algorithm that can be influenced here and there (we’ll talk about that in part 3). I won’t bore you with the actual algorithm, or maybe I will..!
score(q,d) = Σ ( tf(t in d) • idf(t)² • t.getBoost() • norm(t,d) ) • coord(q,d) • queryNorm(q)
t in q
t = term; d = document; q = query; f = field
tf(t in d) = numTermOccurrencesInDocument
idf(t) = 1 + log(numDocs / (docFreq + 1))
coord(q,d) = numTermsInDocumentFromQuery / numTermsInQuery
queryNorm(q) = 1 / √(sumOfSquaredWeights)
sumOfSquaredWeights = q.getBoost()² • Σ ( idf(t) • t.getBoost() )²
norm(t,d) = d.getBoost() • lengthNorm(f) • f.getBoost()
Term frequency (tf) is a measure of how often a particular term appears in a matching document; it’s an indication of how “well” a document matches the term.
Inverse document frequency (idf), a measure of how “rare” a search term is, is calculated by finding the document frequency (how many documents the search term appears in), and calculating its inverse.
If you have domain knowledge about your content, i.e. you know that certain fields or terms are more (or less) important than others, then you can supply boosts at either indexing time or query time to ensure that the weights of those fields or terms are adjusted accordingly.
The field normalization factor (field norm) is a combination of factors describing the importance of a particular field on a per-document basis.
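To make these pieces concrete, here is a back-of-the-envelope version of the formula, with all boosts and field norms fixed at 1 so that only tf, idf, coord, and queryNorm remain. The two “documents” are made up for illustration:

```python
import math

docs = {
    1: "becoming a better runner".split(),
    2: "the runner runner handbook".split(),
}
num_docs = len(docs)

def doc_freq(term):
    # In how many documents does the term appear?
    return sum(1 for terms in docs.values() if term in terms)

def idf(term):
    # idf(t) = 1 + log(numDocs / (docFreq + 1))
    return 1 + math.log(num_docs / (doc_freq(term) + 1))

def score(query_terms, doc_id):
    doc_terms = docs[doc_id]
    matching = [t for t in query_terms if t in doc_terms]
    # coord rewards documents that match more of the query's terms.
    coord = len(matching) / len(query_terms)
    # queryNorm makes scores comparable across queries (boosts = 1).
    query_norm = 1 / math.sqrt(sum(idf(t) ** 2 for t in query_terms))
    # Σ over matching terms of tf(t in d) · idf(t)²  (boosts/norms = 1).
    total = sum(doc_terms.count(t) * idf(t) ** 2 for t in matching)
    return total * coord * query_norm

query = ["better", "runner"]
```

Scoring this query against both documents shows document 1 winning: matching both terms (coord = 1) outweighs document 2 repeating a single term, which is exactly the behaviour the coord factor is there to encourage.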
So, at a glance, this is how Solr works. The inverted index, which maps each word/term in the index to all of the documents in which it appears, combined with a smart relevance algorithm, makes Solr a powerful search platform. In the third and final part, I will discuss how to optimise it and how to get the best out of Solr.