Harry Marr

Full text search with MongoDB

Here I’ll present a simple full-text search engine that uses MongoDB as its backend. It’s implemented using MongoEngine, and is intended as a proof of concept rather than a viable alternative to “real” search engines such as Solr, Sphinx, etc.

What will the search engine do?

The search engine will index the documents in a given MongoDB collection. Multiple fields may be indexed, and a weight may be assigned to each field. The words in each field will be split up and stemmed. A mapping between the words and the indexed documents will be stored in the database, and will be used when a query is run to determine the relevance of each document.
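As a rough sketch of that splitting-and-stemming step (assuming NLTK’s PorterStemmer is used; the extract_terms helper is illustrative rather than the actual implementation):

import re
from collections import defaultdict

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def extract_terms(text, weight):
    # Lower-case the text, split it into words, stem each word, and
    # accumulate the field's weight for every occurrence of a term
    term_weights = defaultdict(float)
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        term_weights[stemmer.stem(word)] += weight
    return term_weights

Stemming is what lets “searching” and “searched” both map to the term “search”, and summing the weights means a word that appears three times in a field with weight 1.0 ends up with weight 3.0.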

How should the index be stored?

Most search engines use inverted indexes, which map terms to documents. A simple way of storing an inverted index in MongoDB would be to have a collection where the _id is a term, and a documents field holds a list of references to documents in the collection we are indexing. However, as MongoDB allows us to index on list fields, a more sensible approach is to store the reference to the indexed document as the _id in the index collection, and have a list field that contains the terms that appear in the document.

To allow terms from different fields to carry different weights, embedded documents containing the term and a weight will be stored in the terms field, rather than just the term. A nice side effect of this is that terms that appear multiple times within a document can be stored as one embedded document; the weight will be the sum of the weights of the individual occurrences of the term. Let’s see the code for that:

from mongoengine import Document, EmbeddedDocument, fields

class SearchTerm(EmbeddedDocument):
    term = fields.StringField()
    weight = fields.FloatField()

class DocumentIndex(Document):
    # The index entry's _id is the id of the document being indexed
    doc_id = fields.StringField(primary_key=True)
    terms = fields.ListField(fields.EmbeddedDocumentField(SearchTerm))
    # The indexed document's length (|D| in BM25), which the ranking
    # function below reads as doc.length
    length = fields.IntField()
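Tying the pieces together, indexing a document might then look something like the following sketch, reusing the extract_terms helper from earlier (the function name and the way |D| is counted are my own choices):

def index_document(doc_id, weighted_fields):
    # weighted_fields is a list of (text, weight) pairs,
    # e.g. [(post.title, 2.0), (post.body, 1.0)]
    term_weights = defaultdict(float)
    length = 0
    for text, weight in weighted_fields:
        for term, w in extract_terms(text, weight).items():
            term_weights[term] += w
        length += len(text.split())  # rough word count for |D|
    terms = [SearchTerm(term=t, weight=w) for t, w in term_weights.items()]
    DocumentIndex(doc_id=doc_id, terms=terms, length=length).save()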

Querying

To perform a search, a textual query will be split and stemmed in a similar manner to the indexed documents. Each document will be compared to the query to determine its relevance, then the documents will be returned with an associated “score”.
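Reusing the extract_terms helper sketched earlier, preparing the query terms might be as simple as:

# The dict keys are the stemmed query terms
query_terms = list(extract_terms('full text search', 1.0))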

The ranking function I’ve opted to use is BM25. For efficiency, the ranking will be executed on the server, rather than downloading the entire index and performing the ranking client-side in Python. To do this, the ranking function will be written in JavaScript and run using the exec_js method on a MongoEngine QuerySet.

BM25 uses the inverse document frequency (IDF) of each query term to rank the documents in a collection. The IDF measures how important a term is across a collection by capturing its rarity: it will be low for common words, and high for words that occur in only a small number of documents. The IDFs will be calculated before the main ranking occurs:

from math import log

idfs = {}
# Get the total number of documents in the collection
num_docs = DocumentIndex.objects.count()
for term in query_terms:
    # Use the number of docs that contain the term to calculate the IDF
    term_docs = DocumentIndex.objects(terms__term=term).count()
    idfs[term] = log((num_docs - term_docs + 0.5) / (term_docs + 0.5))
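That log expression is the IDF component of BM25. For reference, the full BM25 score for a document D and query Q, which the server-side function below computes term by term, is:

    score(D, Q) = sum over query terms qi of
        IDF(qi) * f(qi, D) * (k1 + 1)
                 / (f(qi, D) + k1 * (1 - b + b * |D| / avgdl))

where f(qi, D) is qi’s frequency in D, |D| is the document’s length, avgdl is the average document length across the collection, and k1 and b are free parameters (typically around 1.2 and 0.75).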

Now that we have a dictionary of the IDF for each term, we can define the ranking function:

function() {
    var results = {};
    // Iterate over each document to calculate the document's score
    db[collection].find(query).forEach(function(doc) {
        var score = 0;
        // Iterate over each term in the document, calculating the
        // score for the term, which will be added to the doc's score
        doc.terms.forEach(function(term) {
            // Only look at the term if it is part of the query
            if (options.queryTerms.indexOf(term.term) != -1) {
                // term.weight is equivalent to the term's
                // frequency in the document
                //
                // f(qi, D) * (k1 + 1)
                var dividend = term.weight * (options.k + 1);
                // |D| / avgdl
                var relDocSize = doc.length / options.avgDocLength;
                // (1 - b + b * |D| / avgdl)
                var divisor = 1.0 - options.b + options.b * relDocSize;
                // f(qi, D) + k1 * (1 - b + b * |D| / avgdl)
                divisor = term.weight + divisor * options.k;
                // Divide the top half by the bottom half
                var termScore = dividend / divisor;
                // Then scale by the inverse document frequency
                termScore *= options.idfs[term.term];
                // The document's score is the sum of its terms scores
                score += termScore;
            }
        });
        results[doc._id] = score;
    });
    return results;
}
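To actually run the ranking on the server, the function above (held as a string, say in a hypothetical ranking_js variable) is passed to exec_js, which makes collection, query, and the keyword arguments (as options) available in the JavaScript scope. A minimal sketch, assuming the usual BM25 parameter values:

# avgdl in BM25: the average document length across the index
avg_doc_length = DocumentIndex.objects.average('length')
results = DocumentIndex.objects.exec_js(
    ranking_js,
    queryTerms=query_terms,
    idfs=idfs,
    k=1.2,
    b=0.75,
    avgDocLength=avg_doc_length,
)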

And that’s pretty much it: we get back a dictionary with document ids as the keys and relevance scores as the values. In the future it would be nice to add sorting by relevance and a way of saying “only give me back the top n results”, but for the time being that can be done in Python:

from operator import itemgetter
from heapq import nlargest

num_results = 10
# Take the n highest-scoring (doc_id, score) pairs
top_matches = nlargest(num_results, results.items(), key=itemgetter(1))

Optimisations

Firstly, as MongoDB is a schema-free database, it stores the field names along with each field on every document. As we are storing a large number of terms, renaming term and weight on the SearchTerm embedded document will save a fair bit of space.
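MongoEngine’s db_field argument makes this painless, as the attributes keep their readable names in Python; a minimal sketch:

class SearchTerm(EmbeddedDocument):
    # Stored as 't' and 'w' in MongoDB, but still accessed as
    # .term and .weight in Python
    term = fields.StringField(db_field='t')
    weight = fields.FloatField(db_field='w')

Secondly, rather than ranking all documents, we could use a query that only includes documents that contain at least one of the search terms: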

query = DocumentIndex.objects(terms__term__in=query_terms)

Because the query variable inside the ranking function reflects the QuerySet’s filter, the server-side loop will then only visit documents that match. As I mentioned earlier, this will not perform nearly as well as proper search servers, but it seems to produce reasonable results in the limited tests I’ve run. The full code for this is available on GitHub, along with an example of how to use it.