Controlling Search Scores

Author: Dave Cassel  |  Category: Software Development

Tonight’s post is based on a question from Amit, a reader:

I am back again with a doubt. We know that the default algorithm used by MarkLogic for the relevance calculation is “score-logtfidf”. In my project this is the default option as well, but it is giving preference to documents of smaller size, and hence non-scientific materials are taking precedence in the result set. How can I change the score so that non-scientific articles get a negative precedence?

Amit also sent me the Search API options that he’s using.

Scoring Algorithms

As described in the documentation for cts:search(), the default scoring algorithm is score-logtfidf, which uses the formula:

log(term frequency) * (inverse document frequency)

There’s also score-logtf, which skips the inverse document frequency factor, along with score-simple and score-random, neither of which uses term frequency or IDF. The score-random algorithm is great for getting a random selection of data. Let’s take a closer look at the terms of the default algorithm.
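
These options are passed on the search itself. Here’s a minimal XQuery sketch of selecting a scoring algorithm with cts:search; the "cat" query and the result slicing are just for illustration.

    xquery version "1.0-ml";

    (: Pass a scoring option as the third argument of cts:search.
       "score-logtfidf" is the default; "score-random" returns results in
       random order, which is handy for sampling. :)
    let $query := cts:word-query("cat")
    return (
      cts:search(fn:doc(), $query, "score-logtfidf")[1 to 5],
      cts:search(fn:doc(), $query, "score-random")[1 to 5]
    )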

Term Frequency

The idea of term frequency is quite simple: how many times does a term appear in a document? However, we don’t just take the raw number; all else being equal, longer documents contain more occurrences of a term, so raw counts would artificially inflate their scores. Think about a search for “cat”. Suppose one document has 100 words and two of them are “cat”, while another document has 5,000 words and two of them are “cat”. The shorter one is more likely to be focused on cats. The raw count is therefore adjusted in two ways. First, MarkLogic uses the log of the term frequency count. Second, that log is scaled based on the length of the document. It’s precisely this scaling that Amit is looking to avoid. Let’s take a look at the other major factor and then we’ll come back to this.
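
If you want to see these numbers for yourself, cts:score reports the score of each result. A quick sketch, assuming the database has some documents that mention "cat":

    xquery version "1.0-ml";

    (: Show each matching document's URI next to the score assigned by the
       default score-logtfidf algorithm. :)
    for $doc in cts:search(fn:doc(), cts:word-query("cat"))[1 to 10]
    return fn:concat(xdmp:node-uri($doc), " : ", cts:score($doc))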

Inverse Document Frequency

The default scoring algorithm also accounts for how common a term is throughout the corpus. Suppose you search for “cat AND feline”. In most corpora, “cat” will be much more common than “feline”; all else being equal, the standard scoring algorithm will award more points to matches on “feline” than to matches on “cat”. Being the more distinctive term, it’s considered more important. You can find a fuller description of TF/IDF on Wikipedia.
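
Here’s a sketch of that two-term search; the terms and the result slicing are just illustrative, but it shows where the IDF weighting comes into play.

    xquery version "1.0-ml";

    (: An AND of two terms. With score-logtfidf, matches on the rarer term
       ("feline" in most corpora) contribute more to the score than matches
       on the common term ("cat"). :)
    let $query := cts:and-query((cts:word-query("cat"), cts:word-query("feline")))
    for $doc in cts:search(fn:doc(), $query)[1 to 10]
    return fn:concat(xdmp:node-uri($doc), " : ", cts:score($doc))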

Term Frequency Scaling

As mentioned above, the term frequency is scaled based on the length of the document. The database configuration page in the Admin UI provides a setting to control how that is done. The “tf normalization” setting offers these choices:

  • unscaled-log
  • weakest-scaled-log
  • weakly-scaled-log
  • moderately-scaled-log
  • strongly-scaled-log
  • scaled-log

These values provide a range of scaling, from the full, default scaling (scaled-log) to unscaled-log, which does not scale the term frequency based on the length of the document. Something to remember about these settings is that they are not specified per search; rather, the choice is made in the database configuration. When it changes, the database needs to reindex, so this is definitely not something to change at runtime. Instead, you make a choice about how you want search to work on a particular database and stick with it.

The Answer

So the answer for Amit is to set the “tf normalization” setting for his database to unscaled-log. This will remove the scaling that adjusts for the length of documents, and should allow his longer documents to get higher scores.
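
If you’d rather script the change than click through the Admin UI, the Admin API has a setter for this property (I believe it’s admin:database-set-tf-normalization; confirm against the Admin API docs for your version). A sketch, with the database name as a placeholder; remember that saving this configuration kicks off a reindex.

    xquery version "1.0-ml";

    import module namespace admin = "http://marklogic.com/xdmp/admin"
      at "/MarkLogic/admin.xqy";

    (: Switch tf normalization to unscaled-log for the "Documents" database
       (a placeholder name). Saving the configuration triggers a reindex. :)
    let $config := admin:get-configuration()
    let $config := admin:database-set-tf-normalization(
                     $config, xdmp:database("Documents"), "unscaled-log")
    return admin:save-configuration($config)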

(There are other approaches, such as setting the document quality of unfavored documents to something lower than 0, but I think this addresses Amit’s specific need.)
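
As a sketch of that alternative, assuming the unfavored articles live in a collection named "non-scientific" and that the URI lexicon is enabled:

    xquery version "1.0-ml";

    (: Give unfavored documents a negative quality so they score lower.
       The collection name is a placeholder for however you identify
       non-scientific articles. :)
    for $uri in cts:uris((), (), cts:collection-query("non-scientific"))
    return xdmp:document-set-quality($uri, -10)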


2 Responses to “Controlling Search Scores”

  1. Amit Says:

    Hi David,

    Thank you very much for the clear insight on this topic. But what if reindexing is not an option and a user cannot reindex his/her database? What approach would be required then to solve this issue?

    Regards
    Amit

  2. Dave Cassel Says:

    Amit, if reindexing is not an option, two other approaches come to mind. If you simply want to restrict the search results to the scientific articles, you can make that another part of the search, either always (use the Search API’s <additional-query> element) or by letting the user control it (provide a document type facet so the user can choose to see only the scientific articles). A different approach, if you always want to include non-scientific articles but want to decrease their scores, is to adjust their document quality: give the non-scientific articles a negative value.
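
    For example, something along these lines in the Search API options covers the first two ideas; the "scientific" collection and the constraint name are placeholders, and in practice you’d pick one approach or the other.

        xquery version "1.0-ml";

        import module namespace search = "http://marklogic.com/appservices/search"
          at "/MarkLogic/appservices/search/search.xqy";

        (: additional-query always limits results to a "scientific" collection
           (placeholder name); the collection constraint exposes a facet that
           the user can drive instead. Normally you'd choose one or the other. :)
        let $options :=
          <options xmlns="http://marklogic.com/appservices/search">
            <additional-query>
              <cts:collection-query xmlns:cts="http://marklogic.com/cts">
                <cts:uri>scientific</cts:uri>
              </cts:collection-query>
            </additional-query>
            <constraint name="type">
              <collection prefix=""/>
            </constraint>
          </options>
        return search:search("cat", $options)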

    I’m inclined to suggest the facet approach, but the right choice depends on the details of your situation.
