Deep Web Tech Blog

Ranking: The Secret Sauce for Searching the Deep Web

One of the most powerful features of Deep Web Technologies’ Explorit Federated Search is its ability to rank the results from the many collections that might be included in a federated search (a.k.a. deep web search). This is useful for two reasons. First, it helps rank results from sources that don’t otherwise rank results. The ever-popular PubMed, a service of the U.S. National Library of Medicine and the National Institutes of Health, is an example of this. PubMed doesn’t rank its results. As a consequence, any search service that ranks the results it retrieves from PubMed, such as ours, adds tremendous value.

(1) Stemming

Stemming is the process of converting words to their base (or root) forms. In the simplest case, it makes sure that a pluralized search term will find singular terms in the results, and vice versa. In English this can be as simple as dropping “s” or “es” from words, but the process can become more complex. Consider “mouse/mice” and “person/people”. The specific stemming algorithm we use is the Porter Stemming Algorithm. For the most part, we do not need to stem search terms before submitting them to the collections we search. Occasionally, we may need to explicitly indicate to a collection that we want to perform a stemmed search or an exact search.
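To make the simplest case concrete, here is a minimal Python sketch of plural stripping. It implements only step 1a of the published Porter algorithm (the plural-suffix rules), not the full stemmer, and it is not Explorit's code. The function names `strip_plural` and `stem_match` are hypothetical, chosen for illustration.

```python
def strip_plural(word: str) -> str:
    """Apply Porter step 1a only: the rules for plural suffixes."""
    w = word.lower()
    if w.endswith("sses"):
        return w[:-2]  # caresses -> caress
    if w.endswith("ies"):
        return w[:-2]  # ponies -> poni
    if w.endswith("ss"):
        return w       # caress -> caress (unchanged)
    if w.endswith("s"):
        return w[:-1]  # cats -> cat
    return w


def stem_match(term: str, candidate: str) -> bool:
    """Hypothetical helper: two words match if their stripped forms agree."""
    return strip_plural(term) == strip_plural(candidate)


print(stem_match("gene", "genes"))  # singular and plural share a stem
print(stem_match("mouse", "mice"))  # suffix rules cannot link irregular plurals
```

Suffix rules of this kind cover regular plurals only. Irregular pairs such as “mouse/mice” do not reduce to a common form here, which is why they are the harder case.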

(2) Conducting Relevance Weighting

We analyze search term occurrence within a search result and assign weights for different factors. We look for occurrences of both exact terms and stemmed terms. We can assign relative weights to different result fields. We can also assign higher weights to results from a more important collection, as well as to more recent results. We also consider:

  • Search Term Position – We examine where search terms appear within particular fields (e.g. title, author, snippet) and give special consideration to whether a search term occupies the first word position, the last word position, or a position relative to either.
  • Search Term Density – We find significance in how often search terms appear within fields (i.e. individual fields and the full record). Aside from counting the number of occurrences of search terms within fields, we consider the ratio of search term length to result field length. For example, a one-word title that is the same as the search term would be highly relevant.
  • Search Term Proximity – We consider how closely search terms occur relative to one another. When evaluating this, we look at the number of search terms within the query expression and the distance between recurring search terms. In returned results, this ratio, in conjunction with the length of the fields, can be significant.
  • Search Term Ordinality – If search terms appear in the same order as specified in the search expression, this can be significant and is afforded greater weight than if the terms appear out of order. Likewise, multiple in-order occurrences are important.
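The four factors above can be sketched as simple functions over the words of a single result field. This is only an illustration under stated assumptions: the formulas, the `WEIGHTS` values, and every function name are hypothetical, and the actual Explorit weighting is proprietary and not described in this post.

```python
import re

# Hypothetical relative weights; real values are not published.
WEIGHTS = {"position": 2.0, "density": 3.0, "proximity": 1.5, "ordinality": 2.5}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def position_score(terms: set[str], field: list[str]) -> float:
    """1.0 if a search term is the first or last word of the field."""
    if not field:
        return 0.0
    return 1.0 if field[0] in terms or field[-1] in terms else 0.0


def density_score(terms: set[str], field: list[str]) -> float:
    """Fraction of the field's words that are search terms."""
    if not field:
        return 0.0
    return sum(1 for w in field if w in terms) / len(field)


def proximity_score(terms: set[str], field: list[str]) -> float:
    """Inverse of the smallest gap between two different search terms."""
    hits = [(i, w) for i, w in enumerate(field) if w in terms]
    gaps = [j - i for (i, a), (j, b) in zip(hits, hits[1:]) if a != b]
    return 1.0 / min(gaps) if gaps else 0.0


def ordinality_count(query: list[str], field: list[str]) -> int:
    """Number of times the query terms occur contiguously in query order."""
    n = len(query)
    if n == 0:
        return 0
    return sum(1 for i in range(len(field) - n + 1) if field[i:i + n] == query)


def field_score(query_text: str, field_text: str) -> float:
    """Combine the four factors into one weighted score for a field."""
    query, field = tokenize(query_text), tokenize(field_text)
    terms = set(query)
    return (WEIGHTS["position"] * position_score(terms, field)
            + WEIGHTS["density"] * density_score(terms, field)
            + WEIGHTS["proximity"] * proximity_score(terms, field)
            + WEIGHTS["ordinality"] * ordinality_count(query, field))


# A title that is exactly the query outranks one that merely mentions it.
print(field_score("deep web", "Deep Web")
      > field_score("deep web", "Searching the web is deep work"))
```

A fuller version would compute such a score per field and then apply the per-field, per-collection, and recency weights mentioned earlier. Those are left out to keep the sketch short.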

(3) Proprietary Algorithms

Once we’ve analyzed the exact and stemmed search terms against the factors above and assigned weights, we use our proprietary algorithms to assign an actual rank. These algorithms operate on the Boolean operators AND, OR and NOT, and the search query expression is evaluated from left to right. Exact phrases (contained within double quotation marks) are not stemmed.

If a date range is specified, the date is used as a constraining term, provided that a date is supplied in a result. If a date is not supplied in a result, the relevance for that result is assumed to be zero (i.e. the result is not ranked). Note that such results may still appear in the results list.

Finally, stop words are words considered irrelevant for searching purposes, so we don’t evaluate them. The current list of stop words is: a, about, again, all, almost, also, although, always, among, an, another, any, are, as, at, be, because, been, before, being, between, both, but, by, can, could, did, do, does, done, due, during, each, either, enough, especially, etc, for, found, from, further, had, has, have, having, here, how, however, i, if, in, into, is, it, its, itself, just, made, mainly, make, may, might, most, mostly, must, nearly, neither, no, nor, obtained, of, often, on, our, overall, perhaps, quite, rather, really, regarding, seem, seen, several, should, show, showed, shown, shows, significantly, since, so, some, such, than, that, the, their, theirs, them, then, there, therefore, these, they, this, those, through, thus, to, upon, use, used, using, various, very, was, we, were, what, when, where, which, while, who, why, with, within, without, and would.
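The query-handling rules described here (Boolean operators, unstemmed exact phrases, ignored stop words) can be illustrated with a small tokenizer. This is a hedged sketch rather than the proprietary algorithm: the `Token` class and `parse_query` function are hypothetical, `STOP_WORDS` is an abbreviated subset of the full list, and treating only uppercase AND, OR and NOT as operators is an assumption.

```python
import re
from dataclasses import dataclass

# Abbreviated subset of the stop-word list, for illustration only.
STOP_WORDS = {"a", "about", "all", "for", "in", "of", "the", "to", "with"}
# Assumption: operators are recognized only when written in uppercase.
OPERATORS = {"AND", "OR", "NOT"}


@dataclass
class Token:
    text: str
    kind: str   # "operator", "phrase", or "term"
    stem: bool  # whether stemmed matching should apply to this token


def parse_query(query: str) -> list[Token]:
    """Split a query left to right into operators, exact phrases, and terms."""
    tokens = []
    # Each match is either a double-quoted phrase or a run of non-space characters.
    for phrase, word in re.findall(r'"([^"]*)"|(\S+)', query):
        if phrase:
            # Exact phrases are kept verbatim and never stemmed.
            tokens.append(Token(phrase, "phrase", stem=False))
        elif not word:
            continue  # an empty pair of quotes contributes nothing
        elif word in OPERATORS:
            tokens.append(Token(word, "operator", stem=False))
        elif word.lower() not in STOP_WORDS:
            # Ordinary terms are eligible for stemmed matching.
            tokens.append(Token(word.lower(), "term", stem=True))
    return tokens


for token in parse_query('the "stem cells" AND therapies NOT for mice'):
    print(token)
```

In this example the stop words “the” and “for” are dropped, the quoted phrase is marked as not stemmed, and the operators keep their left-to-right positions for later evaluation.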

In Summary

Deep Web Technologies uses a strong ranking algorithm that considers a number of factors and assigns relative weights to the relationship between the search terms and the results. To some extent, weights can be modified according to client preferences, and in all cases ranking can add tremendous value to a federated (or deep web) search.