Deep Web Tech Blog

  • Ranking: The Secret Sauce for Searching the Deep Web

    One of the most powerful features of Deep Web Technologies’ Explorit Federated Search is its ability to rank results from the many collections that might be included in a federated search (a.k.a. deep web search). This is useful for two reasons. First, it helps rank results from sources that don’t otherwise rank results. The ever-popular PubMed, a service of the U.S. National Library of Medicine and the National Institutes of Health, is an example of this. PubMed doesn’t rank its results. As a consequence, any search service that provides results from PubMed and does rank them, such as ours, adds tremendous value.

    Stemming is the process of converting words to their base, or root, forms. In the simplest case, it makes sure that a pluralized search term will find singular terms in the results, and vice versa. In English this can be as simple as dropping “s” or “es” from words, but the process can become more complex. Consider “mouse/mice” and “person/people”. The specific stemming algorithm we use is the Porter Stemming Algorithm. For the most part, we do not need to stem search terms before submitting them to the collections we search. Occasionally, we may need to explicitly indicate to a collection that we want to perform a stemmed search or an exact search.
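
    To make the idea concrete, here is a minimal sketch of plural stripping with a small lookup table for irregular forms. It is a deliberately simplified stand-in and not the Porter Stemming Algorithm; the names IRREGULAR, naive_stem and matches are hypothetical and exist only for illustration.

```python
# Simplified illustration only: this is NOT the Porter Stemming Algorithm.
# IRREGULAR is a hypothetical lookup table for forms that suffix rules miss.
IRREGULAR = {"mice": "mouse", "people": "person"}


def naive_stem(word: str) -> str:
    """Reduce an English word to a rough base form."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    # Drop "es" after sibilant endings (boxes -> box, searches -> search).
    if w.endswith(("sses", "xes", "ches", "shes")):
        return w[:-2]
    # Drop a plain plural "s", leaving words such as "glass" or "virus" alone.
    if len(w) > 3 and w.endswith("s") and not w.endswith(("ss", "us", "is")):
        return w[:-1]
    return w


def matches(term: str, candidate: str) -> bool:
    """True when two words share the same rough base form."""
    return naive_stem(term) == naive_stem(candidate)
```

    With this sketch, matches("mouse", "mice") and matches("searches", "search") both hold, while a real stemmer handles many more suffixes than these two rules.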

    (2) Conducting Relevance Weighting

    We analyze search term occurrence within a search result and assign weights for different factors. We look for occurrences of both exact terms and stemmed terms. We can assign relative weights to different result fields. We can also assign higher weights to results from a more important collection, as well as to more recent results. We also consider:

    • Search Term Position – We examine where search terms appear within particular fields (e.g. title, author, snippet), giving special consideration to whether a search term occupies the first word position, the last word position, or a position relative to either.
    • Search Term Density – We find significance in how often search terms appear within fields (both individual fields and the full record). Aside from counting the number of occurrences of search terms within fields, we consider the ratio of search term length to result field length. For example, a one-word title that is the same as the search term would be highly relevant.
    • Search Term Proximity – We consider how close together search terms occur. When evaluating this, we look at the number of search terms within the query expression and the distance between recurring search terms. In returned results, the ratio between them, in conjunction with the length of the fields, can be significant.
    • Search Term Ordinality – If search terms appear in the same order as in the search expression, this can be significant and is given greater weight than when the terms appear out of order. Likewise, multiple in-order occurrences are important.
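
    The four factors above can be sketched roughly as follows. This is an illustration under stated assumptions, not our production ranking code: the field weights, the individual formulas and every function name here are hypothetical, chosen only to show how position, density, proximity and ordinality might each contribute to a field score.

```python
import re
from typing import Dict, List

# Hypothetical field weights; the real values are not published here.
FIELD_WEIGHTS: Dict[str, float] = {"title": 3.0, "snippet": 1.0}


def tokenize(text: str) -> List[str]:
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def position_score(terms: List[str], tokens: List[str]) -> float:
    """Reward a search term in the first or last word position."""
    if not tokens:
        return 0.0
    score = 0.0
    if tokens[0] in terms:
        score += 1.0
    if tokens[-1] in terms:
        score += 0.5
    return score


def density_score(terms: List[str], tokens: List[str]) -> float:
    """Share of the field's words that are search terms."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in terms) / len(tokens)


def proximity_score(terms: List[str], tokens: List[str]) -> float:
    """Inverse of the smallest gap between two different search terms."""
    hits = [(i, t) for i, t in enumerate(tokens) if t in terms]
    best = None
    for (i, a), (j, b) in zip(hits, hits[1:]):
        if a != b and (best is None or j - i < best):
            best = j - i
    return 1.0 / best if best else 0.0


def ordinality_score(terms: List[str], tokens: List[str]) -> float:
    """Count how many times the terms occur in query order."""
    if not terms:
        return 0.0
    count, i = 0, 0
    for tok in tokens:
        if tok == terms[i]:
            i += 1
            if i == len(terms):
                count, i = count + 1, 0
    return float(count)


def score_result(query: str, result: Dict[str, str]) -> float:
    """Combine the four factors per field, then apply the field weights."""
    terms = tokenize(query)
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        tokens = tokenize(result.get(field, ""))
        total += weight * (
            position_score(terms, tokens)
            + density_score(terms, tokens)
            + proximity_score(terms, tokens)
            + ordinality_score(terms, tokens)
        )
    return total
```

    In this sketch, a result titled “Gene therapy” scores higher for the query “gene therapy” than one titled “Therapy outcomes in gene research”, because the terms fill the whole title, sit next to each other, and appear in query order.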

    (3) Proprietary Algorithms

    Once we’ve analyzed the exact search terms and stemmed search terms against the factors above and assigned weights, we use our proprietary algorithms to assign an actual rank. These algorithms operate on the Boolean operators AND, OR and NOT. The search query expression is evaluated from left to right. Exact phrases (contained within double quotation marks) are not stemmed.

    If a date range is specified, the date is used as a constraining term, provided that a date is supplied in a result. If a date is not supplied in a result, the relevance for that result is assumed to be zero (i.e. not ranked). Note that such results may still show in the results list.

    Finally, stop words are words considered irrelevant for searching purposes. We don’t evaluate them. The current list of stop words is: a, about, again, all, almost, also, although, always, among, an, another, any, are, as, at, be, because, been, before, being, between, both, but, by, can, could, did, do, does, done, due, during, each, either, enough, especially, etc, for, found, from, further, had, has, have, having, here, how, however, i, if, in, into, is, it, its, itself, just, made, mainly, make, may, might, most, mostly, must, nearly, neither, no, nor, obtained, of, often, on, our, overall, perhaps, quite, rather, really, regarding, seem, seen, several, should, show, showed, shown, shows, significantly, since, so, some, such, than, that, the, their, theirs, them, then, there, therefore, these, they, this, those, through, thus, to, upon, use, used, using, various, very, was, we, were, what, when, where, which, while, who, why, with, within, without, and would.
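
    As a rough illustration of the query handling described above, the sketch below reads a query from left to right, keeps quoted phrases intact (so they can be exempted from stemming), recognizes the Boolean operators, and drops stop words. The function parse_query and its output format are hypothetical, and STOP_WORDS holds only a small subset of the full list.

```python
import re
from typing import List, Tuple

# A small subset of the stop word list, for brevity.
STOP_WORDS = {"a", "about", "an", "for", "in", "of", "the", "to", "with"}
BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}


def parse_query(query: str) -> List[Tuple[str, str]]:
    """Split a query, left to right, into (kind, text) pairs.

    kind is "phrase" (quoted, never stemmed), "operator", or "term".
    Stop words outside quoted phrases are dropped.
    """
    parts: List[Tuple[str, str]] = []
    # Each match is either a double-quoted phrase or a single bare word.
    for phrase, word in re.findall(r'"([^"]+)"|(\S+)', query):
        if phrase:
            parts.append(("phrase", phrase.lower()))
        elif word in BOOLEAN_OPERATORS:
            parts.append(("operator", word))
        elif word.lower() not in STOP_WORDS:
            parts.append(("term", word.lower()))
    return parts
```

    For example, parse_query('"deep web" AND the search of mice') keeps the phrase “deep web” whole, keeps AND as an operator, and discards “the” and “of”. Only the remaining plain terms would then be candidates for stemming.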

    In Summary

    Deep Web Technologies uses a strong ranking algorithm that considers a number of factors and assigns relative weights to the relationship between the search terms and the results. To some extent, weights can be modified according to client preferences, and in all cases, ranking can add tremendous value to a federated (or deep web) search.

  • Search Builder: Create Your Own Federated Search Engine

    Deep Web Technologies recently released its newest product and federated search enhancement: Search Builder.  Search Builder allows customers to create individual, personalized federated search engines.

    Customers use Search Builder to build a tailored search page by selecting whatever collections and search fields they desire from those found in the federated search master application. Search pages can be organized to cater to specific departments, workgroups, academic courses or individual researchers. Not only can users quickly create their own federated search engine, they can also share it or incorporate it into their own web page or blog using an easily added snippet of code, known as a widget.

    In summary, Search Builder allows users to:

    • Add or remove collections at any time
    • Personalize federated search engines
    • Create new engines as often as needed
    • Make research more efficient by searching only important, relevant collections
    • Generate widgets for fast searching for more information

  • Strategic Uses for Federated Search, Part 3

    In late July, I wrote a blog article entitled Strategic Uses for Federated Search, Part 2, which in turn referenced a sponsored article I wrote for CIL Magazine, discussing how federated search will eventually become a must-have in intellectual property research and litigation.

    Today, I want to discuss an often overlooked feature in our federated search platform, alerts, and how they provide a strategic advantage to anyone who cares about a specific word, phrase or concept in their career.

    Examples How it works? How to use? Limited to certain sources or fields?
