Numerous myths and misunderstandings exist about federated search. For example, how does real-time search differ from indexed search, and why is this important? At Deep Web Technologies, we welcome your questions, and we feel confident that once you recognize the value of the technology and understand our offerings, you’ll choose Deep Web Technologies.
We invite you to read the questions and answers below and to contact us with questions we didn’t cover.
Federated Search and My Business
How does federated search fit in with enterprise search?
How does federated search improve my bottom line?
How do you connect me to my knowledge?
The Deep Web is the set of web sites and their documents that cannot be accessed via crawler-type search engines such as Google. Deep web content typically lives inside of databases, and is accessed through search forms. Wikipedia has a good article about the Deep Web.
Federated search is a powerful way to comprehensively search multiple databases in real time. Instead of crawling and indexing static content as Google and the other popular search engines do, Explorit federated search queries a select set of high-quality collections simultaneously. While this usually takes a few seconds longer, it ensures a superior level of results. For instance, federated search helps researchers avoid outdated articles and spam, so they can explore only the most pertinent information. Federated search also makes it possible to search private and other collections that can’t be indexed (this is more common than you might imagine). Wikipedia has a good article about federated search.
Federated search engines use software “connectors” to access information sources. The federated search engine takes the user’s search query, transforms the search terms to match each content source’s requirements, and submits the query to each of the sources simultaneously. When the search results come back from each of the sources, the federated search engine merges them together into a single set of results pages with a unified look and feel.
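The flow described above (translate the query for each source, submit to all sources at once, merge the results) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Explorit's implementation: the two in-memory sources, their "syntax" rules, and the function names are all hypothetical stand-ins for real connectors.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory sources; a real connector would call a remote search form or API.
SOURCES = {
    "source_a": {"syntax": "upper", "docs": ["SOLAR CELLS REVIEW", "WIND POWER"]},
    "source_b": {"syntax": "lower", "docs": ["solar cells efficiency", "battery storage"]},
}


def translate_query(query, syntax):
    """Rewrite the user's query into the form a given source expects."""
    return query.upper() if syntax == "upper" else query.lower()


def search_source(name, query):
    """Search one source and tag each hit with the source it came from."""
    source = SOURCES[name]
    native_query = translate_query(query, source["syntax"])
    return [{"source": name, "title": doc} for doc in source["docs"] if native_query in doc]


def federated_search(query):
    """Submit the query to every source concurrently and merge the hits into one list."""
    with ThreadPoolExecutor() as pool:
        per_source = list(pool.map(lambda name: search_source(name, query), SOURCES))
    return [hit for hits in per_source for hit in hits]


results = federated_search("Solar Cells")
for hit in results:
    print(hit["source"], "-", hit["title"])
```

Each source receives the query in its own form, and the caller sees a single merged list in which every result is labeled with its source.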
Although federated search technically refers to the simultaneous search of multiple content sources regardless of how the content is accessed, the reality is that federated search is often performed on deep web content sources.
There is a large amount of content that is not available to crawl-type search engines like Google. Federated search engines, in particular ones that perform deep web searches, are required to access this additional content.
There are many scientific, technical, and business databases whose contents are not available to Google. Many but not all of these are subscription databases.
As users demand applications that can search more sources of content from fewer search pages, it will only be a matter of time before the distinction between federated search and enterprise search disappears. Deep Web Technologies’ Explorit Research Accelerator federated search application can be configured with custom connectors to search a number of repositories normally accessed via enterprise search, providing seamless access to more content than federated search or enterprise search alone provides.
Federated search solutions reduce the time it takes to find relevant information, decrease the chance of missing relevant content, and improve the utilization of paid content. Additionally, researchers no longer need to spend time learning the quirks of numerous search interfaces. The time saved by not needing to search multiple sources, plus the improved quality of the documents found, translates to labor and cost savings.
Federated search connects directly to your knowledge sources, whether they are subscription, internal or public. Instead of you going to each source to search it, federated search uses connectors that run the search on the source, retrieve the results and return them to you. In this manner, searching ten, twenty or more of your knowledge sources in real time is simply a matter of typing in your query. You are assured that the information returned is the same information that is on the source.
A connector is a piece of software that is written to access a content source. A connector must know the URL of the source, how to send search commands, what the search syntax is, and how to process the search results that are returned from a source. Connectors can be challenging to write if access to a source requires handling multiple steps, URL redirection, cookies, sessions, or authentication methods.
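To make the four things a connector must know more concrete, here is a minimal Python sketch. It is an assumption-laden illustration rather than the Explorit connector format: the `Connector` class, its field names, the `example.org` URL and the line-per-result response format are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List
from urllib.parse import urlencode


@dataclass
class Connector:
    """Hypothetical description of one content source."""
    base_url: str                               # URL of the source's search endpoint
    query_param: str                            # how the search command is sent
    format_query: Callable[[str], str]          # the source's search syntax
    parse_results: Callable[[str], List[str]]   # how to process the returned results

    def build_request_url(self, user_query: str) -> str:
        """Build the URL that submits the user's query to this source."""
        params = {self.query_param: self.format_query(user_query)}
        return f"{self.base_url}?{urlencode(params)}"


# Example source that expects terms joined with AND and returns one title per line.
example = Connector(
    base_url="https://example.org/search",
    query_param="q",
    format_query=lambda q: " AND ".join(q.split()),
    parse_results=lambda body: [line.strip() for line in body.splitlines() if line.strip()],
)

request_url = example.build_request_url("deep web")
titles = example.parse_results("Title one\n\nTitle two\n")
print(request_url)
print(titles)
```

A real connector would add the harder parts mentioned above, such as multi-step access, redirects, cookies, sessions and authentication.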
Duplicate results from multiple sources should be removed to improve the user search experience. Deep Web Technologies’ Explorit Research Accelerator application is flexible in its approach to de-duplication. Results with identical URLs can be considered duplicates, as can results that have the same title and author. The configuration of the de-duplication algorithm should be customized to the particular databases being searched, based on how duplicates manifest themselves in the results, i.e., which fields are being duplicated. No one solution fits all deployments.
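The idea of configurable de-duplication can be sketched as a function that takes the list of fields defining a duplicate. This is a simplified illustration, not the Explorit algorithm; the `dedupe` and `normalize` helpers and the sample records are hypothetical, and the normalization (lowercasing and collapsing whitespace) is an assumption.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences don't hide duplicates."""
    return " ".join(text.lower().split())


def dedupe(results, key_fields):
    """Keep the first result for each distinct combination of key_fields."""
    seen = set()
    unique = []
    for result in results:
        key = tuple(normalize(result.get(field, "")) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(result)
    return unique


results = [
    {"url": "https://example.org/a", "title": "Deep Web Search", "author": "Lee"},
    {"url": "https://example.org/a", "title": "Deep Web Search (PDF)", "author": "Lee"},
    {"url": "https://mirror.example.net/a", "title": "deep web  search", "author": "LEE"},
]

by_url = dedupe(results, ["url"])                        # drops the second record
by_title_author = dedupe(results, ["title", "author"])   # drops the third record
print(len(by_url), len(by_title_author))
```

The two configurations remove different records from the same input, which is why the choice of fields needs to match how duplicates actually appear in the databases being searched.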
Because Explorit searches in real time, it depends on the speed of the knowledge sources to complete user queries. Sometimes a search can take 10, 20 or even 30 seconds to complete. Explorit federated search speeds this up with “incremental search,” displaying the results from the fastest sources immediately while the search continues in the background on the slower sources. In this way, real-time quality results are not sacrificed for speed.
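The general pattern behind incremental display, handing back each source's results as soon as that source answers, can be sketched with Python's standard `concurrent.futures` module. This is an illustration only, not how Explorit is built: the source names and the simulated delays are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical sources with simulated response times in seconds.
SOURCE_DELAYS = {"slow_source": 0.3, "fast_source": 0.01, "medium_source": 0.1}


def query_source(name, delay):
    """Pretend to search one source; the sleep stands in for network and search time."""
    time.sleep(delay)
    return name, [f"{name} result"]


def incremental_search():
    """Yield each source's results as soon as that source has answered."""
    with ThreadPoolExecutor(max_workers=len(SOURCE_DELAYS)) as pool:
        futures = [pool.submit(query_source, name, delay)
                   for name, delay in SOURCE_DELAYS.items()]
        for future in as_completed(futures):
            yield future.result()


arrival_order = []
for name, hits in incremental_search():
    arrival_order.append(name)
    print(f"{name} returned {len(hits)} result(s)")
```

The fast source's results are available to display after about a hundredth of a second, even though the slowest source is still working.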
Deep Web Technologies has built tremendous expertise in developing connectors to content that requires cookies, sessions, username/password authentication, and IP-based authentication. Our connectors perform the authentication steps just as they occur when a user accesses an authenticated database using a browser.
To answer this question, it’s important to keep in mind that Explorit is a federated search solution that searches other collections in real-time. These other collections are operated by other companies and organizations, and are all very different in their age, value, capability, speed and overall performance.
Explorit is a rare breed of federated search, in that it won’t make you wait forever while results are being compiled from all the collections being searched. Imagine if it did and one or more collections were “offline”: you could wait a couple of minutes (or more) before seeing any results.
To speed up the search process, Explorit will display results immediately from those collections that provide results immediately. Then, when the slower collections have provided their results, Explorit will ask you whether you want to incorporate them into your search results.
The Collection Status list will display all collections that are searched in a particular query. There are two numbers in this list: “Results” and “Totals.” “Results” indicates the number of relevant results retrieved from each collection for your search, while “Totals” represents the number of results that each collection indicates exist in its database. Not all collections provide totals. Using the Collection Status feature can help users discover which individual sources with large numbers of total results should be searched further, and eliminate sources that don’t have many results applicable to a query.
There are some collections that may have tens or even hundreds of thousands of results displayed under the “Totals” column. Explorit usually limits the number of results it retrieves from each collection to 200 or fewer, depending on the collection. This is done for a number of performance reasons. If your Explorit application pulled in all of the results from all of the collections, it could potentially have millions of results to pull in, sort, de-duplicate, rank and display for each and every search performed. That would take a long time, and the additional results are rarely useful.
Instead, the information in “Collection Status” should be used as a discovery tool to understand which sources are relevant to your research.
Beyond the ability of simple sorting to organize search results by author, date or publication, clustering allows users to organize search results by topic. Our smart clustering software is able to organize the results into topics and subtopics at the time of search, allowing researchers to drill down into details of a topic in an intuitive way.
A Deep Web Technologies Explorit (TM) application has sophisticated support for mapping user search fields to fields supported by the remote source. The customer can decide which field or fields to search on the remote source, on a per-source basis, in a number of flexible ways. Fields not available on the remote source can be ignored, or the search engine can search different fields instead.
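A per-source field mapping of this kind can be pictured as a small lookup table. The sketch below is a hypothetical illustration, not the Explorit configuration format: the source names, the remote field names (`ti`, `au`, `headline`, and so on) and the `map_fields` helper are all invented for the example.

```python
# Hypothetical per-source mappings from user-facing fields to remote fields.
# None means the source has no equivalent field; a list with several entries
# means the term is searched in different fields instead.
FIELD_MAPS = {
    "source_a": {"title": ["ti"], "author": ["au"], "abstract": ["ab"]},
    "source_b": {"title": ["headline"], "author": None, "abstract": ["headline", "body"]},
}


def map_fields(source, user_fields):
    """Translate {user_field: term} into {remote_field: [terms]} for one source."""
    mapping = FIELD_MAPS[source]
    remote = {}
    for field, term in user_fields.items():
        targets = mapping.get(field)
        if not targets:
            continue  # field not available on this source, so ignore it
        for target in targets:
            remote.setdefault(target, []).append(term)
    return remote


query_a = map_fields("source_a", {"title": "fusion", "author": "Kim"})
query_b = map_fields("source_b", {"author": "Kim", "abstract": "plasma"})
print(query_a)
print(query_b)
```

For `source_b`, the author term is dropped because that source has no author field, and the abstract term is redirected to the two fields the source does support.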
Deep Web Technologies has developed a number of sophisticated relevance ranking algorithms. Our algorithms take into account the document source, the frequency of user search terms in various fields of the search results, and other factors. We compare search results from different sources against one another to determine ranks of individual results against the result sets. Users find our relevance ranking to be quite good, often better than that provided by the content provider.
By default, results are displayed in a relevance-ranked results list, with the highest-ranking results on top. Several factors contribute to a higher ranking, including the length of the title, the occurrence of the search term within the title and snippet, and the frequency of occurrence. If a collection presents highly relevant results for a query, Explorit will automatically retrieve additional results from that collection instead of only returning the first page of its results. Because native collection relevance ranking engines vary, Explorit’s sought-after ranking approach normalizes results from multiple collections for a consistent, relevance-ranked results list.
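As a rough illustration of ranking results from different collections on one common scale, the Python sketch below scores every result by how often the search terms occur in its title and snippet. It is a toy example, not the Explorit ranking algorithm: the weights, the `score` and `rank` functions and the sample results are assumptions, and title length and the other factors mentioned above are left out.

```python
import re


def tokens(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def score(result, query, title_weight=3.0, snippet_weight=1.0):
    """Weighted count of query-term occurrences in the title and snippet (assumed weights)."""
    terms = tokens(query)
    title = tokens(result["title"])
    snippet = tokens(result["snippet"])
    title_hits = sum(title.count(term) for term in terms)
    snippet_hits = sum(snippet.count(term) for term in terms)
    return title_weight * title_hits + snippet_weight * snippet_hits


def rank(results, query):
    """Order results from all sources by the same score, highest first."""
    return sorted(results, key=lambda r: score(r, query), reverse=True)


results = [
    {"source": "a", "title": "Energy policy overview", "snippet": "Covers solar and wind."},
    {"source": "b", "title": "Solar panel efficiency", "snippet": "Solar cells and solar modules compared."},
    {"source": "a", "title": "Wind turbines", "snippet": "No match here."},
]

ranked = rank(results, "solar")
for result in ranked:
    print(score(result, "solar"), result["title"])
```

Because every result is scored by the same function regardless of which collection returned it, the merged list is ordered consistently even when the collections' own ranking engines differ.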
Deep Web Technologies provides flexible hosting and maintenance options. For some customers we host their applications in our state-of-the-art data center, providing backups as well as application, operating system, hardware, and connector monitoring and support. With a hybrid installation, customers provide the hardware and operating system and we install, monitor and maintain the application on their behalf. Other customers license the application and host and maintain it themselves.
Alerts are simply search terms that you have defined; they run on a schedule, and the results are delivered directly to your email inbox or RSS feed. Explorit alerts can be set up for daily, weekly or monthly delivery and always deliver the freshest results from your knowledge sources, regardless of whether your sources have an alerts system of their own.