I just got back from an exhausting but very enjoyable 5-day trip to the Bay Area where, as usual, I crammed in as many activities and meetings as possible.
I started out visiting a couple of colleges with my college-bound daughter, who is planning to major in Biology (I’m sure she’ll be taking some chemistry courses as well). Then I visited friends, customers and prospects in my old haunts (I lived in Silicon Valley for most of the ’80s and early ’90s). On Monday night one of my most senior employees drove 3 hours from Paradise (a small, lovely town with a very cool name in the foothills of the Sierras) to have dinner with me at Fisherman’s Wharf. We took a cable car (his first in 30 years) to get from downtown to the Wharf area.
Lest I forget to mention, I did manage to squeeze in an afternoon this past Tuesday (April 4, 2017) at the 253rd National Meeting of the American Chemical Society (ACS). Let me digress for a minute. Speaking at the 253rd ACS meeting got me curious as to when and where ACS held its first such meeting. So late Friday afternoon I recruited Grace, the Chemistry and Chemical Engineering Librarian at Stanford University, to help me answer this question. Although ACS was founded in 1876, the first of these twice-yearly meetings wasn’t held until August 6-7, 1890 in Newport, RI.
Back to my talk: I was invited to present the paper “Unique One Stop Access to a Multitude of Chemical Safety Resources” at a workshop put on by the Chemical Health and Safety (CHAS) Division of ACS. The paper summarized and demonstrated two gateways (a Stanford version and a publicly available version), developed by my company in close collaboration with Grace, that aggregate chemical safety information.
Please check out the public gateway at:
and do send me feedback through the blog on how we can improve the gateway.
Finally, as I have been reflecting on the work in chemical safety that we’ve done, it’s become clear that what we’ve built goes most of the way toward being a powerful resource for accelerating chemical research in general.
One of the more common questions that I get from prospects and customers alike is: why don’t we bring back all results from each of the sources that we federate? Just earlier this week, a librarian at one of our newest customers asked this question. I went back to our blog archive, dug up the wonderful article that Darcy wrote in 2015, Getting the Best Results vs. Getting all of the Results, and sent it on to our customer. I love it when I can answer a customer or prospect question by sending a link to a blog article that answers it.
So this afternoon I decided to expand a bit on Darcy’s original blog article.
In an effort at transparency, and to keep users informed of the status of a search, the user can open the Search Status popup, which displays the list of sources searched along with the number of results returned and the number of results found at the source (when the source provides this information). The Search Status popup is a link under the progress bar in the upper left-hand corner of the Results page; the text of the link indicates the count of all sources involved in the search, e.g., “54 of 54 sources complete.”
Viewing the Search Status popup, the user can see that, for a broad query such as “security,” the sources may collectively have several hundred to thousands of results available while we only retrieved up to the first 100 results. This naturally raises the question of why we can’t bring back all the results.
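For readers curious how such a status summary might be computed, here is a minimal sketch in Python. The `SourceStatus` structure and field names are hypothetical, invented for illustration; they are not taken from Explorit Everywhere!’s actual implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SourceStatus:
    name: str
    retrieved: int            # results actually brought back
    available: Optional[int]  # total hits reported by the source, if known

def summarize(statuses):
    """Aggregate per-source counts the way a Search Status popup might."""
    total_available = sum(
        s.available for s in statuses if s.available is not None
    )
    return {
        "sources": f"{len(statuses)} of {len(statuses)} sources complete",
        "retrieved": sum(s.retrieved for s in statuses),
        "available_at_sources": total_available,
    }

statuses = [
    SourceStatus("PubMed", 100, 213_186),
    SourceStatus("ScienceDirect", 100, 4_521),
    SourceStatus("AgencyWebsite", 37, None),  # source reports no totals
]
print(summarize(statuses))
```

Note that sources which do not report a total simply drop out of the “available” figure, which is why the popup only shows that number when the source provides it.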
So let us for a moment go directly to one of the more popular sources that we federate, PubMed, a very large database of 20 million medical articles (some full-text, but mostly just metadata).
Consider the following PubMed searches:
- “myocardial infarction” returns 213,186 results
- “myocardial infarction” AND aspirin returns 7,395 results
- “myocardial infarction” AND aspirin AND statins returns 542 results
Even with the most specific of the above queries, PubMed still returned 542 results, more than most users will review, and certainly more than we would like to return from a source. However, we could retrieve all 542 results if we wanted to.
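Counts like these can be reproduced against PubMed’s public E-utilities (ESearch) API. The sketch below builds a count-only query URL and parses the JSON response shape ESearch returns; the network call itself is omitted, and the sample response is canned for illustration.

```python
import json
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def count_url(term):
    """Build an ESearch URL that asks PubMed only for the hit count."""
    params = {"db": "pubmed", "term": term,
              "rettype": "count", "retmode": "json"}
    return EUTILS + "?" + urlencode(params)

def parse_count(body):
    """Pull the hit count out of an ESearch JSON response."""
    return int(json.loads(body)["esearchresult"]["count"])

print(count_url('"myocardial infarction" AND aspirin AND statins'))

# A canned response of the shape ESearch returns (the count is a string):
sample = '{"esearchresult": {"count": "542"}}'
print(parse_count(sample))
```

Fetching the generated URL with any HTTP client returns the live count, which will have grown since this article was written.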
The above example illustrates one of my main responses to the question of why we do not bring back all results: instead of focusing on Explorit Everywhere! bringing back more results, users should focus on making their queries more precise so that they get the most relevant results. It is not very useful to get all the results if they do not help the user find the answer they are looking for. A broad search like “myocardial infarction,” with its 213,186 results, is not as helpful as a more precise search like “myocardial infarction” AND aspirin AND statins, with its 542 results. With the more precise search, the user is more likely to find a relevant answer.
In conclusion, when users issue more precise queries, they will find that Explorit Everywhere! returns most or all of the available results at each source, with the results ranked using our secret sauce so the user can quickly and easily find what they were looking for across all available sources. For the case where more results are available at the source and the user needs to examine all results (perhaps they are doing some legal due diligence) then the user can go directly to the source and conduct the search there.
My Biznar alert on Discovery Services recently deposited in my inbox a link to this ProQuest blog article: A Guide to Evaluating Content Neutrality in Discovery Services. Although I have written about content neutrality before, most recently in the October 2015 blog article The Last of the Major Discovery Services is Independent No More, the ProQuest article prompted me to revisit the topic here.
The ProQuest blog article linked above, and quoted below, raises my main concern about the content neutrality of Discovery Services (EDS, Primo and Summon) that are owned by companies whose main business is selling content:
A concern that some libraries may have is that discovery service providers, that are also content providers, have an intent and vested interest to funnel usage to their content. With the success of online services often based on usage metrics and the fact that the content sales model is driven by the “revenue follows usage” mantra, librarians should well be concerned about content neutrality in discovery services from such dual providers.
Also, in this ProQuest blog article, the author says – “ProQuest and ExLibris reaffirms our commitment to content neutrality in our discovery systems.”
Nowhere, however, have I been able to find any ProQuest write-up that backs up this claim that their Discovery Services are, in fact, content neutral. As one of our former presidents, Ronald Reagan, was fond of saying: “trust, but verify.” Librarians should verify the content neutrality of their Discovery Services.
Here is one test that I would encourage readers who have purchased a Discovery Service, or have access to one, to perform: run 10 queries covering different subject areas and, for each query, record where each of the top 10 results comes from (EBSCO, ProQuest or another publisher). If a large percentage of your top 10 results in EDS are EBSCO results, or a large percentage of your top 10 results in Primo or Summon are ProQuest results, then you have a content neutrality problem. I’d love to see your findings as comments to this blog article.
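If you want to tally your findings, a few lines of Python will do it. The provider names and counts below are hypothetical example data, not real measurements from any Discovery Service.

```python
from collections import Counter

def provider_shares(top_results_per_query):
    """Given, for each query, the provider of each top-10 result,
    return each provider's share of all recorded results (percent)."""
    tally = Counter(p for results in top_results_per_query for p in results)
    total = sum(tally.values())
    return {provider: round(100 * n / total, 1)
            for provider, n in tally.items()}

# Hypothetical recordings for three of the ten suggested queries:
observed = [
    ["EBSCO"] * 7 + ["ProQuest", "Wiley", "Springer"],
    ["EBSCO"] * 6 + ["ProQuest"] * 2 + ["Elsevier"] * 2,
    ["EBSCO"] * 8 + ["Wiley"] * 2,
]
print(provider_shares(observed))
```

In this made-up example one provider accounts for 70% of the top results, which, if it happened with your real queries, would be exactly the kind of skew worth investigating.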
In its report Open Discovery Initiative: Promoting Transparency in Discovery, the NISO Working Group makes a number of recommendations to Discovery Service vendors and librarians to help them evaluate and ensure the content neutrality of a Discovery Service. These recommendations are summarized in ExLibris’ A Guide to Evaluating Content Neutrality in Discovery Systems.
These recommendations include:
- Non-discrimination among content providers in how results are generated and relevance ranked.
- Non-discrimination in how links to results are ordered in a result list or made available via a link resolver. A potential problem might be how duplicate records are treated by the Discovery Service.
- Provide libraries with options to configure how links are labelled and displayed and how links to metadata and full-text are provided.
In the ExLibris’ Guide, they state that “Content neutrality in a discovery system means that students and researchers are equally exposed to the entire wealth of information from all sources.”
As you might expect, though, the Discovery Services don’t address the fact that content neutrality is seriously compromised by the inability of their services to include in their indices ALL of the content that a library has at the disposal of its students.
So, in conclusion, if you want to ensure the content neutrality of your institution’s Discovery Solution you should seriously consider an Explorit Everywhere! solution.
An Explorit Everywhere! solution provides your users with access to all of the content sources your library has licensed, ranked using our own publisher neutral algorithms (see Ranking: The Secret Sauce for Searching the Deep Web), with the display priority of duplicate results configurable. You might also want to include our partner’s Gold Rush publisher neutral link resolver.
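To make the duplicate-handling point concrete, here is a minimal sketch of configurable duplicate resolution. The record fields and priority scheme are assumptions for illustration, not DWT’s actual algorithm.

```python
def dedupe(results, priority):
    """Keep one record per DOI (or, failing that, title) and prefer the
    source that appears earliest in the library-configured priority list."""
    rank = {src: i for i, src in enumerate(priority)}
    unranked = len(priority)
    best = {}
    for record in results:
        key = record.get("doi") or record["title"].lower()
        kept = best.get(key)
        if kept is None or (rank.get(record["source"], unranked)
                            < rank.get(kept["source"], unranked)):
            best[key] = record
    return list(best.values())

results = [
    {"title": "Aspirin after MI", "doi": "10.1/abc", "source": "Aggregator"},
    {"title": "Aspirin after MI", "doi": "10.1/abc", "source": "Publisher"},
    {"title": "Statin guidelines", "doi": None, "source": "Aggregator"},
]
print(dedupe(results, priority=["Publisher", "Aggregator"]))
```

A library that prefers publisher records over aggregator copies simply lists the publisher first in the priority list; the same mechanism supports the opposite preference with no code change.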
In a January Sneak Peek blog article, Darcy gave us a preview of some of what my engineers were working on. Now I am excited to present faceted navigation, one of the coolest features ever added to Explorit Everywhere!
In an excerpt from Peter Morville’s Search Patterns (a classic on designing effective search focused User Interfaces published in 2010), Morville quotes Professor Marti Hearst (from UC Berkeley) as saying,
“Faceted Navigation is arguably the most significant search innovation of the past decade.”
Morville describes faceted navigation simply: “It features an integrated, incremental search and browse experience that lets users begin with a classic keyword search and then scan a list of results” (p. 95).
Our faceted navigation, combined with our clustering technology, offers the researcher a more refined approach to zooming in on the most relevant results from their search. When reviewing the cluster facets, which show terms related to the search query, the researcher can narrow their results by selecting a Topic. Once a Topic is selected, the clusters are refreshed using the results associated with that Topic, and the researcher is presented with new facets related only to the selected Topic. This cuts out the noise and allows the user to review very specific results.
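The narrow-and-refresh cycle just described can be sketched in a few lines. This is a toy model with hypothetical data structures, far simpler than Explorit Everywhere!’s actual clustering technology, but it shows the mechanics: filter the result set by the selected facet, then rebuild the facet counts from the narrowed set.

```python
from collections import Counter

def build_facets(results):
    """Count topic occurrences across the current result set."""
    return Counter(topic for r in results for topic in r.get("topics", []))

def narrow(results, selected_topic):
    """Keep only results tagged with the selected topic, then
    regenerate facets from the narrowed set."""
    subset = [r for r in results if selected_topic in r.get("topics", [])]
    return subset, build_facets(subset)

results = [
    {"title": "A", "topics": ["Art", "Sistine Chapel"]},
    {"title": "B", "topics": ["Sistine Chapel", "Ceiling Frescos"]},
    {"title": "C", "topics": ["Design"]},
]
subset, facets = narrow(results, "Sistine Chapel")
print(len(subset), facets)
```

After narrowing, facets that belonged only to excluded results (here, “Design”) disappear, which is exactly the noise-cutting effect described above.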
Let’s now take a look at how faceted navigation works on one of our customer solutions at the University of the Arts, London.
We will start with a search for “Michelangelo,” which returns 2,785 results (see Figure 1 above). In the Topics list of clusters, we can see several related topics: Art, Artist, Design, Sistine Chapel, David, Analysis, and so forth. These topics were derived from the metadata of the results returned from the 50 databases searched simultaneously.
By selecting the topic facet Sistine Chapel, the cluster facets are re-generated using the 81 results for that topic (see Figure 2). With this new view of the selected results, we now see more specific topics related primarily to Michelangelo’s Sistine Chapel. While the topic of Ceiling Frescos looks interesting, I am curious about the images under the Document Type facet.
As researchers explore their results, our faceted navigation generates “bread crumbs” that record the drill-down steps taken. In Figure 3, we see the trail of selections we have made so far. Clicking on > Sistine Chapel lets me step back up, and then step down into Ceiling Frescos when I want to. See Figure 4 below for some of the interesting images I found of Michelangelo’s Sistine Chapel.
I and some of my staff have had the pleasure to work closely with Grace Baysinger, Head Librarian and Bibliographer of the Swain Chemistry and Chemical Engineering Library at Stanford University, to develop a unique research gateway focused on chemical safety.
My relationship with Grace goes back two decades, to when I developed SciSearch@LANL, a precursor to Web of Science, for Los Alamos National Laboratory, and Grace was our customer representative at Stanford.
More recently we have worked closely with Grace on the development of xSearch (Stanford’s name for Explorit Everywhere!), our largest federated search implementation.
I have asked Grace to give us an overview of this important chemical safety resource that we have developed together.
While chemists are among the most intensive users of information, many are unfamiliar with the chemical safety resources they should consult before working in the lab. Chemists consulting material safety data sheets or safety data sheets (MSDS/SDS) often discover “NA” (not available) entries for the physical properties they need for their lab work.
Grace’s goal in working with Deep Web Technologies was to develop a research gateway that provides access to a wide collection of information sources focused on chemical safety. This gateway uses federated search technology, the ability to search multiple sources at one time, which helps users find the information they need more effectively and efficiently. Users can view results visually, move to a particular resource in the search results, and set up an alert to be notified when new information is published on a topic. Common search terms include chemical names, CAS Registry Numbers, and topic keywords. If a resource contains InChI or SMILES values for a chemical substance, these may be used as search terms too.
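As an aside for the technically inclined: CAS Registry Numbers carry a built-in check digit, so a gateway (or a user) can catch a mistyped number before running a search. The standard validation, a weighted digit sum modulo 10, takes only a few lines; the example numbers below are the well-known CAS numbers for water and aspirin.

```python
import re

def valid_cas(cas):
    """Check a CAS Registry Number's check digit, e.g. '7732-18-5' (water).
    Weights count up from the rightmost digit before the check digit."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas)
    if not match:
        return False
    digits = (match.group(1) + match.group(2))[::-1]
    checksum = sum((i + 1) * int(d) for i, d in enumerate(digits)) % 10
    return checksum == int(match.group(3))

print(valid_cas("7732-18-5"))  # water
print(valid_cas("50-78-2"))    # aspirin
print(valid_cas("7732-18-4"))  # wrong check digit
```

A checksum pass does not guarantee the number is assigned to a real substance, of course; it only catches typos.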
Moving soon from prototype to production, the Stanford University version of the chem safety gateway will be a collaborative effort between the Stanford University Libraries and Stanford Environmental Health and Safety. This gateway has 60+ information sources that include SDS/MSDS, safety data, syntheses and reactions databases, citation databases, full-text eBooks and eJournals, plus a number of Environmental Health & Safety (EH&S) websites. While the SDS/MSDS and safety data resources form the core of this collection, curated databases such as Organic Syntheses, Organic Reactions, Science of Synthesis, Merck Index, Reaxys, and e-EROS (Encyclopedia of Reagents for Organic Synthesis) include protocols and safety information useful to bench chemists. eBooks and eJournals are full-text searchable, allowing researchers to find property and safety information in handbooks, and methods and protocols in journal articles. EH&S websites from selected universities, plus websites for the ACS Committee on Chemical Safety, the ACS Division of Chemical Health and Safety, and the U.S. Chemical Safety Board, will help users discover information such as training materials, standard operating procedures, and lessons learned. Search results for a chemical name search also include the “Chemical Box” from Wikipedia in the right column.
At the ACS National Spring 2016 Meeting held in San Diego, Grace and colleagues from Stanford’s EH&S Unit gave a presentation on Using a chemical inventory system to optimize safe laboratory research in a Division of Chemical Health and Safety symposium. The first part of this presentation covers ChemTracker and the latter part (starting on slide 23) shows screen shots of the Stanford Chem Safety Gateway. Slide 24 has a list of the resources being searched in the Stanford gateway. For a current list of resources, please see Grace’s recent blog entry, Chemical safety resource gateway available.
Grace then helped the DWT team develop a public version of the Chem Safety Gateway that is available to test-drive at:
This public site searches a subset of the sources searched at the Stanford site, as DWT is not able to search subscription sources through its public site.
Please test-drive the public version of the Explorit Everywhere! Chem Safety Gateway and give Abe feedback as to how useful it is to be able to search a broad set of chemical safety resources at the same time. Be sure to register (not required to search) to use the Alerts and MyLibrary features. Did you find that the gateway returned relevant results? What sources (subscription or public) would you add to make this gateway even better? Do you have any other suggestions for improving the gateway?
Please email your feedback to firstname.lastname@example.org
A couple of months ago I came across the claim that we are generating 2.5 billion GB of new data every day and thought that I should write a fun little blog article about this claim. Here it is.
This claim, repeated by many, is attributed to IBM’s 2013 Annual Report. In this report, IBM claims that in 2012, 2.5 billion GB of data was generated every day, of which 80% was unstructured, including audio, video, sensor data and social media as some of the newer contributions to this deluge of data. IBM also claims in this report that by 2015, 1 trillion connected objects and devices would be generating data across our planet.
So how big is a billion GB? A billion GB is an Exabyte (a 1 followed by 18 zeros), i.e., 1000 petabytes or 1,000,000 terabytes.
My research took me to this article in Scientific American, “What is the Memory Capacity of the Human Brain?”, which helped me put all the huge numbers I’ve been throwing around into context. Professor of Psychology Paul Reber estimates that:
The human brain consists of about one billion neurons. Each neuron forms about 1,000 connections to other neurons, amounting to more than a trillion connections. If each neuron could only help store a single memory, running out of space would be a problem. You might have only a few gigabytes of storage space, similar to the space in an iPod or a USB flash drive. Yet neurons combine so that each one helps with many memories at a time, exponentially increasing the brain’s memory storage capacity to something closer to around 2.5 petabytes (or a million gigabytes).
So if I’m doing my math right, the 2.5 billion GB of information generated daily could be stored in 1,000 human brains, and humanity’s collective memory capacity is still way higher than our electronic storage capacity.
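A quick sanity check of that arithmetic, using the figures quoted above:

```python
GB = 10**9
PB = 10**15
EB = 10**18

daily_data = 2.5e9 * GB    # IBM's figure: 2.5 billion GB per day
brain = 2.5 * PB           # Reber's estimate of one brain's capacity

print(daily_data / EB)     # exabytes generated per day
print(daily_data / brain)  # brains needed to store one day of data
```

So one day of the world’s data is 2.5 exabytes, or about a thousand Reber-sized brains.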
My research then took me to this interesting blog article from 2015 – Surprising Facts and Stats About The Big Data Industry. Some of the facts and stats that I found most interesting in their infographic include:
- Google is the largest ‘big data’ company in the world, processing 3.5 billion requests per day, storing 10 Exabytes of data.
- Amazon hosts the most servers of any company, estimated at 1,400,000 servers with Google and Microsoft close behind.
- Amazon Web Services (AWS) is used by 60,000 companies and fields more than 650,000 requests every second. It is estimated that 1/3 of all Internet users visit a website hosted on AWS daily, and that 1% of all Internet traffic goes through Amazon.
- Facebook collects 500 terabytes of data daily, including 2.5 billion pieces of content, 2.7 billion likes and 300 million photos.
- 90% of all the data in the world was produced in the last 2 years.
- It is estimated that 40 Zettabytes (40,000 Exabytes) of data will be created by 2020.
Another interesting infographic, on how much data was generated every minute in 2014 by some of our favorite web applications, is available at Data Never Sleeps 2.0.
I would be remiss if I didn’t tie my blog article to what we do at Deep Web Technologies. So please take a look at our marketing piece, Take on Big Data & Web Sources with Deep Web Technologies. We’d love to hear from you and explore how we can feed content and data from a myriad of disparate sources to your big data analytics engine on the back end, as well as how we can enhance the insights derived by your big data solutions by providing real-time access to content that complements those insights.
Every morning I wake up to a number of Alerts generated by our portals, including Biznar, Mednar and Science.gov. Yesterday morning one alert, titled “Is Google good enough for Medicine,” caught my attention.
In their editorial commentary in the Journal of Neurology, Neurosurgery and Psychiatry, three medical professionals from Down Under discuss how Google (now a verb in the Oxford English Dictionary) is changing the way that doctors practice medicine.
Here’s one anecdote that Dr. Cindy Shin-Yi Lin and her colleagues relate that I found interesting (and scary):
“In a recent letter, a rheumatologist describes a scene at rounds where a professor asked the presenting fellow to explain how he arrived at his diagnosis, ‘I entered the salient features into Google, and [the diagnosis] popped right up’.”
The authors of the editorial also talk about:
“Most clinicians will be familiar with the increasingly frequent scenario of a patient entering the consult room with a sizeable stack of printed webpages containing symptoms, pictures and a dreaded list of potential (and often grave) diagnoses that will undoubtedly commit the clinician to an arduous task of analyzing (and not infrequently, refuting) this information with the ‘cyberchondriac’ patient.”
If this topic interests you, check out the blog article we published last year, Relying on Google for Science Information is Bad for your Health. And if you want to bring more authoritative stacks of paper to your physician on your next visit, try out our freely available medical research site, Mednar, which searches 40+ sources of quality medical information all at the same time.
Oh, and I love that mug!!
I just came across this Fall 2014 survey conducted by one of our competitors who shall remain nameless (at least until you click on the link at the end of this blog post) which I found interesting and wanted to share with our readers.
Our competitor surveyed members of the SLA (Special Libraries Association), mostly members of its PHT (Pharmaceutical & Health Technology) Division, on whether they use federated search today and, if not, whether they would find federated search useful and what features such a federated search solution would need to have.
Question 2 of the 6-question survey asked: “Does your information center provide a ‘federated search’ function that allows users to search *all* of your organization’s online content resources with a single query?” 81% of the respondents said “No.”
The limited adoption of federated search in pharmaceutical companies, even though these companies have a wealth of content to access (and accessing this content is not so easy), seems to stem, at least to a large extent, from a lack of IT cooperation with the Knowledge Management or Information Services group.
Marydee Ojala, Editor-in-Chief of Online Searcher and one of my favorite people at Information points out that:
“Searchability of electronic resources has long been piecemeal, but a federated search solution must take into account the IT infrastructure already in place.”
I have certainly been on a number of prospect calls, including IT, where there was a lack of understanding of the power of federated search and a reluctance to add yet another tool to the set of tools and software that IT needs to support. What I propose to many prospects is that we start with a solution hosted by us in the cloud, minimizing the involvement of IT. Once the value of our service is proven, if the customer then wants to add internal content to their Explorit Everywhere! subscription alongside the public content we search from the cloud, we can move our solution to servers sitting behind the customer’s firewall.
When asked in question 5 of the survey what respondents thought were the main drawbacks of not having federated search, the answers, which I was very happy to see, included:
- 87.50%: Time spent looking in multiple places for information
- 71.88%: Missed information / opportunities due to “inexpert searching”
- 68.75%: Reduced usage of online information sources
- 65.63%: Over-reliance on search engines as primary research tool
All in all, this is a very interesting survey. The survey results and analysis are available starting on page 18 of the Fall 2014 CapLits newsletter.
A couple of weeks ago I woke up to an email from one of our partners in Europe asking if we could federate the enclosed list of sources for one of their prospects. Before I had a chance to respond, my partner followed up with a second email saying that he had forgotten to include the prospect’s EBSCO Discovery Service as one of the sources for us to federate. As I reviewed the list of sources we would need to federate for this prospect, I found that a couple of them were Ex-Libris Primo Discovery Services.
What a great example of potential co-opetition, or “cooperating with one’s competitors”. In this case, co-opetition with EBSCO and Ex-Libris (now part of ProQuest) to build a comprehensive solution for a customer that provides one-stop access to content from 3 different Discovery Services as well as some additional sources, something that neither EBSCO nor Ex-Libris could do. This use case gives new meaning to my earlier blog post on Federating the Un-Federatable.
Taking a look at the major Discovery Services, we find that Summon has always been a pure Discovery Service, choosing not to complement its index with federated search (even though they acquired two federated search companies, WebFeat and Serials Solutions). EDS and Primo have been hybrid services, enabling federation of sources not available in their indices, although not doing it well. We’ve seen both EDS and Primo de-emphasize federated search in their Discovery Services, perhaps because federated search is not so easy to do well if it is not your product’s primary focus. OCLC’s WorldCat Discovery Service does not incorporate federation at all.
So this opportunity to build a solution for this project that federates 3 Discovery Services and some other sources has energized me, David, to reach out once again to the Discovery Services Goliaths. I have had numerous conversations with customers and prospects where I have heard repeatedly that important content is missing from their Discovery Service. I want to see if now is a good time to do some co-opetition that is a win-win for everyone, especially for the user who wants one-stop access to all of the content that they need and doesn’t care how that content is aggregated.
Many mainstream libraries have a standard list of sources they search: EBSCO, ProQuest, ScienceDirect, PubMed, and others. Including the usual “suspects” in a single search isn’t terribly hard for most discovery solutions (federated search or discovery services) unless they are competing information vendors who do not want to play nice – see this post: The Last of the Major Discovery Services is Independent No More.
But, once information journeys down the road less traveled, sources are less likely to be included in a discovery solution. The information may be considered more valuable by content owners and so they are reluctant to include it in an indexed service. If the content is specialized, interest in the source may be limited to only a small group of customers, excluding the source from discovery services and most federated search vendors where broad appeal prevails. Some sources also may be technically challenging to include in a discovery solution.
These sources are considered un-indexable and un-federatable.
Fortunately, Explorit Everywhere! specializes in un-federatable sources. We connect to databases that most discovery service vendors steer clear of, and most federated search services are simply unable to touch with their lightweight connector technology. The robust connectors included in Explorit Everywhere!, however, can tackle the most ornery sources for special libraries and research organizations.
For example, one of our newest customers chose DWT as their discovery solution vendor because they were mired in this un-federatable dilemma. Their previous discovery solution vendors were not able to include quite a number of their important subscription information sources, leaving researchers to inefficiently spend time searching those sources separately. Once our prospect tested the proof-of-concept built by DWT, they realized that DWT could federate their un-federatable sources and chose Explorit Everywhere! as their discovery solution.
This particular customer, specializing in military and defense research, included challenging sources like these two:
- JANES – IHS Jane’s International Defence Review, specializing in defense and security.
- PERISCOPE – Includes open-source global defense information.
Really, what’s the point of having a single search technology if only half of your sources can be included? Explorit Everywhere! federates the unfederatable. Not only do our connector developers consider it a challenge to build robust connectors to these specialized sources, but DWT’s focus has always been to connect our customers to their information, wherever it may be. New customers requiring “unfederatable” sources are pleasantly surprised to find that not only are we willing to build connectors to their difficult sources, we don’t charge extra to do so.
Interested in hearing more on how we connect to your “un-federatable” information sources? Email us, or find us at one of our upcoming conferences.