One of the more common questions that I get from prospects and customers alike is why don’t we bring back all results from each of the sources that we federate? Just earlier in the week one of the librarians at one of our newest customers asked this question. I went back to our blog archive and dug up this wonderful blog article that Darcy wrote in 2015 – Getting the Best Results vs. Getting all of the Results and sent it on to our customer. I love it when I can answer a customer or prospect question by sending them a link to a blog article that answers their question.
So this afternoon I decided to expand a bit on Darcy’s original blog article.
In an effort to be transparent and to keep our users informed of the status of a search, we provide a Search Status popup that displays the list of sources searched, along with the number of results returned and the number of results found at each source (when the source provides this information). The Search Status popup is a link under the progress bar in the upper left-hand corner of the Results page; the text of the link indicates the count of all sources involved in the search, e.g., “54 of 54 sources complete.”
Viewing the Search Status popup for a broad query, e.g., security, the user can see that collectively the sources may have several hundred to thousands of results available, while we retrieved only up to the first 100 results. This naturally raises the question of why we can’t bring back all the results.
So let us for a moment go directly to one of the more popular sources that we federate — PubMed, a very large database of 20 million medical articles (some full-text but mostly just meta-data).
Consider the following PubMed searches:
“myocardial infarction” — returns 213,186 results
“myocardial infarction” AND aspirin — returns 7,395 results
“myocardial infarction” AND aspirin AND statins — returns 542 results
Even with the most specific of the above queries, PubMed still returned 542 results, more results than most users will review, and certainly more than we would like to return from a source. However, we could retrieve all 542 results if we wanted to.
The above example illustrates one of my main responses to the question of why we do not bring back all results: rather than focusing on Explorit Everywhere! bringing back more results, users should focus on making their queries more precise so that they get the most relevant results. It is not very useful to get all the results if they do not help the user find the answer they are looking for. A broad search like “myocardial infarction”, with its 213,186 results, is not as helpful as a more precise search like “myocardial infarction” AND aspirin AND statins, with its 542 results. In the more precise search, the user is more likely to find a relevant answer.
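The narrowing effect of each added term can also be checked programmatically. Below is a minimal sketch (not part of Explorit Everywhere!) that uses NCBI’s public E-utilities esearch endpoint, which reports the total hit count for a query; live counts will have grown since the figures above were captured, so the sample response shown is illustrative.

```python
import json
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(term: str) -> str:
    """Build an E-utilities esearch URL that asks PubMed for a hit count.
    retmax=0 requests only the count, not the records themselves."""
    return ESEARCH + "?" + urlencode(
        {"db": "pubmed", "term": term, "retmode": "json", "retmax": 0}
    )

def parse_count(raw_json: str) -> int:
    """Extract the total hit count from an esearch JSON response."""
    return int(json.loads(raw_json)["esearchresult"]["count"])

# Each added term narrows the query:
for q in ('"myocardial infarction"',
          '"myocardial infarction" AND aspirin',
          '"myocardial infarction" AND aspirin AND statins'):
    print(esearch_url(q))

# A truncated sample of the JSON that esearch returns for the most
# specific query (counts in the live service change over time):
sample = '{"esearchresult": {"count": "542", "retmax": "0", "idlist": []}}'
print(parse_count(sample))  # 542
```

Fetching each URL and parsing the response with `parse_count` would reproduce the count comparison above with current numbers.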
In conclusion, when users issue more precise queries, they will find that Explorit Everywhere! returns most or all of the available results at each source, with the results ranked using our secret sauce so the user can quickly and easily find what they were looking for across all available sources. For the case where more results are available at the source and the user needs to examine all results (perhaps they are doing some legal due diligence) then the user can go directly to the source and conduct the search there.
Editor’s Note: This is a guest article by Michelle Powers, Director of Library Services at Career Education Corporation (CEC). The Colorado Technical University and a number of other schools owned and operated by CEC are Explorit Everywhere! customers.
In 2015, the librarians of Colorado Technical University wanted to investigate alternative options to our discovery search tool. While we were not unhappy with the system we had in place, we were often stymied by the inability to incorporate competing vendors into the system’s platform and create a truly seamless experience for our students who relied heavily on the ability to search multiple subscription resources at one time.
In short, we wanted a system that provided results from truly everything we had, based on relevancy, in a completely seamless fashion, through an easy-to-use interface.
To begin, it’s important to understand that CTU has 3 campuses: 1 online campus, which serves students in a completely online environment, and 2 ground campuses, at which students can access a traditional library facility but rely heavily on the electronic database collection for academic research.
Each campus has a unique campus portal; therefore, there are 3 separate library portals. There are also some database differences, based on campus programs and other factors.
Our initial decision criteria included the following:
- We needed a system that we could implement across multiple campuses—giving our students and faculty who work in multiple campus environments the same experience
- We needed a system that would allow individual branding for our separate campuses
- The system needed to work well with a multitude of vendors
- The system needed to allow us to continue to track database usage from each campus
- We needed a system that could be embedded into our campus pages, which are behind a firewall and only available to authorized users
- We needed a system that would not require additional authentication from our students, such as a student ID, after they log onto the campus portals
- We needed the system to be intuitive for our users
- The system had to be affordable
We were at first hesitant to move back to a federated system, which is what we had in place prior to implementing the discovery system. Our old federated system had burned into our collective memory images of clunky interfaces and groupings of results that confused users. However, after reviewing how Deep Web Technologies met the criteria outlined above, we invited Abe to give us a demo.
We were impressed with Explorit Everywhere!’s easy-to-understand interface, and features like the Search Builder, the ability to categorize resources on the search page, the ability to embed the search widget into our LibGuides, and more. What we liked most about the product though was that it was vendor neutral and promised to incorporate all of our resources in a way our previous system did not.
In early 2016 we decided to make the switch and launch the Explorit Everywhere! search tool in April at CTU and multiple other institutions owned and operated by CTU’s parent company, Career Education Corporation. This meant an implementation of Explorit Everywhere! on nearly 50 campus portals in less than 3 months!
Deep Web Technologies’ team provided clear instructions of what needed to be done on our end, met regularly with our IT team and library leadership to ensure our timeline was met, created systems on the back end to allow for our requested search features, and created a method of providing the library with statistics.
Our launch was astoundingly… quiet. No upset student responses, no confusion or dismay at the new interface. Our students and faculty took to the new system like fish to water, which reinforced the library’s own opinion that the system was easy to use and satisfied the needs of our users.
Abe and his team including Christy Ziemba, Ellee Wilson, and Susan Martin have been awesome at responding to our queries and resolving any situations we’ve encountered with such a massive change.
The library is still gathering statistics to accurately calculate changes in usage since the implementation of Explorit Everywhere! We did immediately see increased usage of an e-book database that was unavailable through our previous system, and we expect to find an overall increase in database usage, especially in the other resources unavailable in the previous system.
Librarians smarter and better than I can argue the pros and cons of a federated system vs. a discovery system. I can say that our students ultimately benefit from the comprehensive search capability Deep Web Technologies has offered us.
Disclaimer: CTU cannot guarantee employment or salary. Find employment rates, financial obligations, and other disclosures at www.coloradotech.edu/disclosures.
My Biznar alert on Discovery Services recently deposited in my Inbox a link to this ProQuest blog article: A Guide to Evaluating Content Neutrality in Discovery Services. I have written about content neutrality before, most recently in the October 2015 blog article The Last of the Major Discovery Services is Independent no More, but the ProQuest article prompted me to revisit the topic.
The ProQuest blog article linked above and quoted below raises my main concern about the content neutrality of Discovery Services (EDS, Primo and Summon) that are owned by companies whose main business is selling content:
A concern that some libraries may have is that discovery service providers, that are also content providers, have an intent and vested interest to funnel usage to their content. With the success of online services often based on usage metrics and the fact that the content sales model is driven by the “revenue follows usage” mantra, librarians should well be concerned about content neutrality in discovery services from such dual providers.
Also, in this ProQuest blog article, the author says – “ProQuest and ExLibris reaffirms our commitment to content neutrality in our discovery systems.”
Nowhere, however, have I been able to find any ProQuest write-up that backs up this claim that their Discovery Services are, in fact, content neutral. As one of our former Presidents, Ronald Reagan, was fond of saying – “trust but verify”. With that said, librarians should verify the content neutrality of their Discovery Services.
One test that I would encourage readers who have purchased a Discovery Service, or have access to one, to perform is the following: run 10 queries covering different subject areas and, for each query, record which provider (EBSCO, ProQuest, or another publisher) each of the top 10 results comes from. If a large percentage of your top 10 results in EDS are EBSCO results, or a large percentage of your top 10 results in Primo or Summon are ProQuest results, then you have a content neutrality problem. I’d love to see your findings as comments to this blog article.
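For readers who want to tally their findings, here is a small sketch of the bookkeeping. The queries and provider names in `observations` are purely illustrative; the data would be recorded by hand from your own Discovery Service.

```python
from collections import Counter

# Hypothetical hand-recorded data: for each test query, the provider of
# each of the top 10 results in your discovery service.
observations = {
    "climate change":   ["EBSCO"] * 7 + ["ProQuest", "Wiley", "Springer"],
    "machine learning": ["EBSCO"] * 6 + ["Elsevier"] * 4,
    # one entry per test query
}

def provider_share(obs: dict) -> dict:
    """Fraction of all recorded top-10 slots occupied by each provider."""
    counts = Counter(p for results in obs.values() for p in results)
    total = sum(counts.values())
    return {p: n / total for p, n in counts.items()}

shares = provider_share(observations)
for provider, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{provider}: {share:.0%}")
```

A provider whose share of top-10 slots is far above its share of your licensed collection is the signal to investigate.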
In its report Open Discovery Initiative: Promoting Transparency in Discovery, the NISO Working Group makes a number of recommendations to Discovery Service vendors and librarians to help them evaluate and ensure the content neutrality of a Discovery Service. These recommendations are summarized in ExLibris’ A Guide to Evaluating Content Neutrality in Discovery Systems.
These recommendations include:
- Non-discrimination among content providers in how results are generated and relevance ranked.
- Non-discrimination in how links to results are ordered in a result list or made available via a link resolver. A potential problem might be how duplicate records are treated by the Discovery Service.
- Providing libraries with options to configure how links are labelled and displayed and how links to meta-data and full-text are provided.
In their Guide, ExLibris states that “Content neutrality in a discovery system means that students and researchers are equally exposed to the entire wealth of information from all sources.”
As you might expect, though, the Discovery Services don’t address the fact that content neutrality is seriously compromised by the inability of their services to include in their indices ALL of the content that a library has at the disposal of its students.
So, in conclusion, if you want to ensure the content neutrality of your institution’s Discovery Solution you should seriously consider an Explorit Everywhere! solution.
An Explorit Everywhere! solution provides your users with access to all of the content sources your library has licensed, ranked using our own publisher neutral algorithms (see Ranking: The Secret Sauce for Searching the Deep Web), with the display priority of duplicate results configurable. You might also want to include our partner’s Gold Rush publisher neutral link resolver.
To make Explorit Everywhere! more accessible, we have leveraged the Web Content Accessibility Guidelines 2.0 (WCAG 2.0) provided by the W3C (World Wide Web Consortium, an international standards body that ensures interoperability between web products). Accessibility means making sure that products are designed to be usable by the widest range of abilities, such as by persons with different visual, hearing, or physical abilities.
Often the phrase “Section 508 compliance” is used in conjunction with ensuring accessibility, especially by government agencies. Section 508 refers specifically to Section 508 of the Rehabilitation Act of 1973, amended in 1986, which was passed to make sure that the Federal Government provides accessible electronic and information technologies to its employees, including computers, telecommunications, and so on, as well as access to web-based intranet and internet pages and applications. More recently, the Assistive Technology Act of 1998 was passed to make sure that any state that receives Federal funding also adheres to some form of the Section 508 requirements.
WCAG 2.0 was written specifically for web content and web pages, and it pursues the same goals expressed in Section 508. WCAG defines three priority levels of checkpoints: Priority 1 checkpoints must be met, Priority 2 should be satisfied, and Priority 3 may be addressed as part of compliance. In Explorit Everywhere!, we have confirmed that we meet nearly all applicable Priority 1 checkpoints, many of Priority 2, and some of Priority 3. To explain the changes we have made to meet these checkpoints, I will use WCAG 2.0’s four fundamental principles: Perceivable, Operable, Understandable, and Robust.
For the first principle, Perceivable, we have striven to make all components of the user interface evident on our web pages; that is, nothing is hidden. We have also used common iconography rather than inventing new icons that might not be as well understood. For each UI component, we have provided text alternatives, including both visual hover-overs on the pages and text alternatives for screen readers. And, based on UX feedback, we have rearranged the structure of the components so that the relationships between them make more sense, both visually and programmatically. Moreover, our UI components are self-contained rather than dependent on context, and they do not rely on color to convey meaning. We do use color contrast to make the different components stand out more.
Because our customer University of the Arts, London (UAL) is a specialist arts institution, we have extended our application even further for their users by offering a selection of color themes for a user to select from, which is then saved to the user’s preferences (see Figure 1). These different options are designed specifically to make the interface more visually perceivable for different visual needs. UAL also asked that we add the ability to resize the text on the web page directly (as opposed to relying on the browser). These functions are available to any customer who wants to further extend their accessibility support.
For the Operable principle, we have focused on keyboard accessibility, timing between functions, and navigability. The most basic operable support is to provide keyboard access to all UI components and to avoid keyboard traps, that is, components that cannot be moved away from using the keyboard. Moreover, moving through the components does not require any specific timing for individual keystrokes. And we have ensured that, when initially landing on any page, there is a focused component for commencing keyboard navigation. Where applicable, we offer more than one way to view results and to perform certain functions, since not all users do things the same way. One area for improvement that we intend to implement in the near future is the ability to jump past entire blocks of content while keyboarding.
For the third principle, Understandable, we have ensured that Explorit Everywhere! is readable, predictable, and supportive of user input. While we believe we have made all the UI components readable and predictable, there is more work to do in making application errors more evident. For readability, we have reduced jargon, kept explanations simple, and avoided abbreviations. Predictability means not letting components change unexpectedly in any way unless the change is initiated by the user; this includes making sure that navigation and UI components are consistent across all functions. We have also expanded user preferences to allow users to save specific UI changes, and we intend to offer more preferences in future releases.
The last principle, Robust, refers to supporting assistive technologies. Besides extending keyboard accessibility, we have also reviewed our application with screen readers, making sure that our textual labels, provided both as alt text and as hover-overs, are accessible.
We believe that Explorit Everywhere! is even more perceivable, operable, understandable, and robust than ever!
I was pleasantly surprised and pleased when I woke up one recent morning to an email message from Nick Dimant, Managing Director of our partner PTFS Europe. My company and PTFS Europe were partners-in-crime in a unique (hopefully to be repeated many more times) collaboration at the University of the Arts, London (UAL).
Nick had sent me a copy of – An innovative approach to discovery (available here), a feature article in the June 2016 issue of Update, the monthly magazine of CILIP (Chartered Institute of Library and Information Professionals) by Karen Carden, Resources & Systems Manager, Library Services, UAL and by Jess Crilly, Associate Director, Content and Discovery, Library Services, UAL.
Carden and Crilly explain in their article in detail their justification and approach to implementing their Library Search solution which “brings together two separate products into (what looks like) a single interface for the user where they can search across our print and e-resources.”
Carden and Crilly discuss their selection of Explorit Everywhere! back in 2013 (which of course I love) in:
“After a great deal of research, discussion and testing we opted for an unusual – especially in the UK – next generation federated search tool. Like most libraries in the sector we had experienced first generation federated search, but found that this was quite a different experience.”
The authors describe UAL as a specialist university. What this means to me is that as a specialist university focused on the arts, a lot of the databases that UAL subscribes to are not mainstream databases and thus not included in the Discovery Services but easily federated by Explorit Everywhere!.
We give another example in Federating the Unfederatable of a specialist library, this time a defense/international policy focused university where Explorit Everywhere! provides the one-stop discovery of all the sources important to the library patrons, many not available through the Discovery Services.
If you’d like to read further on our Explorit Everywhere! solution at UAL check out these blog articles: Customer Corner – Paul Mellinger presentation, Promoting Explorit Everywhere! at UAL, and Faceted Navigation – UAL example.
In a January Sneak Peek blog article, Darcy gave us a preview of some of what my engineers were working on. Now, I am excited to present to you faceted navigation, one of the coolest features ever added to Explorit Everywhere!
In an excerpt from Peter Morville’s Search Patterns (a classic on designing effective search focused User Interfaces published in 2010), Morville quotes Professor Marti Hearst (from UC Berkeley) as saying,
“Faceted Navigation is arguably the most significant search innovation of the past decade.”
Morville describes faceted navigation simply: “It features an integrated, incremental search and browse experience that lets users begin with a classic keyword search and then scan a list of results” (p. 95).
Our faceted navigation, combined with our clustering technology, offers the researcher a more refined approach to zooming in on the most relevant results from their search. When reviewing the cluster facets, which show terms related to the search query, the researcher can narrow the results by selecting a Topic. With a Topic selected, the clusters are refreshed using only the results associated with that Topic, and the researcher is presented with new facets related only to the selected Topic. This cuts out the noise and allows the user to review very specific results.
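A simplified model of the drill-down mechanics: each facet selection filters the result set, and the facet counts are then recomputed from only the surviving results. The records below are toy data, and Explorit Everywhere!’s actual clustering is more sophisticated than this sketch.

```python
from collections import Counter

# Toy result records with facet metadata (illustrative, not real data).
results = [
    {"title": "Sistine Chapel ceiling", "topic": "Sistine Chapel", "doctype": "image"},
    {"title": "David in context",       "topic": "David",          "doctype": "article"},
    {"title": "Ceiling fresco survey",  "topic": "Sistine Chapel", "doctype": "article"},
]

def facet_counts(items, field):
    """Count how many results fall under each value of a facet field."""
    return Counter(r[field] for r in items)

def drill_down(items, field, value):
    """Narrow the result set to one facet value; facet counts are then
    recomputed over the narrowed set."""
    return [r for r in items if r[field] == value]

print(facet_counts(results, "topic"))
narrowed = drill_down(results, "topic", "Sistine Chapel")
print(facet_counts(narrowed, "doctype"))  # facets regenerate from the subset
```

Chaining `drill_down` calls yields exactly the “bread crumb” trail described below: each step filters further, and stepping back up simply replays fewer filters.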
Let’s now take a look at how faceted navigation works on one of our customer solutions at the University of the Arts, London.
We will start with a search for “Michelangelo” which returns 2,785 results (See Figure 1 above). And in the Topics list of clusters, we can see that there are several related topics: Art, Artist, Design, Sistine Chapel, David, Analysis, and so forth. These topics were derived from the results metadata returned from the 50 databases searched simultaneously.
Selecting the topic facet Sistine Chapel regenerates the cluster facets using the 81 results for that topic (see Figure 2). With this new view of the selected results, we now see more specific topics related primarily to Michelangelo’s Sistine Chapel. While the topic Ceiling Frescos looks interesting, I am curious to focus on the images under the Document Type facet.
As the researcher explores their results, our faceted navigation generates “bread crumbs” that record the drill-down steps taken. In Figure 3, we see the trail of selections we have made so far. Clicking on > Sistine Chapel lets me step back up, and I can step back down into Ceiling Frescos when I want to. See Figure 4 below for some of the interesting images I found of Michelangelo’s Sistine Chapel.
The Deep Web fascinates most of us and scares some of us, but is used by almost all of us. While over the past couple of years, more and more information has surfaced about the Deep Web, finding reputable information in those depths is still shrouded in mystery. Abe Lederman, CEO of Deep Web Technologies, wrote a guest article for Refer Summer 2016, republished in part below. Refer is an online journal published three times a year for the Information Services Group of the Chartered Institute of Library and Information Professionals (CILIP).
The Web is divided into 3 layers: the Surface Web, the Deep Web and the Dark Web. The Surface Web consists of several billion web sites, different subsets of which are crawled by search engines such as Google, Yahoo and Bing. In the next layer of the Web, the Deep Web consists of millions of databases or information sources – public, subscription or internal to an organization. Deep Web content is usually behind paywalls, often requires a password to access or is dynamically generated when a user enters a query into a search box (e.g. Netflix), and thus is not accessible to the Surface Web search engines. This content of the Deep Web is valuable because, for the most part, it contains higher quality information than the Surface Web. The bottom-most layer, called the Dark Web, gained a lot of notoriety in October 2013 when the FBI shut down the Silk Road website, an eBay-style marketplace for selling illegal drugs, stolen credit cards and other nefarious items. The Dark Web guarantees anonymity and thus is also used to conduct political dissent without fear of repercussion. Accessing the gems that can be found in the Deep Web is the focus of this article.
Michael Bergman coined the term “Deep Web” in a seminal white paper published in August 2001 entitled The Deep Web: Surfacing Hidden Value. The Deep Web is also known as the Hidden Web or the Invisible Web. According to a study conducted in 2000 by Bergman and colleagues, the Deep Web was 400-550 times larger than the Surface Web, consisting of 200,000 websites, 550 billion documents and 7,500 terabytes of information. Every few years, while writing an article on the Deep Web, I search for current information on its size and am not able to find anything new and authoritative. Many articles that I come across, like this one, still refer to Bergman’s 2001 white paper.
Many users may not be familiar with the concept of the Deep Web. However, if they have searched the U.S. National Library of Medicine’s PubMed database, searched subscription databases from EBSCO or Elsevier, searched the website of a newspaper such as the Financial Times, or purchased a train ticket online, then they have been to the Deep Web.
If you are curious about what’s in the Deep Web and how to find some good stuff, here are some places you can go to do some deep web diving…
The Professional Services department at Deep Web Technologies is extending its Account Management efforts to assist our customers in getting the most out of their Explorit Everywhere! (EE!) service. As with most software services, the basic features are easy to pick up, and EE! is no exception: everyone who has EE! can do searches, review results, and create alerts with ease. But we also all know that to really get the most out of any technology, it needs to become part of our habits. And for EE! that means having it available wherever you are whenever you have a question or an information need. After all, while Google has become a habit for many, EE!, with the high-quality, authoritative, and often expensive sources purchased by our customers for their users, needs to be the habit for those information needs and questions that Google can’t easily answer.
Explorit Everywhere! has several features that can help customers integrate their search application into their website. That way it’s there for their users wherever they need it. And folks at DWT want to help. Three quick and easy ways to better integrate your EE! searcher include:
- Add a link to your EE! Quick Search or Advanced Search page from a menu, icon, or text on different pages on your website.
- Integrate the EE! Quick Search Widget box on any website page, which will allow users to search from where they are. Then a new browser window or tab will open with the results. The widget is part of every EE! application.
- If your organization has specialized departments or research groups, then we recommend they use Search Builder. The Explorit Everywhere! Search Builder feature, available through the EE! menu and requiring an account, allows a department admin, research lead, or any faculty member to create a search widget specific to their group’s information needs by selecting the databases and search fields relevant to that group. That tailored search widget can then be placed on any web page. See how one of our customers, University of the Arts, London, integrated Search Builder into their LibGuides subject pages.
Another part of DWT’s Account Management campaign is to help our customers fine-tune their Explorit Everywhere! service so that the quality of results meets their users’ needs. As we all know, while there are hundreds, if not thousands, of information databases, they all perform differently and provide a wide range of data. The advantage of EE! is its ability to bring results from multiple databases together. So, given that your databases may be a combination of medical, news, journal, or other informational sources, we can help you tune your application so your searches display the most meaningful data.
In the coming months, here are a couple of different things we will be doing behind-the-scenes to make your EE! searcher work better for your users:
- We recently added the means to fine-tune how the first page of results is generated. As part of our Account Management services, we will be reviewing your EE! service to ensure that those first results are the highest-ranking results.
- In addition to our daily monitoring of your connectors to make sure they are working, we will be doing periodic reviews to make sure we are optimizing them for quality results. Sources often add new search fields and new results data. For example, we have been incorporating the ability to search by journal DOI or PubMed ID number in the Full Record field in EE!, which includes the Quick Search box. Our goal is to make sure your EE! service makes the most of each source.
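As an illustration of how identifier-aware searching can work (a generic sketch, not DWT’s actual implementation), a search front end might classify the query before dispatching it: a Crossref-style DOI begins with `10.` followed by a registrant prefix and a slash, while a PubMed ID is a short run of digits.

```python
import re

# A DOI looks like "10.<registrant>/<suffix>"; a PMID is all digits.
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")
PMID_RE = re.compile(r"^\d{1,8}$")

def classify_query(q: str) -> str:
    """Guess whether a query is a DOI, a PubMed ID, or ordinary keywords."""
    q = q.strip()
    if DOI_RE.match(q):
        return "doi"
    if PMID_RE.match(q):
        return "pmid"
    return "keywords"

print(classify_query("10.1056/NEJMoa1200303"))  # doi
print(classify_query("22276822"))               # pmid
print(classify_query("myocardial infarction"))  # keywords
```

The identifiers above are illustrative; a real router would hand a recognized DOI or PMID to the source fields that accept them and fall back to keyword search otherwise.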
Lastly, we will be asking our customers to fill out a short survey to facilitate more discussion so we can better understand how to help make Explorit Everywhere! the best service possible. We also look forward to having more conference calls with our customers. Ultimately, we want our customers’ users to make the best use of EE! to fulfill their information needs.
I and some of my staff have had the pleasure of working closely with Grace Baysinger, Head Librarian and Bibliographer of the Swain Chemistry and Chemical Engineering Library at Stanford University, to develop a unique research gateway focused on chemical safety.
My relationship with Grace goes back two decades, to when I developed SciSearch@LANL, a precursor to Web of Science, for Los Alamos National Laboratory, and Grace was our customer representative at Stanford.
More recently, we have worked closely with Grace on the development of xSearch (Stanford’s name for Explorit Everywhere!), our largest federated search implementation.
I have asked Grace to give us an overview of this important chemical safety resource that we have developed together.
While chemists are among the most intensive users of information, many are unfamiliar with the chemical safety resources they should consult before working in the lab. Chemists consulting material safety data sheets or safety data sheets (MSDS/SDS) often discover entries of “NA,” or not available, for physical properties that they need for their lab work.
Grace’s goal in working with Deep Web Technologies was to develop a research gateway that provides access to a wide collection of information sources focused on chemical safety. This gateway uses federated search technology, the ability to search multiple sources at one time, which helps users find the information they need more effectively and efficiently. It is possible to view results visually, move to a particular resource in the search results, and set up an alert to be notified when new information is published on a topic. Common search terms include chemical names, CAS Registry Numbers, and topic keywords. If a resource contains InChI or SMILES values for a chemical substance, those may be used as search terms too.
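To illustrate one small piece of what a chemistry-aware gateway can do (a generic sketch, not a description of the Stanford gateway’s internals): CAS Registry Numbers carry a built-in check digit, computed by multiplying the remaining digits, right to left, by 1, 2, 3, … and taking the sum modulo 10, so a search interface can validate a CAS-shaped query before dispatching it.

```python
import re

# CAS format: 2-7 digits, hyphen, 2 digits, hyphen, 1 check digit.
CAS_RE = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")

def is_valid_cas(number: str) -> bool:
    """Check the format and check digit of a CAS Registry Number."""
    m = CAS_RE.match(number.strip())
    if not m:
        return False
    digits = m.group(1) + m.group(2)        # all digits except the check digit
    check = int(m.group(3))
    # Multiply digits right-to-left by 1, 2, 3, ...; sum modulo 10 = check digit.
    total = sum(i * int(d) for i, d in enumerate(reversed(digits), start=1))
    return total % 10 == check

print(is_valid_cas("7732-18-5"))   # water: True
print(is_valid_cas("7732-18-4"))   # bad check digit: False
```

A gateway can use a check like this to decide whether to route a query to CAS-number search fields or to ordinary keyword search.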
Moving soon from prototype to production, the Stanford University version of the chem safety gateway will be a collaborative effort between the Stanford University Libraries and Stanford Environmental Health and Safety. The gateway has 60+ information sources that include SDS/MSDS and safety data, synthesis and reaction databases, citation databases, and full-text eBooks and eJournals, plus a number of Environmental Health and Safety (EH&S) websites. While the SDS/MSDS and safety data resources form the core of this collection, curated databases such as Organic Syntheses, Organic Reactions, Science of Synthesis, Merck Index, Reaxys, and e-EROS (Encyclopedia of Reagents for Organic Synthesis) include protocols and safety information useful to bench chemists. eBooks and eJournals are full-text searchable, allowing researchers to find property and safety information in handbooks and methods and protocols in journal articles. EH&S websites from selected universities, plus websites for the ACS Committee on Chemical Safety, the ACS Division of Chemical Health and Safety, and the U.S. Chemical Safety Board, will help users discover information such as training materials, standard operating procedures, and lessons learned. Search results for a chemical name search also include the “Chemical Box” from Wikipedia in the right column.
At the ACS Spring 2016 National Meeting held in San Diego, Grace and colleagues from Stanford’s EH&S unit gave a presentation, “Using a chemical inventory system to optimize safe laboratory research,” in a Division of Chemical Health and Safety symposium. The first part of the presentation covers ChemTracker, and the latter part (starting on slide 23) shows screenshots of the Stanford Chem Safety Gateway. Slide 24 lists the resources being searched in the Stanford gateway. For a current list of resources, please see Grace’s recent blog entry, Chemical safety resource gateway available.
Grace then helped the DWT team develop a public version of the Chem Safety Gateway that is available to test-drive at:
This public site searches a subset of the sources searched at the Stanford site, as DWT is not able to search subscription sources through its public site.
Please test-drive the public version of the Explorit Everywhere! Chem Safety Gateway and give Abe feedback on how useful it is to be able to search a broad set of chemical safety resources at the same time. Registration is not required to search, but be sure to register if you want to use the Alerts and MyLibrary features. Did you find that the gateway returned relevant results? What sources (subscription or public) would you add to make the gateway even better? Do you have any other suggestions for improving it?
Please email your feedback to email@example.com
A couple of months ago I came across the claim that we are generating 2.5 billion GB of new data every day and thought that I should write a fun little blog article about this claim. Here it is.
This claim, repeated by many, is attributed to IBM’s 2013 Annual Report. In the report, IBM claims that 2.5 billion GB of data was generated every day in 2012, and that 80% of this data was unstructured, with audio, video, sensor data, and social media among the newer contributors to this deluge. IBM also claims in the report that by 2015, 1 trillion connected objects and devices would be generating data across our planet.
So how big is a billion GB? A billion GB is an exabyte (10^18 bytes, a 1 followed by 18 zeros), i.e., 1,000 petabytes or 1,000,000 terabytes.
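These unit conversions are easy to mix up, so here is a quick sanity check in Python. It is just a sketch using decimal SI units (1 GB = 10^9 bytes) and the article’s own 2.5-billion-GB figure, not an independent measurement:

```python
# Decimal (SI) storage units: each step up is a factor of 1,000.
GB = 10**9   # gigabyte
TB = 10**12  # terabyte
PB = 10**15  # petabyte
EB = 10**18  # exabyte

daily_bytes = 2.5 * 10**9 * GB  # IBM's figure: 2.5 billion GB per day

print(daily_bytes / EB)  # 2.5       exabytes per day
print(daily_bytes / PB)  # 2500.0    petabytes per day
print(daily_bytes / TB)  # 2500000.0 terabytes per day
```

So a billion GB is indeed one exabyte, and the claimed daily output is 2.5 of them.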
My research took me to this article in Scientific American — What is the Memory Capacity of the Human Brain? — which I read to try to put into context all the huge numbers I’ve been throwing around. Professor of Psychology Paul Reber estimates that:
The human brain consists of about one billion neurons. Each neuron forms about 1,000 connections to other neurons, amounting to more than a trillion connections. If each neuron could only help store a single memory, running out of space would be a problem. You might have only a few gigabytes of storage space, similar to the space in an iPod or a USB flash drive. Yet neurons combine so that each one helps with many memories at a time, exponentially increasing the brain’s memory storage capacity to something closer to around 2.5 petabytes (or a million gigabytes).
So if I’m doing my math right, the 2.5 billion GB of information generated daily could be stored in 1,000 human brains (1,000 brains × 2.5 petabytes each = 2.5 exabytes), and collectively human memory capacity is still far higher than our electronic storage capacity.
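Here is that back-of-the-envelope check spelled out, a sketch that simply combines IBM’s 2.5-billion-GB-per-day figure with Reber’s ~2.5 PB-per-brain estimate:

```python
GB = 10**9   # gigabyte, in bytes
PB = 10**15  # petabyte, in bytes

daily_data = 2.5 * 10**9 * GB  # 2.5 billion GB generated per day (IBM)
brain_capacity = 2.5 * PB      # one human brain, per Reber's estimate

brains_needed = daily_data / brain_capacity
print(brains_needed)  # 1000.0 -- a day's worth of data fits in 1,000 brains
```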
My research then took me to this interesting blog article from 2015 – Surprising Facts and Stats About The Big Data Industry. Some of the facts and stats that I found most interesting in their infographic include:
- Google is the largest ‘big data’ company in the world, processing 3.5 billion requests per day, storing 10 Exabytes of data.
- Amazon hosts the most servers of any company, estimated at 1,400,000 servers with Google and Microsoft close behind.
- Amazon Web Services (AWS) is used by 60,000 companies and fields more than 650,000 requests every second. It is estimated that 1/3 of all Internet users visit a website hosted on AWS daily, and that 1% of all Internet traffic goes through Amazon.
- Facebook collects 500 terabytes of data daily, including 2.5 billion pieces of content, 2.7 billion likes and 300 million photos.
- 90% of all the data in the world was produced in the last 2 years.
- It is estimated that 40 Zettabytes (40,000 Exabytes) of data will be created by 2020.
Another interesting infographic, showing how much data was generated every minute in 2014 by some of our favorite web applications, is available at Data Never Sleeps 2.0.
I would be remiss if I didn’t tie this blog article to what we do at Deep Web Technologies, so please take a look at our marketing piece, Take on Big Data & Web Sources with Deep Web Technologies. We’d love to hear from you and explore how we can feed content and data from a myriad of disparate sources into your big data analytics engine on the back end, as well as how we can enhance the insights derived by your big data solutions by providing real-time access to content that complements those insights.