The Deep Web fascinates most of us and scares some of us, but it is used by almost all of us. Although more and more information about the Deep Web has surfaced over the past couple of years, finding reputable information in those depths remains a challenge. Abe Lederman, CEO of Deep Web Technologies, wrote a guest article for Refer Summer 2016, republished in part below. Refer is an online journal published three times a year by the Information Services Group of the Chartered Institute of Library and Information Professionals (CILIP).
The Web is divided into three layers: the Surface Web, the Deep Web and the Dark Web. The Surface Web consists of several billion websites, different subsets of which are crawled by search engines such as Google, Yahoo and Bing. The next layer, the Deep Web, consists of millions of databases and information sources – public, subscription-based or internal to an organization. Deep Web content is usually behind paywalls, often requires a password to access or is dynamically generated when a user enters a query into a search box (e.g. Netflix), and thus is not accessible to Surface Web search engines. This content is valuable because, for the most part, it is of higher quality than what the Surface Web offers. The bottom-most layer, the Dark Web, gained notoriety in October 2013 when the FBI shut down the Silk Road website, an eBay-style marketplace for selling illegal drugs, stolen credit cards and other nefarious items. The Dark Web guarantees anonymity and thus is also used to conduct political dissent without fear of repercussion. Accessing the gems that can be found in the Deep Web is the focus of this article.
Michael Bergman coined the term “Deep Web” in a seminal white paper published in August 2001, entitled The Deep Web: Surfacing Hidden Value. The Deep Web is also known as the Hidden Web or the Invisible Web. According to a study conducted in 2000 by Bergman and colleagues, the Deep Web was 400-550 times larger than the Surface Web, consisting of 200,000 websites, 550 billion documents and 7,500 terabytes of information. Every few years, while writing an article on the Deep Web, I search for current information on its size, and I’m not able to find anything new and authoritative. Many articles that I come across, like this one, still refer to Bergman’s 2001 white paper.
Many users may not be familiar with the concept of the Deep Web. However, if they have searched the U.S. National Library of Medicine’s PubMed database, searched subscription databases from EBSCO or Elsevier, searched the website of a newspaper such as the Financial Times, or purchased a train ticket online, then they have been to the Deep Web.
If you are curious about what’s in the Deep Web and how to find some good stuff, here are some places you can go to do some deep web diving…
A couple of months ago I came across the claim that we are generating 2.5 billion GB of new data every day and thought that I should write a fun little blog article about this claim. Here it is.
This claim, repeated by many, is attributed to IBM’s 2013 Annual Report. In this report, IBM claims that in 2012, 2.5 billion GB of data was generated every day, 80% of it unstructured, including audio, video, sensor data and social media among the newer contributors to this deluge. IBM also claims in this report that by 2015, 1 trillion connected objects and devices would be generating data across our planet.
So how big is a billion GB? A billion GB is an Exabyte (a 1 followed by 18 zeros), i.e., 1000 petabytes or 1,000,000 terabytes.
My research took me to this article in Scientific American – What is the Memory Capacity of the Human Brain? – which I perused to try to put into context all the huge numbers I’ve been throwing around. Professor of psychology Paul Reber estimates that:
The human brain consists of about one billion neurons. Each neuron forms about 1,000 connections to other neurons, amounting to more than a trillion connections. If each neuron could only help store a single memory, running out of space would be a problem. You might have only a few gigabytes of storage space, similar to the space in an iPod or a USB flash drive. Yet neurons combine so that each one helps with many memories at a time, exponentially increasing the brain’s memory storage capacity to something closer to around 2.5 petabytes (or a million gigabytes).
So if I’m doing my math right, the 2.5 billion GB of information generated daily could be stored in 1,000 human brains, and human memory capacity is still far higher than our electronic storage capacity.
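These figures are easy to sanity-check. Here is a back-of-the-envelope sketch in Python, assuming decimal SI units throughout and taking Reber's 2.5 PB per-brain estimate from the quote above:

```python
# Back-of-the-envelope check of the figures above (decimal SI units assumed).
GB = 10**9   # gigabyte, in bytes
PB = 10**15  # petabyte
EB = 10**18  # exabyte

daily_data = 2.5e9 * GB            # 2.5 billion GB generated per day
assert daily_data == 2.5 * EB      # i.e., 2.5 exabytes per day

brain_capacity = 2.5 * PB          # Reber's estimate for one human brain
brains_needed = daily_data / brain_capacity
print(brains_needed)               # 1000.0
```

So the 1,000-brains figure follows directly from the two estimates.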
My research then took me to this interesting blog article from 2015 – Surprising Facts and Stats About The Big Data Industry. Some of the facts and stats that I found most interesting in their infographic include:
- Google is the largest ‘big data’ company in the world, processing 3.5 billion requests per day, storing 10 Exabytes of data.
- Amazon hosts the most servers of any company, estimated at 1,400,000 servers with Google and Microsoft close behind.
- Amazon Web Services (AWS) is used by 60,000 companies and fields more than 650,000 requests every second. It is estimated that 1/3 of all Internet users use a website hosted on AWS daily, and that 1% of all Internet traffic goes through Amazon.
- Facebook collects 500 terabytes of data daily, including 2.5 billion pieces of content, 2.7 billion likes and 300 million photos.
- 90% of all the data in the world was produced in the last 2 years.
- It is estimated that 40 Zettabytes (40,000 Exabytes) of data will be created by 2020.
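The unit conversion in that last bullet checks out under the same decimal prefixes used earlier, as a quick check shows:

```python
# Sanity check of the zettabyte-to-exabyte conversion in the last bullet
# (decimal SI units: 1 ZB = 1,000 EB).
EB = 10**18  # exabyte, in bytes
ZB = 10**21  # zettabyte

assert 40 * ZB == 40_000 * EB  # 40 ZB is indeed 40,000 EB
print(40 * ZB // EB)           # 40000
```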
Another interesting infographic, on how much data was generated every minute in 2014 by some of our favorite web applications, is available at Data Never Sleeps 2.0.
I would be remiss if I didn’t tie my blog article to what we do at Deep Web Technologies, so please take a look at our marketing piece – Take on Big Data & Web Sources with Deep Web Technologies. We’d love to hear from you and explore how we can feed content and data from a myriad of disparate sources into your big data analytics engine, and how we can enhance the insights your big data solutions derive by providing real-time access to content that complements them.
Editor’s Note: This is a guest article by Andreea-Roxana Obreja. Andreea graduated from the University of Portsmouth, United Kingdom, with a First Class Honours degree in Business Information Systems. Her personal interest in covert data and online research inspired her to author a comprehensive review of the potential of the Deep Web as a business tool for her final-year project. The project was awarded the Clever Touch Prize for the Most Original Business Systems Project by the University of Portsmouth. The conclusions of her project will be presented at the 12th Conference of the Italian Chapter of AIS at the Sapienza University of Rome.
We might be using the “Deep Web” every day without calling it that or even being aware of its existence. Simply filling in a web form lets us access the Deep Web and retrieve data from a variety of databases, some free, some subscription-based and some with major access costs attached. Any online data used for business purposes (not necessarily the same purposes for which it was collected) can be risky, but not knowing what data is out there about you and your company represents a significantly higher threat. On the other hand, a thorough Deep Web search can greatly benefit companies researching competitors, potential employees, customers and business trends.
There are various types of data that can be accessed with intermediate technical skills and a few Deep Web resources: information customers share about the organisation and its products, and information employees share about their jobs, the products they are working on, and company strategy and policies. More importantly, data aggregated from publicly available databases can reveal costly, confidential information.
In terms of resources, an initial Deep Web exploration does not imply major investment or require a team of highly skilled IT developers. Freely available tools such as DWT’s Biznar represent an excellent starting point for searching a variety of authoritative business databases in real time. Other subject-specific, publicly available search portals include Mednar for medical researchers and WorldWideScience.org for scientific information. This kind of exploration can be learnt and done in-house with minimal resources and can save your company many hours of online searching with traditional search engines. For on-demand searches, constant monitoring of specific databases and alerts, commercial applications such as those powered by Explorit Everywhere! can facilitate a targeted Deep Web search strategy, advise on the content that needs monitoring and provide a unified access point to all the necessary data sources.
Going back to the types of data that might be made visible through Deep Web resources without their owner being aware: intellectual property on the Deep Web is currently a matter under scrutiny. While traditional search engines might only take into account the big picture, trying to match your search terms in the title, abstract and keywords, Deep Web tools can perform fully comprehensive searches. Apart from letting you monitor your own patents, inventions and discoveries online, this could save your company money by preventing you from becoming a litigation target after mistakenly infringing on another company’s intellectual property rights.
The ubiquitous availability of social media applications and people’s urge to share data have led to extensive concern about how much data your employees and customers are disclosing about your company. Social media enables the creation of enormous amounts of data that are not easy to search and interpret. Most of this data is stored in dynamic databases that are not indexed by traditional Surface Web search engines. This means they are part of the Deep Web, sometimes protected only by an individual’s privacy settings. With the right Deep Web tools, anyone can monitor the details customers share about products, their purchasing experience and their general attitude towards the organisation. Beyond monitoring various data sources in isolation, aggregating them can reveal new information or give renewed meaning to existing (most of the time, publicly available) information. Caution is advised when aggregating data collected by another organisation, as processing it might breach data protection regulation.
On the negative side, sheltered by a fake username and encouraged by a following, anyone can express an opinion about the organisation on social media that will require considerable resources to trace, challenge or disprove. More dangerously, the ease of creating and sharing content challenges employees’ obligation to comply with the company’s non-disclosure policies, making social media sites an ideal source of data about company difficulties, new products or future strategy. Constant monitoring and awareness of these breaches can help the company reinforce policies and put contingency plans in place to contain the damage.
Even if you feel that traditional online research tools provide all the data necessary for your business activities, Deep Web data sources can no longer be ignored. The Deep Web, along with its notorious subset the Dark Web, is significantly larger than the Surface Web, and due to its vastness its content cannot always be monitored or regulated. Being aware of its existence and acquiring technology to monitor your presence and the data about you on it, or to monitor your competitors, might prove beneficial in a market where competitive intelligence is a critical component of success.
Summary based on ‘The business potential of the Deep Web for SMEs’ published as a final year undergraduate project for the University of Portsmouth, UK
People tend to think of Google as the authority in search. Increasingly, we hear people use “google” as a verb, as in, “I’ll just google that.” General users, students and even professional researchers are using Google more and more for their queries, both mundane and scholarly, perpetuating the Google myth: If you can’t find it on Google, it probably doesn’t exist. Google’s ease of use, fast response time and simple interface give users exactly what they need… or do they?
Teachers say that 94% of their students equate “Research” with “Google”. (Search Engine Land)
“Another concern is the accuracy and trustworthiness of content that ranks well in Google and other search engines. Only 40 percent of teachers say their students are good at assessing the quality and accuracy of information they find via online research. And as for the teachers themselves, only five percent say ‘all/almost all’ of the information they find via search engines is trustworthy — far less than the 28 percent of all adults who say the same.”
Do teachers have a point here? Is it possible that information found via search engines is less than trustworthy, and if so, where do teachers and other serious researchers need to go to find quality information? Deep Web Technologies did a little research of our own to see just how results from Google and popular Explorit Everywhere! search engines differ in the quality of their science sources.
How Google Works
Google and other popular search engines, such as Bing and Yahoo, search the Surface Web for information. The Surface Web, as opposed to the Deep Web, consists of public websites that are open to crawlers, which read each website’s information and store it in a giant database called an index. When users search for information, they are actually searching that index, not the websites themselves. The results returned are the ones people seemed to like in the past – the most popular results for the query. That’s right… the most popular, not necessarily the most relevant or highest-quality resources.
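To make the crawl-then-index model concrete, here is a toy sketch in Python. The pages, popularity scores and ranking heuristic are all invented for illustration; real engines use far more elaborate signals, but the key point survives: the query never touches the live sites, only the stored index.

```python
# Toy illustration of the crawl-then-index search model described above.
# The pages and the popularity-based ranking are invented for this example.
from collections import defaultdict

pages = {
    "example.com/a": "deep web search tools",
    "example.com/b": "popular cat videos",
    "example.com/c": "deep sea videos",
}
popularity = {"example.com/a": 3, "example.com/b": 90, "example.com/c": 12}

# "Crawling" produces an inverted index: term -> set of pages containing it.
index = defaultdict(set)
for url, text in pages.items():
    for term in text.split():
        index[term].add(url)

def search(query):
    """Look the query up in the index (not the live sites) and rank the
    hits by stored popularity, mirroring the behavior described above."""
    hits = set.intersection(*(index.get(t, set()) for t in query.split()))
    return sorted(hits, key=lambda u: popularity[u], reverse=True)

print(search("deep videos"))  # ['example.com/c']
```

Note that the most popular page overall (`example.com/b`) never appears for this query; popularity only decides ties among pages that match, which is the kindest version of the ranking bias discussed above.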
We should probably also mention those sneaky ads at the top of the page that look informative, but can be quite deceptive. A JAMA article states this about medical search ads:
“Many of the ads, the researchers noted, are very informational — with ‘graphs, diagrams, statistics and physician testimonials’ — and therefore not identifiable to patients as promotional material.
This kind of ‘incomplete and imbalanced information’ is particularly dangerous, they note, because of its deceptively professional appearance: ‘Although consumers who are bombarded by television commercials may be aware that they are viewing an advertisement, hospital websites often have the appearance of an education portal.'”
Researchers who think Google reads their mind and magically returns the right information on the first page of results should think again. The #1 position on a Google results page gets 33% of the traffic, making it a highly sought-after spot. Unfortunately, with SEO tricks inflating page rank and ads vying for the top spot, that number-one result, or even the top page of results, may not be entirely germane or contain much scholarly content. Those results rank high because they’ve worked the Google system.
Science.gov, developed and maintained by the DOE Office of Scientific and Technical Information, uses Explorit Everywhere! to search over 60 databases and over 2200 selected websites from 15 federal agencies. The results are from authoritative, government sources, and extraordinarily relevant. When you perform a search on Science.gov, there is no question about the sources you are searching. Explore the difference!
Whether you are a student or a scientist, knowing where to start your science search is very important. In most cases, serious research doesn’t start with Google. A 2014 IDC study shows that knowledge workers find the information required to do their jobs only 56% of the time. Having the right sources available through an efficient Deep Web search like Explorit Everywhere! is critical to finding significant scientific information and staying ahead of the game.
In a highly cited August 2001 article, The Deep Web: Surfacing Hidden Value, Michael Bergman coined the term “Deep Web” and wrote:
Searching on the Internet today can be compared to dragging a net across the surface of the ocean. While a great deal may be caught in the net, there is still a wealth of information that is deep, and therefore, missed. The reason is simple: Most of the Web’s information is buried far down on dynamically generated sites, and standard search engines never find it.
In February 2002, just a few months after Michael Bergman published this article, I saw the huge potential of the “Deep Web” for providing access to a wealth of high-quality content not available via search engines such as Google, and I incorporated Deep Web Technologies that year. The “Deep Web” was a more accurate term for what had been referred to for a number of prior years as the “Hidden Web” or the “Invisible Web”. I’m not sure who eventually coined the term “Dark Web” or when. One early reference I found was a chapter in a book on Intelligence and Security Informatics published in 2005: “The Dark Web Portal Project: Collecting and Analyzing the Presence of Terrorist Groups on the Web.”
Everything was mostly fine until October 2013, when the FBI shut down the Silk Road website, a Dark Web eBay-style marketplace for selling illegal drugs, stolen credit cards and other nefarious items. Since the takedown of Silk Road, a plethora of articles have been published that refer to the Dark Web as the Deep Web, leading to a lot of confusion and heartache for the CEO of one company in particular, Deep Web Technologies.
In November 2013, following a Time Magazine cover story on the Secret Web, which was soon being referred to as the Deep Web, I wrote a letter to the editor of Time and followed it with the blog article – The Deep Web isn’t all drugs, porn and murder – to no avail.
In the past few months, following the announcement of DARPA’s Memex project – which states as its goal: “Creation of a new domain-specific indexing and search paradigm will provide mechanisms for improved content discovery, information extraction, information retrieval, user collaboration, and extension of current search capabilities to the deep web, the dark web, and nontraditional (e.g. multimedia) content” – many more articles have been published equating the “deep web” and the “dark web”, such as the following article about NASA’s efforts to leverage Memex: NASA has big plans for DARPA’s scary “Deep Web”.
What prompted me to write this blog article is that I learned a few days ago that Epix has produced a documentary, to be released on May 31, 2015, titled Deep Web.
“Extending far beyond the confines of Google and Facebook, there is a vast section of the World Wide Web that is a hidden alternate internet. Appropriately named the Deep Web, this mysterious and complex cyberspace serves as an outlet for anonymous communication and was home to Silk Road, the online black market notorious for drug trafficking. The intricacies of this concealed cyber realm caught the attention of the general public with the October 2013 arrest of Ross William Ulbricht – the convicted 30-year-old entrepreneur accused to be ‘Dread Pirate Roberts,’ the online pseudonym of the Silk Road leader. Making its World Television Premiere this spring, Deep Web – an EPIX Original Documentary written, directed and produced by Alex Winter (Downloaded) – seeks to unravel this tangled web of secrecy, accusations, and criminal activity, and explores how the outcome of Ulbricht’s trial will set a critical precedent for the future of technological freedom around the world.”
Clearly Dark Web would be a more appropriate title for this documentary and might attract a bigger audience than Deep Web, but I’m not so fortunate. What am I to do?
Editor’s Note: This is a guest article by Lisa Brownlee. The 2015 edition of her book, “Intellectual Property Due Diligence in Corporate Transactions: Investment, Risk Assessment and Management”, originally published in 2000, will dive into discussions about using the Deep Web and the Dark Web for Intellectual Property research, emphasizing its importance and usefulness when performing legal due-diligence.
Lisa M. Brownlee is a private consultant and has become an authority on the Deep Web and the Dark Web, particularly as they apply to legal due-diligence. She writes and blogs for Thomson Reuters. Lisa is an internationally-recognized pioneer on the intersection between digital technologies and law.
In this blog post I will delve in some detail into the Deep Web. This expedition will focus exclusively on that part of the Deep Web that excludes the Dark Web. I cover both Deep Web and Dark Web legal due diligence in more detail in my blog and book, Intellectual Property Due Diligence in Corporate Transactions: Investment, Risk Assessment and Management. In particular, in this article I will discuss the Deep Web as a resource of information for legal due diligence.
When Deep Web Technologies invited me to write this post, I initially intended to delve primarily into the ongoing confusion I addressed in The Deep Web and the Dark Web – Why Lawyers Need to Be Informed.
Deep Web: a treasure trove of data and other information
The Deep Web is populated with vast amounts of data and other information that are essential to investigate during legal due diligence in order to find information about a company that is a target for possible licensing, merger or acquisition. A Deep Web (as well as Dark Web) due diligence should be conducted to ensure that information relevant to the subject transaction and target company is not missed or misrepresented. Lawyers and financiers conducting the due diligence have essentially two options: conduct it themselves by visiting each potentially relevant database and running each search individually (potentially ad infinitum), or hire a specialized company such as Deep Web Technologies to design and set up such a search. Hiring an outside firm saves time and money.
Deep Web data mining is a science that cannot be mastered by lawyers or financiers in a single transaction or even a handful of them. Using a specialized firm such as DWT has the added benefit of being able to replicate the search on demand and/or have ongoing, updated searches performed. Additionally, DWT can bring multilingual search capabilities to investigations – a feature that very few, if any, other data mining companies provide and one that would most likely be deficient or missing entirely from a search conducted in-house.
What information is sought in a legal due diligence?
A legal due diligence will investigate a wide and deep variety of topics, from real estate to human resources, to basic corporate finance information, industry and company pricing policies, and environmental compliance. Due diligence nearly always also investigates intellectual property rights of the target company, in a level of detail that is tailored to specific transactions, based on the nature of the company’s goods and/or services. DWT’s Next Generation Federated Search is particularly well-suited for conducting intellectual property investigations.
In sum, the goal of a legal due diligence is to identify and confirm basic information about the target company and determine whether there are any undisclosed infirmities in the target company’s assets and information as presented. In view of these goals, the investing party will require the target company to produce a checklist full of items about the various aspects of the business (and more) discussed above. An abbreviated correlation between the information typically requested in a due diligence and the information available in the Deep Web is provided in the chart attached below. Without assistance from Deep Web Technologies, either someone within the investor company or its outside counsel will need to search each of the databases listed, in addition to others, to confirm that the information provided by the target company is correct and complete. While representations and warranties are typically given by the target company as to the accuracy and completeness of the information provided, it is also typical for the investing company to confirm all or part of that information, depending on the sensitivities of the transaction and the areas in which value, and possible risks, might be uncovered.
The April/May 2015 issue of Multilingual.com Magazine features a new article, “Advancing science by overcoming language barriers,” co-authored by DWT’s own Abe Lederman and Darcy Katzman. The article discusses the Deep Web vs. the Dark Web, and the technology needed to find results in scientific and technical, multilingual Deep Web databases. It also speaks of the efforts of the WorldWideScience Alliance in addressing the global need for multilingual search through the creation of the WorldWideScience.org federated search application. Over the years, we’ve heard many names for this underlying technology:
- Distributed Search
- Broadcast Search
- Unified Search
- Data Fusion
- Parallel Search
- Cross-Database Search
- Single Search
- Integrated Search
- Universal Search
For the most part, all of these mean about the same thing: An application or a service that allows users to submit their query to search multiple, distributed information sources, and retrieve aggregated, ranked and deduplicated results.
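That one-sentence definition can be sketched in a few lines of code. The following Python sketch fans a query out to two stand-in sources in parallel, then merges, deduplicates and ranks the results; the source functions and scoring are invented for illustration and are not DWT connectors.

```python
# Minimal sketch of the federated-search pattern described above: one query
# sent to several sources at once, results aggregated, deduplicated, ranked.
# The sources and scores below are stand-ins, not real connectors.
from concurrent.futures import ThreadPoolExecutor

def source_a(query):
    return [{"title": "Deep Web Primer", "score": 0.9},
            {"title": "Search Basics", "score": 0.4}]

def source_b(query):
    return [{"title": "Deep Web Primer", "score": 0.7},   # duplicate hit
            {"title": "Federated Search 101", "score": 0.8}]

def federated_search(query, sources):
    # Fan the query out to every source in parallel (real-time search,
    # no pre-built master index).
    with ThreadPoolExecutor() as pool:
        result_lists = list(pool.map(lambda s: s(query), sources))
    # Deduplicate by title, keeping the best score, then rank globally.
    best = {}
    for r in (r for rs in result_lists for r in rs):
        if r["score"] > best.get(r["title"], {"score": -1.0})["score"]:
            best[r["title"]] = r
    return sorted(best.values(), key=lambda r: r["score"], reverse=True)

for r in federated_search("deep web", [source_a, source_b]):
    print(r["title"], r["score"])
```

The key design point, and the one the next paragraph turns on, is that nothing is harvested ahead of time: every source is queried live at search time.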
But the question remains: Is “federated search” a master index, or a real-time (on-the-fly) search? And this is a very good question, given our familiarity with Google and their enormous public index. Sol Lederman raised this question back in 2007 on the Federated Search Blog, What’s in a Name?
“The distinction, of course, is crucial. If you’re meeting with a potential customer who believes you’re discussing an approach where all content is harvested, indexed and accessed from one source but you think you’re discussing live search of heterogeneous sources then you’re talking apples and oranges.”
Deep Web Technologies’ real-time approach gives us an advantage over building a master index which we’ll discuss in our next blog post. In the meantime, can you think of any other names for what we do? We’d love to hear from you!
Time Magazine’s current issue (November 11, 2013) cover story, “The Secret Web: Where Drugs, Porn and Murder Live Online,” reveals the dark side of the Deep Web, where criminals can hide from surveillance efforts to commit nefarious deeds anonymously. The buzz about the evils of the Dark Web (as Time’s Secret Web is commonly referred to) started early last month when Ross Ulbricht was arrested in San Francisco “on charges of alleged murder for hire and narcotics trafficking violation” and identified as the founder and chief operator of Silk Road. Ulbricht, known as “Dread Pirate Roberts”, is accused of running what Wikipedia describes as an underground website sometimes called the “Amazon.com of illegal drugs” or the “eBay for drugs.” And, of course, the government shut down Silk Road.
As founder and president of Deep Web Technologies, I take exception to the article’s broad reference to the dark regions of the Web as the Deep Web. The term Deep Web, first coined in 2000, refers to huge areas of the Internet that serve legitimate organizations and the public. Not all of the Deep Web is dark. In fact, most of it isn’t. In fairness to the Time Magazine authors, Grossman and Newton-Small, they do make this point early on:
Technically the Deep Web refers to the collection of all the websites and databases that search engines like Google don’t or can’t index, which in terms of the sheer volume of information is many times larger than the Web as we know it.
Here’s a snippet from Understanding Deep Web Technologies that gives a hint as to what treasures lie in the Deep Web.
The deep web is everywhere, and it has much more content than the surface web. Online TV guides, price comparison websites, services to find out-of-print books, driving-direction sites, services that track the value of your stocks and report news about companies in your holdings – these are just a few examples of valuable services built around searching deep web content.
But, not only is the Deep Web of interest to consumers, it’s of particular value to academicians, scientists, researchers, and a whole slew of business people who rely on timely access to cutting edge Deep Web content to maintain a competitive edge.
Here’s another snippet, this one from a series, “Federated Search Finds Content that Google Can’t Reach,” emphasizing the importance of Deep Web searching to research organizations.
Federated search facilitates research by helping users find high-quality documents in more specialized or remote corners of the Internet. Federated search applications excel at finding scientific, technical, and legal documents whether they live in free public sites or in subscription sites. This makes federated search a vital technology for students and professional researchers. For this reason, many libraries and corporate research departments provide federated search applications to their students and staff.
Hopefully you’re convinced that there’s valuable information in the Deep Web. No one knows exactly how big the Deep Web is compared to the Surface Web that Google, Bing and the others crawl, but it’s likely hundreds of times larger. This is great when you have access to tools like Deep Web Technologies’ Explorit search engine, but it might also make you nervous: how can you find that needle in the haystack in a web hundreds of times larger than the one you’re familiar with, which already overwhelms you with too much information and too much junk mixed in with the good stuff?
If what is in the Deep Web intrigues you, try a few of our Deep Web applications to see a bit of the richness that lies beneath the surface of the Web.
Update: I have also written an email to Time Magazine which I’ve copied below. I don’t know if they will publish it or not, but I certainly hope that they will recognize that the Deep Web is much more than a haven for criminals.
As someone who makes his living providing access to the legitimate parts of the Deep Web I am very concerned that your article paints a dark picture of the Deep Web as a whole. The company I founded, Deep Web Technologies, Inc., searches Deep Web sources on behalf of scientists, researchers, students and business people. My concern is that the public, and my potential customers, will equate all things related to the Deep Web with dark criminal activity. Please help me to correct this potential misperception to the reality that the Deep Web is about those areas of the Web that contain high quality content and that the Dark Web is just a fringe neighborhood within the Deep Web that most of us will never venture into.
Deep Web Technologies, Founder and President