Hadoop is a Big Data technology that is currently front and center in the world of Business Intelligence. It is helping to unlock new data-driven solutions and insights by efficiently harnessing the power of copious amounts of information.
The ability to store and quickly access data has been important since the day we started gathering information. The Dewey Decimal Classification, published by Melvil Dewey in 1876, was in effect a very effective business intelligence system for storing, classifying and finding books. In the 1990s, new data warehousing and data mining technologies provided insight into many end-user behaviors, from where in a supermarket diapers should be placed to encourage additional purchases of beer, to allowing companies like Google to store and deliver accurate and personalized search results. Fast forward to today: we have so much data from so many different sources, including traditional business systems and a myriad of devices and digital interactions, that new technologies are needed to store, process and deliver information. That is where Hadoop fits in.
What is Hadoop?
Directly from the Apache Hadoop website (1),
“The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
The project includes these modules:
Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.”
In plain business language, Hadoop lets users store and process very large data sets efficiently and economically on clusters of commodity hardware.
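To make the "simple programming models" in the description above more concrete, here is a minimal sketch of the MapReduce idea as a word count in plain Python. It runs on a single machine and does not use any Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for illustration. On a real cluster, the framework would run many map and reduce tasks in parallel on different machines.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle step: group all emitted values by their key."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce step: combine all values for one key into a single total."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    """Run map, shuffle and reduce in sequence on one machine (illustration only)."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

The point of the split is that each map call looks at only one line and each reduce call looks at only one key, which is what allows the work to be spread across many machines.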
What can Hadoop Do?
Hadoop can store and process huge amounts of data.
Hadoop can work with almost all data types (structured, unstructured, machine-generated, etc.).
Hadoop increases fault tolerance, the ability of a system to continue operating properly when some of its components fail.
Hadoop is fast, because it processes data in parallel across the machines in a cluster.
Hadoop can scale from a single server to thousands of machines.
Hadoop is economical, because it runs on commodity hardware.
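The fault tolerance point can be illustrated with a small simulation. HDFS keeps several copies (replicas) of each block of data on different machines, three by default, so that losing a machine does not lose the data. The sketch below is an assumption-laden toy: the round-robin placement and the names place_blocks and readable_blocks are hypothetical, and real HDFS uses a more sophisticated, rack-aware placement policy.

```python
from __future__ import annotations


def place_blocks(num_blocks: int, nodes: list[str], replication: int = 3) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct nodes, round-robin (toy policy)."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


def readable_blocks(placement: dict[int, list[str]], failed: set[str]) -> set[int]:
    """Return the blocks that still have at least one replica on a healthy node."""
    return {
        block
        for block, replicas in placement.items()
        if any(node not in failed for node in replicas)
    }


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3", "node4"]
    placement = place_blocks(num_blocks=6, nodes=nodes)
    survivors = readable_blocks(placement, failed={"node1", "node2"})
    print(f"{len(survivors)} of {len(placement)} blocks still readable")
```

With three replicas per block, any two machines can fail and every block remains readable, because at least one copy always sits on a healthy machine.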
Data Lakes
“Data Lake” is a relatively new term that helps explain why we need solutions like Hadoop. In simple terms, a data lake is a place to store all of an organization's data, both structured and unstructured.
Structured data is traditional data stored in neat and organized rows and columns. Good examples include records from enterprise-level accounting, human resources and transaction systems.
Unstructured data is often raw information. Good examples are social streams, texts, log files, images, video, emails, and documents.
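The difference shows up as soon as you try to read the data. In the sketch below, the structured sample already carries its column names, while the raw log line has to have a structure imposed on it when it is read. All sample data, field names and the helper names read_structured and read_log_line are hypothetical, and the log pattern assumes a common web server log layout.

```python
from __future__ import annotations

import csv
import io
import re

# Structured: rows and columns with a header that names every field.
STRUCTURED = "employee_id,department,salary\n101,Accounting,58000\n102,HR,61000\n"

# Raw: a single line of text; its fields are implied, not declared.
RAW_LOG_LINE = '203.0.113.7 - - [08/Apr/2016:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 5120'

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+)'
)


def read_structured(text: str) -> list[dict[str, str]]:
    """Parse CSV text into one dictionary per row, keyed by the header names."""
    return list(csv.DictReader(io.StringIO(text)))


def read_log_line(line: str) -> dict[str, str] | None:
    """Extract named fields from a raw log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(read_structured(STRUCTURED)[0]["department"])
    print(read_log_line(RAW_LOG_LINE))
```

A data lake stores both kinds of data as they arrive and leaves the second kind of interpretation until the data is actually needed.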
IBM estimates that, “Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone.” (2). That's a lot of data.
Real World – Real Results
Many technologies have promised to revolutionize how we look at and interact with the world. Those that deliver and disrupt the status quo become the drivers of innovation. For example, networking advancements over the last few decades have helped drive the growth of the Internet. To most users these advancements are transparent, as they only interact with the last mile of technology, the segment that they can see, touch and feel. Hadoop and related technologies are now starting to make that kind of behind-the-scenes impact.
Zions Bank leveraged MapR, a commercial distribution of Apache Hadoop, to improve security for its customers. A case study on the MapR website summarizes the results as follows:
“Utah-based Zions Bank, a subsidiary of Zions Bancorporation that operates more than 500 offices and 600 ATMs in 10 western U.S. states, relies on MapR for a critical part of their security architecture. MapR helps Zions identify phishing activity in real time and minimize the impact. With MapR, Zions can store larger volumes of data for longer periods and can run more detailed analytics and forensics. MapR provides Zions with unsurpassed security features, ease of management and superior performance capabilities, which allow for a more efficient use of hardware and a better ROI.” (3)
British Airways leveraged Hortonworks Data Platform (HDP) 2.2, a commercial distribution of Apache Hadoop, to migrate a data archive off its enterprise data warehouse platform, decreasing storage costs and freeing storage space for other initiatives.
Computerworld UK reported on remarks made by Alan Spanos, Data Exploitation Manager at British Airways, at a Hadoop conference in Brussels in 2015:
“Since deploying Hortonworks 2.2 HDP… his department has returned on its investment within a year, and is able to deliver 75 percent more free space for new projects, which translates to cost reductions to the airline’s finance team.” (4)
Look for Hadoop and related technologies to continue to evolve and change the world we live in. Who knows where the next “Spark” (5) may come from?
References and Citations:
(1) Apache Hadoop Website, http://hadoop.apache.org/, 4/8/2016
(2) IBM Website, “What is Big Data?”, http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html, 4/9/2016
(3) MapR Website, “Big Data and Apache Hadoop for Financial Services”, https://www.mapr.com/solutions/industry/big-data-and-apache-hadoop-financial-services, 4/8/2016
(4) Margi Murphy, Computerworld UK, “British Airways cuts memory costs and sees ROI within one year after deploying Hadoop”, http://www.computerworlduk.com/news/infrastructure/british-airways-cuts-memory-costs-sees-roi-within-one-year-after-deploying-hadoop-3607982/, 4/15/2015
(5) Spark Website, http://spark.apache.org/, 4/8/2016
Disclaimer, Copyright and Trademark Statement
This article is provided for informational and educational purposes. It makes no warranties as to the claims, accuracy or fitness of information provided, referenced or cited. Use of the information, instructions and any examples contained in this work is at your own risk. There should be no implied endorsement of this article by any person or organization referenced.
All trademarks, company, product and services names, images, descriptions, or public website content are property of their respective owner as source referenced. It is your responsibility to ensure that your use thereof complies with such licenses and/or rights.