Hadoop: A guide for the average business professional – Part One

If you've ever entered 'big data' or 'business intelligence' (BI) into Google, there's a good chance you've come across websites that mention Hadoop.

How has this framework affected data analysis solution consulting? Better yet, what even is Hadoop? Sounds like the name of a stuffed animal. 

The big yellow elephant

Hadoop is an open-source software framework that enables users to process large data sets across machine clusters (i.e. groups of computers that act as one) using simple programming models.
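
To make "simple programming models" concrete: Hadoop's best-known model is MapReduce, where you write a map function and a reduce function and the framework handles spreading the work across machines. Here's a toy, single-machine sketch of the idea in plain Python (no Hadoop required; the function names are illustrative, not Hadoop's actual API):

```python
from collections import defaultdict

# Map step: turn each line of input into (word, 1) pairs.
def map_phase(lines):
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

# Shuffle step: group all values by key, as the framework would between stages.
def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

# Reduce step: collapse each key's values into a single result.
def reduce_phase(grouped):
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data big insight", "big data big value"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"])  # prints 4
```

In a real Hadoop job, the map and reduce calls would run on different machines in the cluster at the same time, which is what makes the model scale.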

Ironically enough, Hadoop was named after the yellow stuffed elephant belonging to co-founder Doug Cutting's son, according to Kevin Sitto and Marshall Presser's "Field Guide to Hadoop". Mr Cutting and co-founder Mike Cafarella started the project to crawl the web and index its content, forming the basis of a search engine.

A video from Intricity101 provides a rundown of Hadoop's history.

The framework is distributed under the Apache License 2.0, which allows organisations and professionals to: 

  • Download Hadoop or any of its associated projects for personal and commercial uses, as well as for any purposes related to a company's internal operations.
  • Modify the code and use any versions that an individual or company may create.

As per the Apache License, Hadoop is categorised as an open-source solution, meaning you don't have to pay to use it, and your developers can adjust the code to suit your BI needs.

How does it impact my operations?

OK, you can process data across multiple computers, so what? MapR, which develops, distributes and supports Hadoop, described two key technologies: the Hadoop Distributed File System (HDFS) and Hadoop YARN.

HDFS is lauded for delivering scalable data storage. According to MapR, the solution stores data across multiple computers connected to form a single cluster, which makes it easier to process information in parallel across all machines. The result is cost-effectiveness: instead of spending thousands or tens of thousands of dollars to store and maintain a terabyte of information, HDFS running on ordinary commodity hardware can keep that expense to three figures.
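
As a rough sketch of what HDFS does with a file: it splits the data into fixed-size blocks and stores several copies of each block on different machines, so no single failure loses anything. This is a toy simulation, not real HDFS code; the tiny block size here stands in for HDFS's common default of 128 MB, and the replication factor mirrors its default of 3:

```python
import itertools

BLOCK_SIZE = 4    # bytes per block (a real HDFS default is 128 MB)
REPLICATION = 3   # copies of each block (the real HDFS default is 3)
nodes = {name: [] for name in ["node-1", "node-2", "node-3", "node-4"]}

def store(data):
    """Split data into blocks and spread copies across the cluster."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    node_cycle = itertools.cycle(nodes)
    for block in blocks:
        # Place each copy on a different machine in turn.
        for _ in range(REPLICATION):
            nodes[next(node_cycle)].append(block)
    return blocks

blocks = store(b"hadoop stores data in replicated blocks")
print(len(blocks))  # the file was split into 10 blocks
```

Because every machine holds only a slice of the file, each one can work on its own slice at the same time, which is where the parallel processing described above comes from.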

Alright, so HDFS can store a lot of data. How are you going to process it? That's where YARN comes into play. YARN (Yet Another Resource Negotiator) handles resource management and job scheduling, giving Hadoop the oversight it needs to run securely and efficiently. In addition, it allows multiple enterprise applications to access HDFS and process different data sets simultaneously. YARN essentially gives HDFS the flexibility enterprises need.
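
Loosely speaking, YARN's job is to hand out slices of the cluster's resources ("containers") to the applications competing for them. A toy sketch of that idea, purely illustrative (real YARN's scheduling is far more sophisticated, and these names are made up for the example):

```python
cluster_memory_gb = 16  # total memory the scheduler can hand out (toy figure)

def request_container(app, needed_gb, available):
    """Grant an application a container if the cluster has room."""
    if needed_gb <= available:
        return available - needed_gb, f"{app}: granted {needed_gb} GB"
    return available, f"{app}: waiting for resources"

available = cluster_memory_gb
for app, need in [("sales-report", 6), ("fraud-model", 8), ("log-archive", 4)]:
    available, status = request_container(app, need, available)
    print(status)
```

The first two applications fit and run at once; the third waits until memory frees up. That juggling act, multiplied across hundreds of machines and jobs, is the "simultaneous processing" YARN enables.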

If you've made it to the end, you have a pretty good idea of what Hadoop does. In Part Two, we'll discuss how Hadoop impacts big data and BI projects.