1. About Hadoop and Big Data
In 2013 I was lucky enough to go to the Hadoop Summit in San Jose and to attend a Hadoop course while I was there. Here are some notes and explanations, should you not know this already.
1.1 What is Big Data
First it might be appropriate to try to define what Big Data is. On the Internet there are many definitions, but there are 4 main characteristics, often described as the 4 V’s:
- Volume: terabytes, petabytes and even exabytes of data.
- Variety: any type of structured and unstructured data
- Velocity: data flows into our organisation at an ever-increasing rate.
- Variability: referring to the inherent “fuzziness” of the data, in terms of its meaning and content.
What is data? Jer Thorp has come up with a very good definition: data is the measurement of something.
Big Data includes all kinds of data: structured (has a schema), semi-structured and unstructured (JPGs, PDF files, audio and video).
Big Data also has two important inherent characteristics:
- Time-based: a piece of data is something known at a certain moment in time.
- Immutable: as it is tied to a point in time, it will not be updated. Changes are new entries.
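These two characteristics can be illustrated with a small sketch (the sensor names and values are made up): an append-only log where a change never overwrites an old entry, and the “current” state is something you derive from the timestamps.

```python
from datetime import datetime, timezone

# Append-only log: every piece of data is a fact tied to a moment in time.
log = []

def record(sensor, value, at):
    """A change is not an update in place; it is simply a new entry."""
    log.append({"sensor": sensor, "value": value, "at": at})

def current_value(sensor):
    """The 'current' state is derived: the entry with the latest timestamp."""
    entries = [e for e in log if e["sensor"] == sensor]
    return max(entries, key=lambda e: e["at"])["value"]

record("temp-1", 21.5, datetime(2013, 6, 1, tzinfo=timezone.utc))
record("temp-1", 22.0, datetime(2013, 6, 2, tzinfo=timezone.utc))  # a new fact, not an update
```

Nothing is ever deleted or modified, so the full history stays available for analysis later.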
1.2 What is Hadoop
Hadoop is many things. It consists of HDFS (the Hadoop Distributed File System), MapReduce, and the Hadoop ecosystem: other projects built on top of those two core components. Hadoop is an open-source project released under the Apache license. However, some vendors do build closed-source solutions on top of Hadoop, such as RAnalytics and others. Some of the biggest contributors to the project are HortonWorks, Yahoo, and Microsoft.
Also a little reminder: Hadoop only makes sense in systems that enable distributed computing over a large number of servers. That was the whole idea: to be able to compute on large amounts of data using relatively cheap commodity hardware.
1.3 Components of Hadoop
The core of Hadoop is a framework for storing data and a distributed programming framework.
Hadoop has two key features:
HDFS: the Hadoop Distributed File System is a distributed, scalable and portable file system written in Java for the Hadoop framework. The file system lives on top of a normal “file system”.
MapReduce: Hadoop provides an implementation of MapReduce, a programming model for processing distributed data on clusters of machines.
1.3.1 The Hadoop Ecosystem.
There is a large group of technologies and frameworks associated with Hadoop:
Pig: a scripting language that simplifies the creation of MapReduce jobs and excels at exploring and transforming data. Ideal for ETL operations.
Hive: provides SQL-like access to your Big Data.
HBase: a Hadoop database.
HCatalog: for defining and sharing schemas
Ambari: for provisioning, managing and monitoring Apache Hadoop clusters
ZooKeeper: an open-source server which enables highly reliable distributed coordination.
Sqoop: for efficiently transferring bulk data between Hadoop and relational databases.
Mahout: an Apache project whose goal is to build scalable machine learning libraries
Flume: for efficiently collecting, aggregating and moving large amounts of log data.
There are many other products and tools in the ecosystem, including:
Hadoop as a service: includes Microsoft HDInsight and Rackspace Private Cloud
Programming framework: includes Cascading, Hama and Tez
Data integration Tools: includes Talend Open Studio.
HDFS is a distributed file system living on top of the normal file system. When files are loaded into HDFS, they are split into blocks and distributed across the Hadoop nodes. That means a big file, terabytes in size, is spread out as many smaller blocks on different nodes (servers). Furthermore, each block is replicated 1 to n times (3 is the default). So a 1 TB file in default mode takes up 3 TB of space, namely the replication factor multiplied by the file size (default 3 × 1 TB = 3 TB). This is completely transparent to the user, who only sees single file instances. Hadoop turns big data into small data, which can be handled, and the replication makes it robust: should something go wrong with one block, Hadoop will notice and simply use a replica of that block on another node.
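The arithmetic above is easy to sketch; assuming the Hadoop 1.x defaults of 64 MB blocks and a replication factor of 3:

```python
import math

def hdfs_footprint(file_size_bytes, block_size=64 * 1024 * 1024, replication=3):
    """Number of HDFS blocks a file is split into, and the raw disk
    space it occupies once every block is stored `replication` times."""
    blocks = math.ceil(file_size_bytes / block_size)
    raw_storage = file_size_bytes * replication
    return blocks, raw_storage

one_tb = 1024 ** 4
blocks, raw = hdfs_footprint(one_tb)
# a 1 TB file becomes 16,384 blocks of 64 MB, occupying 3 TB of raw disk
```

The block size and replication factor are both configurable per cluster (and even per file), so treat the defaults here as illustrative.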
MapReduce is somewhat hard to understand at first. It is a software framework for developing applications that process large amounts of data in parallel across a distributed environment. It consists of two phases, a map phase and a reduce phase.
1. Map Phase: data is input to the mapper, where it is transformed and prepared for the reducer
2. Reduce Phase: retrieves the data from the mapper and performs the desired computations or analysis.
Some important concepts to understand about MapReduce:
- The Map and Reduce tasks run in their own JVMs on the DataNodes (which becomes clear when you run your first analysis). This is a major reason why it is not a good idea to use Hadoop on small data: it takes time to spin up the different JVMs.
- The mapper inputs key/value pairs from HDFS and outputs intermediate key/value pairs. The data types of the input and output can be different.
- After all the mappers finish executing, the intermediate key/value pairs go through a shuffle and sort phase, in which all the values that share the same key are combined and sent to the same reducer.
- The reducer inputs the intermediate <key,value> pairs and outputs its own <key,value> pairs, which are typically written to HDFS.
- The number of mappers is determined by the input format.
- The number of reducers is determined by the MapReduce job configuration.
- A Partitioner is used to determine which <key,value> pairs are sent to which reducer.
- A Combiner can optionally be configured to combine the output of the mapper, which can increase performance by decreasing the network traffic in the shuffle and sort phase.
Or in plainer words: a list of <key,value> pairs is mapped into another list of <key,value> pairs, which is grouped by key and reduced into a list of values.
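That flow can be simulated in a few lines of Python, using the classic word-count example. Running it in one process obviously skips the distribution, but the data flow through map, shuffle/sort and reduce is the same:

```python
from collections import defaultdict

def mapper(line):
    # input: a line of text; output: intermediate (key, value) pairs
    for word in line.split():
        yield (word.lower(), 1)

def shuffle_and_sort(pairs):
    # group all values that share the same key, as the framework does
    # between the map and reduce phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(key, values):
    # output: one (key, aggregate) pair per key
    return (key, sum(values))

lines = ["the quick brown fox", "the lazy dog"]
intermediate = [pair for line in lines for pair in mapper(line)]
counts = dict(reducer(key, values) for key, values in shuffle_and_sort(intermediate))
# counts["the"] == 2
```

On a real cluster each mapper would read its own HDFS block and each reducer would receive only its partition of the keys, but the mapper/reducer functions themselves would look just like these.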
More information: Google is your friend.
2. Approach data differently.
First of all, Hadoop is made for Big Data crunching, and if you run it against other technologies on small files (probably less than 1 TB), it will probably not be a lot faster, although it might be. However, it still gives you some excellent tools for providing structure (a schema) to otherwise unstructured data (data with no schema).
2.2 Structure versus no structure.
When we start to analyse new projects involving data today, much of what we do is deciding what to keep and what to ignore. All of our relational systems work this way, and vendors like OSI-soft and WW also have an approach where you have to decide up front what to use and what to ignore. Or, to put it another way, you have to have thought up the use cases before getting the data.
This has several advantages but also some very severe disadvantages. It is impossible to know up front what will be important to you in the future. And once you realise it, it is too late: if you did not decide to collect the data in the past, it will not be there for analysis in the present (unless you have invented the time machine).
If we are going the Big Data way, the approach must be very different: you just save EVERYTHING. Many use the term “data lake”, i.e. a lake you just pour all your data into. Data should be kept “raw” for as long as possible, to avoid losing data you might need. Later, when you need to analyse the data, you provide structure and order.
3. The easiest way to learn PIG and HIVE with real (not so big) data
HortonWorks, which is a major contributor to the Hadoop project and is also helping Microsoft implement HDInsight, has made a really cool Hadoop implementation on a virtual machine that you can download for free. They call it the HortonWorks Sandbox. All you need apart from this is VMware version 9 or later (unfortunately the old one from Dong does not do the job). You can also choose VirtualBox, which is free, but I have no experience with it. I have used VMware on Linux, Mac and Windows, and it works flawlessly everywhere.
3.1 The sandbox.
Download the Sandbox VMware file from HortonWorks (it looks something like “Hortonworks+Sandbox+1.3+VMware+RC6.ova”) and open it with your VMware program. Before you start it, it is worth considering whether you want to change the default settings for memory and processor cores. I would suggest having at least 8 GB installed on the host machine and giving the virtual machine between 2 and 4 GB, but you can experiment with that as you go along. Once it is up and running, you’ll encounter a screen like this:
Open a browser and type the IP address into the address field; everything then runs in the browser. BEWARE: you might run into problems if you use MSIE9, which is not fully (far from it) HTML5 compatible. The best option is to use Chrome.
Run the tutorials and try out some of your own data. Just remember that you’ll get no performance gain here, as Hadoop runs in single-node mode, so no distributed computing takes place (logically). But it is a great learning tool, and by using small subsets of your data, you’ll be able to try out different analyses and cross-check their results.
4. HDInsight (Hadoop in the cloud).
Microsoft has made Hadoop available on the Azure cloud platform. So if you want to crunch some really big data, you have the possibility to upload as much as 3 TB of data with our present Azure company account.
4.1 Getting an Azure enabled account.
First of all you have to make a “Live Account” with a “Live ID” at Microsoft. You should make this account using your company email.
Azure has a completely incomprehensible pricing structure, and some of it gets very expensive very fast.
4.2 Making your own HDInsight cluster
BEWARE: it seems that Microsoft gives your Live ID two different accounts, one as an organisational account and the other as your Live ID account. Once all the account details are taken care of, you must apply to be part of the HDInsight preview scheme.
4.2.1 Simple Cookbook
1. Sign into Azure and go to the main portal.
2. Select HDInsight and get enrolled in the HDInsight preview scheme. Usually this takes only a few minutes, but it might take several days.
3. Make a storage cluster. HDInsight needs a storage cluster to be created first.
4. When you are enrolled in the preview scheme, you can create a new HDInsight cluster.
5. You should probably delete the cluster and make a new one if you are not going to use it for a while. A cluster without load costs approx. 5,500 DKK/month. The good thing is that all of your data (not your custom scripts, though) stays on the storage. So the next time you make a new HDInsight cluster, you can just use the “Custom Create” option and choose your old storage/container for the new cluster. HOWEVER, beware: if you have made Pig scripts or Hive scripts on the virtual machine, you must copy them and keep them safe elsewhere.
6. When the cluster is up and running, you’ll be able to access it from a remote desktop:
Log into “Manage your Cluster” and select the “Remote desktop” option. It will download a connection profile and open the remote desktop for you.
7. On the desktop, open the shortcut for “Hadoop Command Line”. If you want a Pig terminal, enter the following:
c:\apps\dist\hadoop-1.1.0-SNAPSHOT>..\pig-0.9.3-SNAPSHOT\bin\pig and you will get the “grunt” prompt, from which you can execute Pig commands.
8. If you want to run a Pig script, enter c:\apps\dist\hadoop-1.1.0-SNAPSHOT>..\pig-0.9.3-SNAPSHOT\bin\pig myPigScript.pig
9. For Hive it is the same procedure. To get a Hive prompt, enter c:\apps\dist\hadoop-1.1.0-SNAPSHOT>..\hive-0.9.0\bin\hive and you can then execute HQL commands.
10. Hive scripts are run with c:\apps\dist\hadoop-1.1.0-SNAPSHOT>..\hive-0.9.0\bin\hive -f myHiveScript.hive
11. To access the HDFS file system, enter hadoop fs -lsr to recursively list all the content in the file system. Of course there are many more file system commands at your disposal; hadoop.apache.org -> Documentation is the way to find them.
Pig and Hive are “high-level” entries into the Hadoop system. You can of course write your own MapReduce jobs to make a very special analysis, or to add new functions to Pig and Hive in case of special needs. Also beware of the format of your data (decimal separator, etc.).
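The decimal-separator warning deserves a concrete example. The Hadoop tools expect dot-decimal numbers, so data exported with a Danish/European locale (comma as decimal separator, dot as thousands separator) should be normalised before loading. A minimal sketch; the separator conventions are assumptions about your particular export format:

```python
def normalise_decimal(field, thousands_sep=".", decimal_sep=","):
    """Convert e.g. '1.234,56' (Danish-style) to '1234.56', the
    dot-decimal form that Pig and Hive will parse as a number."""
    return field.replace(thousands_sep, "").replace(decimal_sep, ".")

# run over each numeric field before the file is loaded into HDFS
normalise_decimal("1.234,56")  # -> "1234.56"
```

If your export has no thousands separator, the same function still works: only the comma is then rewritten.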
4.2.2 How do I get data onto my storage?
By placing the analysis in the cloud, Microsoft pretty much assumes that the data is also created in the cloud. To get smaller amounts of data up, you can use the storage API and some C#, or use one of the free ready-made tools out there, such as Azure Storage Explorer. It will take you a while to upload 1 TB this way, though.
5. Useful links: