Building a Multi-Node Hadoop v2 Cluster with Ubuntu on Windows Azure

With HDInsight, the Windows Azure platform provides a powerful Platform-as-a-Service (PaaS) offering for quickly spinning up and managing Hadoop clusters on top of Windows VMs. These clusters are based on the HortonWorks Data Platform (HDP) distribution. Currently, the newest version of HDP in HDInsight is 1.3.0, which is deployed with HDInsight version 2.1 (go here for the Microsoft versioning story). HortonWorks will surely release a 2.x version for HDInsight on Windows eventually, but what if you prefer Ubuntu or need a Hadoop v2 cluster right now?

Well, the good news is that Windows Azure is a very flexible platform: besides platform services like HDInsight it also offers a powerful Infrastructure-as-a-Service (IaaS) model that lets you deploy virtual machines based on Windows or Linux and manage them in virtual networks. So, Windows Azure IaaS allows you to build your own Apache Hadoop v2 cluster running on Ubuntu. In this post I will walk you through the steps to get there. We will build a Hadoop 2.2.0 cluster consisting of a single master node and two slave nodes, using a custom-built VM image for all Hadoop nodes. The same approach enables you to build arbitrarily sized Hadoop clusters on top of Windows Azure.


I assume that you already have some practical knowledge about working with Windows Azure and know how to create networks, virtual machines, etc. I will reference the online documentation in order to keep the post compact.

Create a Virtual Network

All VMs in our Hadoop cluster will be deployed to a single virtual network in order to achieve network visibility among the nodes. So, let’s create a new network with the following parameters (go here for a general step-by-step walkthrough):

Name hadoopnet
Address Space 10.0.0.0/8
Subnet Name hadoop
Subnet IP Range 10.0.0.0/24

Of course you can pick different values and IP ranges, but I recommend sticking with the ones above as I will refer to them throughout the post.

Remark: You don’t need to configure a DNS server as long as you deploy fewer than 50 VMs in total. We will deploy all Ubuntu machines to the same cloud service, which gives us Azure-provided name resolution. If you need more than 50 machines (the maximum number of VMs per cloud service), you would have to deploy them to multiple cloud services and additionally deploy a DNS server in the virtual network or use the hosts file.

Build an Ubuntu Image

First we are going to build an Ubuntu VM that will serve as a template for all other nodes in the cluster. You can create the VM in the management portal or use PowerShell, whatever your preferred approach is. Go here for general instructions about how to do it in the portal. Use the following parameters when creating the new VM:

Image Ubuntu Server 12.04 LTS
Virtual Machine Name hdtemplate
Size Small
New User Name hduser
Password <your choice>
Cloud Service Create a new cloud service
Cloud Service DNS Name hadoopv2.cloudapp.net
Virtual Network hadoopnet
Subnet hadoop

Leave the defaults for all other parameters. You might have to pick a different DNS name if the one above is already taken. In that case you will have to replace all URLs in this post with your own cloud service name.

As soon as the VM is in the ‘Running’ state, connect via SSH (see here) using the hduser account you specified above.
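
From a local terminal the connection could look like the following; the public port of the VM’s SSH endpoint is shown in the portal and is usually 22 for the first VM in a cloud service:

ssh -p 22 hduser@hadoopv2.cloudapp.net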

(1) Install Java

As pretty much all of Hadoop is written in Java, the first thing we need to do is install a JDK. Run the following commands in the shell:

sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java7-installer
sudo apt-get install oracle-java7-set-default
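
To verify that the JDK was installed and set as the default, check the version; it should report a 1.7.0_xx build:

java -version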

(2) Install Hadoop

Next, we will download Hadoop 2.2.0 for our environment. Other versions from the v2 branch should also work. You might want to select a different mirror for your download.

wget http://apache.mirrors.tds.net/hadoop/common/hadoop-2.2.0/hadoop-2.2.0.tar.gz
tar -xvzf hadoop-2.2.0.tar.gz
sudo mv hadoop-2.2.0 /usr/local

Now, our Hadoop distribution is located in /usr/local/hadoop-2.2.0.

(3) Set Environment Variables for Java & Hadoop

We have to set a couple of environment variables for Java & Hadoop to work properly. To do so, go to hduser’s home directory (/home/hduser) and edit the .bashrc file, e.g. by typing nano .bashrc (or use another text editor like vi). Add the following statements to the end of the file:

export HADOOP_PREFIX=/usr/local/hadoop-2.2.0
export HADOOP_HOME=/usr/local/hadoop-2.2.0
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop

# Native Path
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib"

#Java path
export JAVA_HOME=/usr/lib/jvm/java-7-oracle

# Add Hadoop and Java binaries to PATH
export PATH=$PATH:$HADOOP_HOME/bin:$JAVA_HOME/bin:$HADOOP_HOME/sbin

Next, you have to set JAVA_HOME in $HADOOP_HOME/etc/hadoop/hadoop-env.sh as follows:

export JAVA_HOME=/usr/lib/jvm/java-7-oracle
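
After saving both files, reload your profile and run a quick sanity check; the exact build details in the output will vary:

source ~/.bashrc
hadoop version
echo $JAVA_HOME

hadoop version should report Hadoop 2.2.0, and echo $JAVA_HOME should print /usr/lib/jvm/java-7-oracle.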

(4) Configure the Hadoop Cluster

Now, we need to set a couple of parameters in the Hadoop configuration files in the $HADOOP_HOME/etc/hadoop folder.

core-site.xml

<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://master:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hduser/tmp</value>
    </property>
</configuration>

Note that master is the hostname of the master node in the cluster and tmp is a directory for temporary files that you need to create in hduser’s home directory. Go ahead and create that directory now.
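
Creating it is a one-liner:

mkdir -p /home/hduser/tmp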

hdfs-site.xml

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/home/hduser/hdfs/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/home/hduser/hdfs/datanode</value>
    </property>
</configuration>

Note that the value of dfs.replication is the number of replicas you want to keep in your HDFS file system. As we will start with two slave nodes, let’s set it to 2.

The configuration also defines the paths for the Hadoop namenode (dfs.namenode.name.dir) and the datanodes (dfs.datanode.data.dir). Note that you need to create those directories before starting the cluster. So, go ahead and create an hdfs directory in hduser’s home dir with two subdirectories, namenode and datanode.
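
One command creates all three directories:

mkdir -p /home/hduser/hdfs/namenode /home/hduser/hdfs/datanode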

mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

There might only be a mapred-site.xml.template file in the directory; in that case create a copy first and call it mapred-site.xml.
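
Assuming you have already reloaded the .bashrc from above (so that HADOOP_CONF_DIR is set), this is a one-liner:

cp $HADOOP_CONF_DIR/mapred-site.xml.template $HADOOP_CONF_DIR/mapred-site.xml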

yarn-site.xml

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>master:8031</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>master:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>master:8030</value>
    </property>
</configuration>

Note that master is the hostname of the master node in the cluster.

(5) Edit the hosts File

One more thing we need to do is add mappings for the master and all slave nodes to the hosts file. To do that, open the /etc/hosts file (with elevated privileges, i.e. using sudo) and add the following entries:

10.0.0.4    master
10.0.0.5    slave01
10.0.0.6    slave02

Note that these entries depend on the order of VM deployment. The master node will be deployed as the first machine in the virtual network, which (at least today) means it will get the first available IP address in that subnet, 10.0.0.4. In any case, check later on whether the master and the slaves actually got these IP addresses; if not, adjust the hosts file accordingly. Also make sure to specify the correct hostname for the master and all slaves.
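
Later on, once the master and slaves are up and running, you can quickly check which private IP each VM actually got, either in the portal or directly on the machine:

ifconfig eth0 | grep "inet addr"

The output should contain an inet addr entry in the 10.0.0.x range.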

(6) Create the VM Image

Now, we’re done preparing the content of our Hadoop VM. Let’s create a custom image as a template for all nodes in the cluster. In order to generalize the VM run the following command:

sudo waagent -deprovision

Exit the SSH session and shut down the VM in the management portal (or using PowerShell). After the VM has shut down, follow these steps to create an image of the VM. Name the image hdimage. After the process has finished, the new image will show up in the Images section of your Azure subscription:

[Screenshot: the hdimage image listed in the Images section of the portal]

Build the Master

Now, we are going to create the first VM, serving as the master node in our Hadoop cluster. Again, you can create the VM in the portal or via PowerShell. Use the following parameters when creating the new VM:

Image hdimage
Virtual Machine Name master
Size Medium
New User Name hduser
Password <your choice>
Cloud Service hadoopv2
Cloud Service DNS Name hadoopv2.cloudapp.net
Virtual Network hadoopnet
Subnet hadoop
Endpoints Name – Public Port – Private Port
HDFS – 50070 – 50070
Cluster – 8088 – 8088
JobHistory – 19888 – 19888

Note that we are creating the master from our custom image. Make sure to use the same hduser account that we used for the template. You might want to pick a different VM size, depending on what you plan to use the cluster for. In HDInsight, the master node is deployed as an XL instance, the slave nodes as L instances.

Create the three endpoints above as TCP endpoints; they will let you inspect the cluster from a browser later on.

Build the Slaves

Now, we are going to create two VMs serving as slaves in the Hadoop cluster. Use the following parameters when creating the first slave VM:

Image hdimage
Virtual Machine Name slave01
Size Medium
New User Name hduser
Password <your choice>
Cloud Service hadoopv2
Cloud Service DNS Name hadoopv2.cloudapp.net
Virtual Network hadoopnet
Subnet hadoop

Do the same for the second slave using the Virtual Machine Name slave02.

Note that we are also creating the slaves from the image created above! Make sure to use the same hduser account that we used for the template. Your Hadoop cluster VMs should now look like this in the portal:

[Screenshot: the three cluster VMs (master, slave01, slave02) listed in the portal]

Configure the Master

The master has to know about all the slave nodes in the cluster. Open the $HADOOP_HOME/etc/hadoop/slaves file and add the following entries:

slave01
slave02

Make sure to remove the entry for localhost, as we don’t want the master to be a slave in our setup.

In the next step we will enable the master to connect to both slaves via password-less SSH, which it needs in order to start the Hadoop daemons. To do that, first generate a key pair in hduser’s home directory (/home/hduser):

ssh-keygen -t rsa -P ""

Accept the default file name (.ssh/id_rsa). Next, copy the public key to slave01 by executing the following command (when prompted enter the password for hduser):

ssh-copy-id -i .ssh/id_rsa.pub hduser@slave01

Now, when you SSH to slave01 you shouldn’t have to enter the password. Test it by typing the following command:

ssh hduser@slave01

It should take you straight to an hduser@slave01:~$ prompt. Exit the session and repeat the last two commands for slave02.
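
That is:

ssh-copy-id -i .ssh/id_rsa.pub hduser@slave02
ssh hduser@slave02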

Format the NameNode

Now we need to format the namenode on the master to prepare HDFS for the cluster. Run the following command on master:

hdfs namenode -format

Check the shell output for errors. If the statement was successful the last lines should look something like this:

[Screenshot: shell output of a successful namenode format]

Start the Cluster

Now we’re all set to start the cluster! First, start the namenode like this on the master:

hadoop-daemon.sh start namenode

Next, start the datanodes on the slaves by executing the following statement on master:

hadoop-daemons.sh start datanode

Note that there’s an s at the end of hadoop-daemons! This statement should bring up the datanodes on both slave01 and slave02. To check whether the nodes came up successfully, open a web browser and go to http://hadoopv2.cloudapp.net:50070/ (you might have to change the DNS name to yours). This should bring up a page with some info about the namenode, and the Cluster Summary should list 2 Live Nodes. Clicking on Live Nodes should bring up a page that looks like this:

[Screenshot: the Live Nodes page of the namenode web UI listing slave01 and slave02]
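
If you prefer the shell over the browser, the following command on the master provides the same information; the report should show two live datanodes:

hdfs dfsadmin -report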

Next, start the resource manager on master:

yarn-daemon.sh start resourcemanager

Start the node managers on the slaves by running this statement on the master:

yarn-daemons.sh start nodemanager

Note that, again, there’s an s at the end of yarn-daemons! This statement should bring up the NodeManager process on both slaves. To check whether the node managers came up successfully, open a web browser and go to http://hadoopv2.cloudapp.net:8088/cluster/nodes. This should bring up a page like this:

[Screenshot: the YARN web UI at /cluster/nodes listing both node managers]

Finally, start the job history server on master like this:

mr-jobhistory-daemon.sh start historyserver

To check whether the history server came up successfully, open a web browser and go to http://hadoopv2.cloudapp.net:19888/jobhistory. This should bring up a page like this:

[Screenshot: the JobHistory web UI]

Now, let’s check whether all required processes are running. The jps command lists all running Java processes. Executing it on the master node should result in the following list of processes:

[Screenshot: jps output on the master node]

Doing the same on slave01 and slave02 should show the following processes:

[Screenshot: jps output on a slave node]
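
For reference, this is roughly what to expect; the process IDs are placeholders and will differ on your machines. On the master, jps should list:

1780 NameNode
2250 ResourceManager
2480 JobHistoryServer
2620 Jps

On each slave:

1150 DataNode
1340 NodeManager
1480 Jps

If one of these processes is missing, the corresponding log file under $HADOOP_HOME/logs is the first place to look.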

Test the Cluster

Finally, we will run a MapReduce job to find out if our cluster is working properly. The simplest test is the Pi calculation example that ships with the standard Hadoop distribution. We will run it with 8 maps and 1000 samples each:

hadoop jar /usr/local/hadoop-2.2.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 8 1000

The final output on the console should look like this:

[Screenshot: console output of the completed Pi example job]

Navigate to http://hadoopv2.cloudapp.net:19888/jobhistory. The job history should look something like this:

[Screenshot: the Pi job listed in the JobHistory web UI]
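
If you want a second test that also exercises HDFS, a simple word count job does the trick. The paths below are just examples; any small set of text files will do:

hdfs dfs -mkdir -p /user/hduser/input
hdfs dfs -put /usr/local/hadoop-2.2.0/etc/hadoop/*.xml /user/hduser/input
hadoop jar /usr/local/hadoop-2.2.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /user/hduser/input /user/hduser/output
hdfs dfs -cat /user/hduser/output/part-r-00000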

Final Remarks

Now you’re good to go and build much larger clusters based on a custom Hadoop image. Doing so via PowerShell gives you speed and agility and saves you lots of money compared to the ‘classical’ approach of running such a cluster in your own environment. For details about using PowerShell with Windows Azure, go here.

One more thing to note: the proper way to use HDFS with Ubuntu on Windows Azure would be to mount one or more data disks on each slave node and locate all HDFS data there, as sketched below. I didn’t do that in this post in order to keep things simple. This article describes how to attach a data disk to a Linux VM.
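
For reference, here is a rough sketch of what that would look like on each slave, assuming the attached data disk shows up as /dev/sdc (check dmesg or /var/log/syslog for the actual device name) and is mounted at /mnt/hdfs:

sudo fdisk /dev/sdc        # interactively create a single primary partition, e.g. /dev/sdc1
sudo mkfs.ext4 /dev/sdc1
sudo mkdir -p /mnt/hdfs
sudo mount /dev/sdc1 /mnt/hdfs
sudo chown -R hduser:hduser /mnt/hdfs

You would then add the mount to /etc/fstab so it survives reboots and point dfs.datanode.data.dir in hdfs-site.xml to file:/mnt/hdfs/datanode.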

Have fun & let me know if the setup described above works for you.
