Let’s configure Hadoop from Ansible

Arjun Chauhan
Dec 17, 2020

Automating configuration management with Ansible becomes very convenient as a team grows and manual configuration gets difficult. I recently started learning Ansible, and I find it fascinating that there is a technological solution available for almost any kind of problem.

Today I will create a playbook that configures Hadoop 1 on a freshly booted system.

Before we begin the configuration, let’s lay down the steps we are going to follow. This is crucial because writing things down helps us manage our playbook more effectively.

  1. Copy the hadoop and jdk software to the managed node
  2. Install the software
  3. Create the namenode directory
  4. Configure the hdfs-site.xml and core-site.xml files
  5. Format the namenode directory
  6. Start the hadoop services

So the only prerequisite today is knowing what hadoop is.

A brief introduction to Hadoop

Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.

The structure of a typical Hadoop HDFS cluster looks somewhat like the figure given below.

The namenode collects and aggregates the storage contributed by the datanodes and presents it as a single filesystem, so a client connects to the namenode to use that storage. The namenode is therefore responsible for managing this storage.

So there are two types of nodes in an HDFS or hadoop storage cluster:

  • DataNode
  • NameNode

The setup for the two is not much different. Both need the Hadoop and JDK software; the JDK is needed because Hadoop is written in Java.

Today I will configure the namenode on the RHEL 8 operating system.

Let’s begin

Before beginning the tasks, let’s check that we have proper connectivity with the managed nodes.
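
One way to run this check is a tiny play built on the ping module. This is only a sketch: it assumes the managed nodes are already listed in your inventory, and `hosts: all` is a placeholder for whatever group you actually use.

```yaml
# Connectivity check: each reachable managed node should answer with "pong".
- name: Check connectivity with the managed nodes
  hosts: all            # assumption: replace with your own inventory group
  gather_facts: false   # facts are not needed for a simple ping
  tasks:
    - name: Ping the managed nodes
      ping:
```

If a node is unreachable, Ansible reports it as UNREACHABLE in the play recap, which is worth fixing before running the real tasks.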

1. Copy the hadoop and jdk software to the managed nodes

The copy module copies the files given in src on the local host to the dest folder on the managed nodes. I copied the files to the root directory of the managed node.
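
A minimal sketch of such a task is shown here. The RPM file names and the `/root/` destination are assumptions for illustration, so substitute the installers and path you actually have.

```yaml
# Copy both installers from the control node to the managed node.
- name: Copy the hadoop and jdk software
  hosts: all                            # assumption: your inventory group
  tasks:
    - name: Copy the installers to the managed node
      copy:
        src: "{{ item }}"               # looked up next to the playbook or in files/
        dest: /root/                    # assumption: root user's home directory
      loop:
        - hadoop-1.2.1-1.x86_64.rpm     # hypothetical file name
        - jdk-8u171-linux-x64.rpm       # hypothetical file name
```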

2. Install the software

We can install software using the yum module. We give the location of the rpm file in name, and state says that we want the software to be installed. However, you can notice that I used the command module to install the Hadoop software. This is because I had to pass the --force option to rpm to install it. The yum module does not support this option, so the command module helps here. The command module allows us to run OS-specific commands.
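
The two install tasks might look roughly like this. The RPM paths are hypothetical, and `disable_gpg_check` is an assumption that may be needed when a local RPM is not signed with a key the node trusts.

```yaml
# Install the JDK through yum and hadoop through a raw rpm command.
- name: Install the jdk and hadoop software
  hosts: all                                   # assumption: your inventory group
  tasks:
    - name: Install the JDK from the copied RPM
      yum:
        name: /root/jdk-8u171-linux-x64.rpm    # hypothetical path
        state: present
        disable_gpg_check: true                # assumption: local RPM without a trusted key

    - name: Install hadoop with rpm and the --force option
      # The command module always reports "changed" for this task.
      command: rpm -i --force /root/hadoop-1.2.1-1.x86_64.rpm   # hypothetical path
```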

3. Create the namenode directory

This is a fairly easy task. We can create a directory using the file module. The directory is created at the path specified, and state says that we want a directory to exist there.
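
As a sketch, with `/nn` standing in as a hypothetical directory name:

```yaml
# Create the directory that will hold the namenode metadata.
- name: Create the namenode directory
  hosts: all              # assumption: your inventory group
  tasks:
    - name: Create the directory
      file:
        path: /nn         # hypothetical path; must match hdfs-site.xml
        state: directory  # create it if it does not already exist
```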

4. Configure the hdfs-site.xml and core-site.xml files

The main configuration for the Hadoop cluster comes from the hdfs-site.xml and core-site.xml files. These files are already created and set up correctly on my local filesystem, so the task is to place them in the Hadoop folder on the target node, which is what we do here.
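
A sketch of that copy step follows. It assumes the two files sit next to the playbook and that the Hadoop configuration folder on the target is `/etc/hadoop/`; adjust both if your layout differs.

```yaml
# Place the prepared configuration files in the hadoop config folder.
- name: Configure hdfs-site.xml and core-site.xml
  hosts: all                    # assumption: your inventory group
  tasks:
    - name: Copy the prepared config files
      copy:
        src: "{{ item }}"       # files prepared on the control node
        dest: /etc/hadoop/      # assumption: config folder of the installed hadoop
      loop:
        - core-site.xml
        - hdfs-site.xml
```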

Just to give you an idea of what these files contain, you can see their content here:

core-site.xml
hdfs-site.xml

I will not go into the details of these files. I showed them just so you know what they contain.

In layman’s terms, hdfs-site.xml names the namenode directory and, by doing so, marks the current system as the namenode. core-site.xml, on the other hand, specifies the network address. Since we want anyone to be able to reach the system, we use 0.0.0.0. I run the namenode on port 9001, so the port is specified there as well. On a datanode we give the IP of the namenode instead.
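
To make the description concrete, here is a sketch that writes both files inline with the copy module's content option. Note that this is a swapped-in technique for illustration only, since the playbook in this article copies prebuilt files. The property names `fs.default.name` and `dfs.name.dir` are the Hadoop 1 names; the `/nn` directory and the `/etc/hadoop/` folder are assumptions.

```yaml
# Illustration only: generate the two namenode config files inline.
- name: Write the namenode configuration inline
  hosts: all                                  # assumption: your inventory group
  tasks:
    - name: Write core-site.xml (network address and port)
      copy:
        dest: /etc/hadoop/core-site.xml       # assumption: hadoop config folder
        content: |
          <?xml version="1.0"?>
          <configuration>
            <property>
              <name>fs.default.name</name>
              <value>hdfs://0.0.0.0:9001</value>
            </property>
          </configuration>

    - name: Write hdfs-site.xml (namenode directory)
      copy:
        dest: /etc/hadoop/hdfs-site.xml       # assumption: hadoop config folder
        content: |
          <?xml version="1.0"?>
          <configuration>
            <property>
              <name>dfs.name.dir</name>
              <value>/nn</value>
            </property>
          </configuration>
```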

So far so good.

5. Format the namenode directory

This is similar to installing the Hadoop software, where we ran an OS-specific command. Formatting is necessary to initialize the namenode directory with fresh filesystem metadata for our new configuration.
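
A sketch of the format task is given here. It assumes the namenode directory is the hypothetical `/nn`. Hadoop 1 may ask for confirmation when the directory already exists, so the task feeds a `Y` on standard input, and `creates` skips the task once the metadata is present so that a rerun does not wipe it.

```yaml
# Format the namenode directory once; skip if it is already formatted.
- name: Format the namenode directory
  hosts: all                           # assumption: your inventory group
  tasks:
    - name: Run the hadoop format command
      command:
        cmd: hadoop namenode -format
        stdin: "Y"                     # answers a possible re-format prompt
        creates: /nn/current/VERSION   # hypothetical path; present after a format
```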

So the configuration is now done. What remains is to start the services.

6. Start the hadoop services

The given command starts the hadoop services.
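
As a sketch, the start task can call the daemon script that ships with Hadoop 1. It assumes `hadoop-daemon.sh` is on the PATH of the remote user, and it does not handle the case where the namenode is already running.

```yaml
# Start the namenode daemon on the managed node.
- name: Start the hadoop services
  hosts: all                 # assumption: your inventory group
  tasks:
    - name: Start the namenode daemon
      # assumption: hadoop-daemon.sh is on the remote user's PATH
      command: hadoop-daemon.sh start namenode
```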

Now we need to run the entire playbook and look at the output.

ansible-playbook -v setup.yml

The playbook ran without any errors. That’s sweet. Now let me check whether the changes actually took effect on the target node.
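
One possible way to check from Ansible itself is to list the running Java processes with `jps` and print them. This is a sketch that assumes the JDK install put `jps` on the PATH; a process named NameNode in the list is what we are looking for.

```yaml
# Verification: show the Java processes running on the managed node.
- name: Verify that the namenode is running
  hosts: all                          # assumption: your inventory group
  tasks:
    - name: List running Java processes
      command: jps                    # assumption: jps from the JDK is on the PATH
      register: jps_output
      changed_when: false             # a read-only check never changes the node

    - name: Show the process list
      debug:
        var: jps_output.stdout_lines
```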

That’s so nice. The namenode is configured and working. The tasks for configuring a datanode are similar. The major difference is in the hdfs-site.xml and core-site.xml files. Also, instead of creating the namenode directory, we create the datanode directory. Everything else remains the same.

Conclusion

So today we automated the setup of an HDFS cluster. This is an important task in the industry, since there may be situations where we need to configure hundreds of nodes urgently. Doing this manually makes little sense because it would be slow and prone to errors. Ansible provides an easier and faster way of achieving it.

Thanks.
