10 Steps to Set Up and Manage a Hadoop Cluster Using Ironfan

MSys Editorial Dec 14 - 7 min read


Recently, we faced a unique challenge: setting up DevOps and management for a relatively complex Hadoop cluster on the Amazon EC2 cloud. A configuration management tool was the natural fit. Having used Opscode’s Chef extensively, and given the flexibility and extensibility it provides, we chose Chef.

While looking for best practices for managing a Hadoop cluster with Chef, we stumbled upon Ironfan.

What is Ironfan? In short, Ironfan, open-sourced by InfoChimps, provides an abstraction on top of Chef that lets users easily provision, deploy and manage a cluster of servers, be it a simple web application or a complex Hadoop cluster. After a few experiments, we were convinced that Ironfan was the right tool: it removes a lot of repetitive configuration while retaining the goodness of Chef.
This blog shows how easy it is to set up and manage a Hadoop cluster using Ironfan.


Prerequisites:

  • Chef account (Hosted or Private) with knife.rb set up correctly on your client machine
  • Ruby set up (using RVM or otherwise)


Now you can install Ironfan on your machine using the steps mentioned here. Once you have all the packages set up correctly, perform these sanity checks:

  • Ensure that the environment variable CHEF_USERNAME is your Chef Server username (unless your USER environment variable is the same as your Chef username)
  • Ensure that the environment variable CHEF_HOMEBASE points to the location which contains the expanded out knife.rb
  • ~/.chef should be a symbolic link to your knife directory in the CHEF_HOMEBASE
  • Your knife/knife.rb file is not modified.
  • Your Chef user PEM file should be in knife/credentials/{username}.pem
  • Your organization’s Chef validator PEM file should be in knife/credentials/{organization}-validator.pem
  • Your knife/credentials/knife-{organization}.rb file
      • Should contain your Chef organization
      • Should contain the chef_server_url
      • Should contain the validation_client_name
      • Should contain path to validation_key
      • Should contain the aws_access_key_id/ aws_secret_access_key
      • Should contain an AMI ID of an AMI you’d like to be able to boot in ec2_image_info
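
The file-layout checks above can be automated. The following is a minimal sketch, not part of Ironfan: the helper name `homebase_problems` and the `example-org` placeholder are hypothetical, and it only verifies that the files listed above exist, not that their contents are valid.

```ruby
# Hypothetical helper (not an Ironfan API): lists problems with the
# homebase layout described in the checklist above.
def homebase_problems(env, username:, organization:)
  homebase = env['CHEF_HOMEBASE']
  return ['CHEF_HOMEBASE is not set'] if homebase.nil? || homebase.empty?

  problems = []
  knife_dir = File.join(homebase, 'knife')
  credentials = File.join(knife_dir, 'credentials')

  # Files the checklist expects to find under CHEF_HOMEBASE/knife
  expected = [
    File.join(knife_dir, 'knife.rb'),
    File.join(credentials, "#{username}.pem"),
    File.join(credentials, "#{organization}-validator.pem"),
    File.join(credentials, "knife-#{organization}.rb")
  ]
  expected.each do |path|
    problems << "missing file: #{path}" unless File.file?(path)
  end

  # ~/.chef should be a symbolic link; only checked when HOME is known
  home = env['HOME']
  if home && !File.symlink?(File.join(home, '.chef'))
    problems << '~/.chef is not a symbolic link'
  end

  problems
end

# Example run against the current environment.
# 'example-org' is a placeholder; use your own Chef organization name.
user = ENV['CHEF_USERNAME'] || ENV['USER'] || 'unknown'
puts homebase_problems(ENV.to_h, username: user, organization: 'example-org')
```

An empty result means the expected files are in place; each reported line points at one item of the checklist that still needs attention.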

Finally, in the homebase, rename the example_clusters directory to clusters. These are the sample clusters that come with Ironfan. Run the knife cluster list command:

$ knife cluster list
Cluster Path: /.../homebase/clusters
| cluster | path |
| big_hadoop | /.../homebase/clusters/big_hadoop.rb |
| burninator | /.../homebase/clusters/burninator.rb |

Defining Cluster:

Now let's define a cluster. A cluster in Ironfan is defined by a single file that describes all the configuration essential for the cluster. You can customize your cluster spec as follows:

  • Define cloud provider settings
  • Define base roles
  • Define various facets
  • Define facet-specific roles and recipes
  • Override properties of a particular facet server instance.

Defining cloud provider settings:

Ironfan currently supports the AWS and Rackspace cloud providers. We will use AWS as the example. For AWS you can provide configuration such as:

  • Region in which the servers will be deployed
  • Availability zone to be used
  • EBS-backed or instance-store-backed servers
  • Base image (AMI) used to spawn servers
  • Security group with the allowed port range

Defining Base Roles:

You can define the global roles for a cluster. These roles will be applied to all servers unless explicitly overridden for any particular facet or server. All the available roles are defined in $CHEF_HOMEBASE/roles directory. You can create a custom role and use it in your cluster config.
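
As an illustration of a custom role, the sketch below builds a role in Chef's JSON role format and prints it. The role name `log_shipper` and the `recipe[rsyslog]` entry are hypothetical placeholders, and the `build_role` helper is not part of Chef or Ironfan; it only assembles the fields that a Chef role file carries.

```ruby
require 'json'

# Hypothetical helper: assembles a hash in the shape of a Chef JSON role file.
def build_role(name, description, run_list)
  {
    'name' => name,
    'description' => description,
    'json_class' => 'Chef::Role',
    'chef_type' => 'role',
    'default_attributes' => {},
    'override_attributes' => {},
    'run_list' => run_list
  }
end

# Placeholder role name and recipe; substitute cookbooks you actually use.
custom_role = build_role(
  'log_shipper',
  'Example custom role applied to every server in the cluster',
  ['recipe[rsyslog]']
)

puts JSON.pretty_generate(custom_role)
```

Assuming the output is saved as $CHEF_HOMEBASE/roles/log_shipper.json and uploaded to the Chef server (for example with knife role from file), the role could then be referenced in a cluster definition as role :log_shipper.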

Defining Environment:

Environments in Chef provide a mechanism for managing different environments, such as production, staging, development and testing, with one Chef setup (or one organization on Hosted Chef). With environments, you can specify per-environment run lists in roles, per-environment cookbook versions, and environment attributes.
The available environments can be found in the $CHEF_HOMEBASE/environments directory. Custom environments can also be created and used.
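
To make "per-environment run lists in roles" concrete, here is a simplified model in plain Ruby of how such a role resolves: if the role defines a run list for the node's environment, that list is used, otherwise the role's default run list applies. The role and recipe names are hypothetical, and this sketch only mirrors the lookup rule, not Chef's actual implementation.

```ruby
# Simplified model (not Chef's code): pick the environment-specific run list
# when one is defined, otherwise fall back to the role's default run list.
def run_list_for(role, environment)
  role.fetch(:env_run_lists, {}).fetch(environment, role.fetch(:run_list))
end

# Hypothetical role with an extra recipe in the dev environment only.
example_role = {
  name: 'example_worker',
  run_list: ['recipe[example::base]'],
  env_run_lists: {
    'dev' => ['recipe[example::base]', 'recipe[example::debug_tools]']
  }
}

puts run_list_for(example_role, 'dev').inspect
puts run_list_for(example_role, 'production').inspect
```

Here the dev environment gets the extra debugging recipe, while production, which has no entry of its own, falls back to the default run list.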

The following cluster definition brings the environment, base roles, and cloud settings together:

Ironfan.cluster 'my_first_cluster' do
  # Environment under which Chef nodes will be placed
  environment :dev

  # Global roles for all servers
  role :systemwide
  role :ssh

  # Global EC2 cloud settings
  cloud(:ec2) do
    permanent true
    region 'us-east-1'
    availability_zones ['us-east-1c', 'us-east-1d']
    flavor 't1.micro'
    backing 'ebs'
    image_name 'ironfan-natty'
    chef_client_script 'client.rb'
  end

  # Facet definitions (described in the next section) go here
end

Defining Facets:

Facets are groups of servers within a cluster. Servers in a facet share common attributes and roles. For example, if your cluster has 2 app servers and 2 database servers, you can group the app servers under the app_server facet and the database servers under the database facet.
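
The relationship between a cluster, its facets, and the individual servers can be sketched in plain Ruby. The helper below is hypothetical (it is not an Ironfan API); it assumes the cluster-facet-index naming pattern that appears in the knife output later in this post, such as hadoop_job001-master-0.

```ruby
# Hypothetical helper: expands facet instance counts into server names
# following the <cluster>-<facet>-<index> pattern.
def server_names(cluster, facets)
  facets.flat_map do |facet, count|
    (0...count).map { |index| "#{cluster}-#{facet}-#{index}" }
  end
end

names = server_names('hadoop_job001', { master: 1, worker: 2 })
puts names
# hadoop_job001-master-0
# hadoop_job001-worker-0
# hadoop_job001-worker-1
```

Each facet contributes as many servers as its instance count, which is why a server can later be addressed by cluster name, facet name, and index.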

Defining Facet specific roles and recipes:

You can define roles and recipes particular to a facet. Even the global cloud settings can be overridden for a particular facet.

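facet :master do
  instances 1
  recipe 'nginx'
  # Override the global cloud settings for this facet
  cloud(:ec2) do
    flavor 'm1.small'
    # Allow incoming HTTP and HTTPS traffic
    security_group(:web) do
      authorize_port_range(80..80)
      authorize_port_range(443..443)
    end
  end
  role :hadoop_namenode
  role :hadoop_secondarynn
  role :hadoop_jobtracker
  role :hadoop_datanode
  role :hadoop_tasktracker
end

facet :worker do
  instances 2
  role :hadoop_datanode
  role :hadoop_tasktracker
end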

In the above example we have defined one facet for the Hadoop master node and one for the worker nodes. The master facet has 1 instance and the worker facet has 2. Each facet has been assigned a set of roles. For the master facet we have overridden the EC2 flavor setting to m1.small. The security group for the master node is also set to accept incoming traffic on ports 80 and 443.
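
The override behaviour can be pictured as a simple merge of facet-level settings over the cluster-wide ones. The sketch below is a simplified model in plain Ruby, not Ironfan's actual resolution logic; the constant and method names are hypothetical, and the values are taken from the cluster definition in this post.

```ruby
# Cluster-wide EC2 settings, as in the cluster definition.
GLOBAL_CLOUD_SETTINGS = {
  region: 'us-east-1',
  flavor: 't1.micro',
  backing: 'ebs'
}.freeze

# Facet-level overrides: only the master facet changes the flavor.
FACET_CLOUD_OVERRIDES = {
  master: { flavor: 'm1.small' },
  worker: {}
}.freeze

# Simplified model: facet settings win over the global ones.
def effective_cloud_settings(facet)
  GLOBAL_CLOUD_SETTINGS.merge(FACET_CLOUD_OVERRIDES.fetch(facet, {}))
end

puts effective_cloud_settings(:master)[:flavor]  # m1.small
puts effective_cloud_settings(:worker)[:flavor]  # t1.micro
```

Anything a facet does not override, such as the region or the backing store here, is inherited unchanged from the cluster-wide settings.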

Cluster Management:

Now that the cluster configuration is ready, let's get hands-on with cluster management. All cluster configuration files are placed under the $CHEF_HOMEBASE/clusters directory. We will save our new config file as hadoop_job001_cluster.rb. The new cluster should now appear in the cluster list.

List Clusters:

$ knife cluster list
Cluster Path: /.../homebase/clusters
| cluster | path |
| hadoop_job001 | HOMEBASE/clusters/hadoop_job001_cluster.rb |

Show Cluster Configuration:

$ knife cluster show hadoop_job001
Inventorying servers in hadoop_job001 cluster, all facets, all servers
my_first_cluster:             Loading chef
my_first_cluster:             Loading ec2
my_first_cluster:             Reconciling DSL and provider information
| Name                        | Chef? | State       | Flavor   | AZ         | Env |
| hadoop_job001-master-0   | no    | not running | m1.small | us-east-1c | dev |
| hadoop_job001-client-0   | no    | not running | t1.micro | us-east-1c | dev |
| hadoop_job001-client-1   | no    | not running | t1.micro | us-east-1c | dev |

Launch Cluster:

Launch Whole Cluster:

$ knife cluster launch hadoop_job001
Loaded information for 3 computer(s) in cluster my_first_cluster
| Name                        | Chef? | State   | Flavor   | AZ         | Env | MachineID  | Public IP      | Private IP     | Created On |
| hadoop_job001-master-0   | yes   | running | m1.small | us-east-1c | dev | i-c9e117b5 |  |   | 2012-12-10 |
| hadoop_job001-client-0   | yes   | running | t1.micro | us-east-1c | dev | i-cfe117b3 |  |   | 2012-12-10 |
| hadoop_job001-client-1   | yes   | running | t1.micro | us-east-1c | dev | i-cbe117b7 |  |   | 2012-12-10 |

Launch a single instance of a facet:

$ knife cluster launch hadoop_job001 master 0

Launch all instances of a facet:

$ knife cluster launch hadoop_job001 worker

Stop Whole Cluster:

$ knife cluster stop hadoop_job001

Stop a single instance of a facet:

$ knife cluster stop hadoop_job001 master 0

Stop all instances of a facet:

$ knife cluster stop hadoop_job001 worker
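
All of the commands above follow the same shape: an action, a cluster name, then an optional facet and an optional instance index. If you script these operations, a small helper can build the command line consistently. This is a sketch only: the method and constant names are hypothetical, and it is limited to the show, launch and stop actions used in this post.

```ruby
require 'shellwords'

# Actions demonstrated in this post.
KNIFE_CLUSTER_ACTIONS = %w[show launch stop].freeze

# Hypothetical helper: builds a knife cluster command line for a whole
# cluster, a whole facet, or a single instance of a facet.
def knife_cluster_command(action, cluster, facet = nil, index = nil)
  unless KNIFE_CLUSTER_ACTIONS.include?(action.to_s)
    raise ArgumentError, "unsupported action: #{action}"
  end
  raise ArgumentError, 'an index requires a facet' if index && facet.nil?

  parts = ['knife', 'cluster', action, cluster, facet, index].compact
  Shellwords.join(parts.map(&:to_s))
end

puts knife_cluster_command(:launch, 'hadoop_job001')
puts knife_cluster_command(:launch, 'hadoop_job001', 'master', 0)
puts knife_cluster_command(:stop, 'hadoop_job001', 'worker')
```

The helper only produces the command string; running it (for example with system) is left to the caller so that the command can be reviewed or logged first.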

Setting up a Hadoop cluster and managing it cannot get easier than this!
Just to recap, Ironfan, open-sourced by InfoChimps, is a systems provisioning and deployment tool that automates system configuration for the entire Big Data stack, including tools for data ingestion, scraping, storage, computation, and monitoring.
We are also exploring another tool for Hadoop cluster management: Apache Ambari. We will post our findings and comparisons soon. Stay tuned!

