
Automated install of CDH5 Hadoop on your laptop with Ansible


Installing CDH5 from the tarball distribution is not really difficult, but getting the pseudo-distributed configuration right is anything but straightforward. And since there are a few bugs that need fixing and some configuration that needs to be done, I automated it.

Automating the steps

All the steps that need to be automated are described in my previous blog post: Local and Pseudo-distributed CDH5 Hadoop on your laptop.

All I needed to do was write some Ansible configuration scripts to perform these steps. For now I have automated the download and installation of CDH5, Spark, Hive, Pig and Mahout. Any extra packages are left as an exercise for the reader. I welcome your pull requests.

Configuration

Ansible needs some information from the user about the directory to install the software into. I first tried to use Ansible's vars_prompt module. This kind of works, but the scope of the variable is limited to the same YAML file, and I need it to be a global variable. After testing several of Ansible's ways to provide variables, I decided on using a bash script to get the user's input and pass that information to Ansible through the --extra-vars command line option.

On top of that, we use Ansible to run a playbook, which means the ansible-playbook command must be available. We assume ansible-playbook is on the PATH and working.
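As an illustration, here is a minimal sketch of what such a wrapper script could look like. The playbook name site.yml and the variable name install_dir are assumptions for the example, not necessarily what start-playbook.sh actually uses.

    #!/usr/bin/env bash
    # Hypothetical sketch of a start-playbook.sh-style wrapper; names are assumed.

    # Fail early if ansible-playbook is not on the PATH.
    command -v ansible-playbook >/dev/null 2>&1 || {
        echo "ansible-playbook not found on PATH" >&2
        exit 1
    }

    # Ask the user where to install and hand it to Ansible as a global variable.
    read -r -p "Install directory: " INSTALL_DIR

    ansible-playbook site.yml --extra-vars "install_dir=${INSTALL_DIR}"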

Getting the install scripts

Getting the install scripts is done by issuing a git clone command:

$ git clone git@github.com:krisgeus/ansible_local_cdh_hadoop.git

Install

Installing the software has become a single line command:

$ start-playbook.sh

The script will ask the user for a directory to install the software into. It then downloads the packages into the $HOME/.ansible-downloads directory and unpacks them into the install directory the user provided.
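In shell terms, each package is handled roughly like this. This is only a sketch of what the download and unpack tasks boil down to; the tarball URL is left as a placeholder, since the real URLs and archive names live in the Ansible files.

    # Rough shell equivalent of the playbook's download/unpack tasks (illustrative only).
    DOWNLOAD_DIR="$HOME/.ansible-downloads"
    mkdir -p "${DOWNLOAD_DIR}"
    # CDH_HADOOP_TARBALL_URL is a placeholder, not a variable the playbook defines.
    curl -L -o "${DOWNLOAD_DIR}/hadoop-2.3.0-cdh5.0.2.tar.gz" "${CDH_HADOOP_TARBALL_URL}"
    tar -xzf "${DOWNLOAD_DIR}/hadoop-2.3.0-cdh5.0.2.tar.gz" -C "${INSTALL_DIR}"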

In the install directory the script will also create a bash_profile add-on that sets the correct aliases. Source it to make them available:

$ source ${INSTALL_DIR}/.bash_profile_hadoop
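To give an idea of what that add-on does, here is a rough sketch: the switch_* aliases essentially point HADOOP_CONF_DIR at either a local or a pseudo-distributed configuration directory. The directory names below are assumptions; the generated file may look different.

    # Hypothetical sketch of ${INSTALL_DIR}/.bash_profile_hadoop; actual contents may differ.
    export HADOOP_PREFIX="${INSTALL_DIR}/hadoop-2.3.0-cdh5.0.2"   # INSTALL_DIR: the directory you chose
    export PATH="${HADOOP_PREFIX}/bin:${HADOOP_PREFIX}/sbin:${PATH}"

    # Switch between the local and the pseudo-distributed configuration directories.
    alias switch_local_cdh5='export HADOOP_CONF_DIR="${HADOOP_PREFIX}/etc/hadoop-local"'
    alias switch_pseudo_cdh5='export HADOOP_CONF_DIR="${HADOOP_PREFIX}/etc/hadoop-pseudo"'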

Testing Hadoop in local mode

$ switch_local_cdh5

Now all the familiar hadoop commands should work. There is no notion of HDFS other than your local filesystem, so the hadoop fs -ls / command shows you the same output as ls /:

$ hadoop fs -ls /

    drwxrwxr-x   - root admin       2686 2014-04-18 09:47 /Applications
    drwxr-xr-x   - root wheel       2210 2014-02-26 02:46 /Library
    drwxr-xr-x   - root wheel         68 2013-08-25 05:45 /Network
    drwxr-xr-x   - root wheel        136 2013-10-23 03:05 /System
    drwxr-xr-x   - root admin        204 2013-10-23 03:09 /Users
    drwxrwxrwt   - root admin        136 2014-04-18 12:34 /Volumes
    [...]

$ ls -l /

    drwxrwxr-x+ 79 root  admin   2.6K Apr 18 09:47 Applications
    drwxr-xr-x+ 65 root  wheel   2.2K Feb 26 02:46 Library
    drwxr-xr-x@  2 root  wheel    68B Aug 25  2013 Network
    drwxr-xr-x+  4 root  wheel   136B Oct 23 03:05 System
    drwxr-xr-x   6 root  admin   204B Oct 23 03:09 Users
    drwxrwxrwt@  4 root  admin   136B Apr 18 12:34 Volumes

Running a MapReduce job should also work out of the box.

$ cd $HADOOP_PREFIX

$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0-cdh5.0.2.jar pi 10 100

    Number of Maps  = 10
    Samples per Map = 100
    2014-04-19 18:05:01.596 java[74281:1703] Unable to load realm info from SCDynamicStore 14/04/19 18:05:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Wrote input for Map #0
    Wrote input for Map #1
    Wrote input for Map #2
    Wrote input for Map #3
    Wrote input for Map #4
    Wrote input for Map #5
    Wrote input for Map #6
    Wrote input for Map #7
    Wrote input for Map #8
    Wrote input for Map #9
    Starting Job
    ....
    Job Finished in 1.587 seconds
    Estimated value of Pi is 3.14800000000000000000

Testing Hadoop in pseudo-distributed mode

$ switch_pseudo_cdh5
$ hadoop namenode -format
$ start-dfs.sh
$ hadoop fs -ls /
$ hadoop fs -mkdir /bogus
$ hadoop fs -ls /
    2014-04-19 19:46:32.233 java[78176:1703] Unable to load realm info from SCDynamicStore
    14/04/19 19:46:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-
    java classes where applicable
    Found 1 items
    drwxr-xr-x   - user supergroup          0 2014-04-19 19:46 /bogus

Okay, HDFS is working. Now on to a MapReduce job:

$ start-yarn.sh
    starting yarn daemons
    starting resourcemanager, logging to /cdh5.0.0/hadoop-2.3.0-cdh5.0.2/logs/yarn-user-resourcemanager-localdomain.local.out
    Password:
    localhost: starting nodemanager, logging to /cdh5.0.0/hadoop-2.3.0-cdh5.0.2/logs/yarn-user-nodemanager-localdomain.local.out

$ cd $HADOOP_PREFIX
$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0-cdh5.0.2.jar pi 10 100
    Number of Maps  = 10
    Samples per Map = 100
    2014-04-20 10:21:56.696 java[80777:1703] Unable to load realm info from SCDynamicStore
    14/04/20 10:22:11 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Wrote input for Map #0
    Wrote input for Map #1
    Wrote input for Map #2
    Wrote input for Map #3
    Wrote input for Map #4
    Wrote input for Map #5
    Wrote input for Map #6
    Wrote input for Map #7
    Wrote input for Map #8
    Wrote input for Map #9
    Starting Job
    14/04/20 10:22:12 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
    14/04/20 10:22:12 INFO input.FileInputFormat: Total input paths to process : 10
    14/04/20 10:22:12 INFO mapreduce.JobSubmitter: number of splits:10
    14/04/20 10:22:13 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1397969462544_0001
    14/04/20 10:22:13 INFO impl.YarnClientImpl: Submitted application application_1397969462544_0001
    14/04/20 10:22:13 INFO mapreduce.Job: The url to track the job: http://localdomain.local:8088/proxy/application_1397969462544_0001/
    14/04/20 10:22:13 INFO mapreduce.Job: Running job: job_1397969462544_0001
    14/04/20 10:22:34 INFO mapreduce.Job: Job job_1397969462544_0001 running in uber mode : false
    14/04/20 10:22:34 INFO mapreduce.Job:  map 0% reduce 0%
    14/04/20 10:22:53 INFO mapreduce.Job:  map 10% reduce 0%
    14/04/20 10:22:54 INFO mapreduce.Job:  map 20% reduce 0%
    14/04/20 10:22:55 INFO mapreduce.Job:  map 30% reduce 0%
    14/04/20 10:22:56 INFO mapreduce.Job:  map 40% reduce 0%
    14/04/20 10:22:57 INFO mapreduce.Job:  map 50% reduce 0%
    14/04/20 10:22:58 INFO mapreduce.Job:  map 60% reduce 0%
    14/04/20 10:23:12 INFO mapreduce.Job:  map 70% reduce 0%
    14/04/20 10:23:13 INFO mapreduce.Job:  map 80% reduce 0%
    14/04/20 10:23:15 INFO mapreduce.Job:  map 90% reduce 0%
    14/04/20 10:23:16 INFO mapreduce.Job:  map 100% reduce 100%
    14/04/20 10:23:16 INFO mapreduce.Job: Job job_1397969462544_0001 completed successfully
    ...
    Job Finished in 64.352 seconds
    Estimated value of Pi is 3.14800000000000000000

Testing Spark in local mode

$ switch_local_cdh5
$ spark-shell
    SLF4J: Class path contains multiple SLF4J bindings.
    ...
    2014-04-20 09:48:25,238 INFO  [main] spark.HttpServer (Logging.scala:logInfo(49)) - Starting HTTP Server
    2014-04-20 09:48:25,302 INFO  [main] server.Server (Server.java:doStart(266)) - jetty-7.6.8.v20121106
    2014-04-20 09:48:25,333 INFO  [main] server.AbstractConnector (AbstractConnector.java:doStart(338)) - Started SocketConnector@0.0.0.0:62951
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 0.9.0
          /_/

    Using Scala version 2.10.3 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_15)
    Type in expressions to have them evaluated.
    Type :help for more information.
    ...
    Created spark context..
    Spark context available as sc.

    scala>

And we're in!!

Testing Spark in pseudo-distributed mode

Now, as a final test, we check whether Spark works on our pseudo-distributed Hadoop configuration.

$ switch_pseudo_cdh5
$ start-dfs.sh
$ start-yarn.sh
$ hadoop fs -mkdir /sourcedata
$ hadoop fs -put somelocal-textfile.txt /sourcedata/sometext.txt
$ spark-shell
    scala> val file = sc.textFile("/sourcedata/sometext.txt")
           file.take(5)

           res1: Array[String] = Array("First", "five lines", "of", "the", "textfile" )

The current version of the Ansible scripts is set to install the CDH 5.0.2 packages. When a new version becomes available, this is easily changed by updating the vars/common.yml file.
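For example, bumping the release can be a one-liner once you know which variables hold it. The sed command below is illustrative only; check vars/common.yml for the actual variable names, and 5.1.0 simply stands for whatever newer release you are targeting.

    # Illustrative only: replace every occurrence of the CDH release in vars/common.yml.
    sed -i.bak 's/5\.0\.2/5.1.0/g' vars/common.yml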

If you have created Ansible files that add other packages, I welcome you to send me a pull request.
