
Installing and configuring Hadoop 0.20.205 using its rpm

Recently, I got around to installing Hadoop 0.20.205 using its rpm. I also used the included configuration scripts to create a functional multi-node Hadoop configuration. I chose to use a non-secure configuration. I discovered a couple of gotchas along the way.

Prerequisites: My test cluster consists of four CentOS 5.7 VMs, each with dual cores and 2GB of memory. I named these four VMs ‘master’, ‘slave1’, ‘slave2’, and ‘slave3’. I created a hosts file mapping these names to their IP addresses and copied it over to each of these machines. I also configured the VM ‘master’ to be able to do passwordless ssh into the three slaves.
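
For reference, here is a minimal sketch of that prerequisite setup. The IP addresses are placeholders I made up for illustration; substitute your own:

    # On every node: map the hostnames to IP addresses (example addresses)
    cat >> /etc/hosts <<'EOF'
    192.168.1.10 master
    192.168.1.11 slave1
    192.168.1.12 slave2
    192.168.1.13 slave3
    EOF

    # On 'master' only: generate a key and push it to each slave
    # so root can ssh in without a password
    ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa
    for h in slave1 slave2 slave3; do ssh-copy-id root@$h; done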

  • Log in to the node ‘master’ as root, and do the following.
  • Download the JDK and install it. I am using JDK 1.6.0 Update 29. Add a file /etc/profile.d/java.sh that sets the env variable JAVA_HOME and adds $JAVA_HOME/bin to the PATH (a minimal java.sh is sketched after this list). Run ‘java -version’ and ensure that you are getting Oracle JDK 1.6 and not openjdk or some other such silliness.
  • Download the rpm ‘hadoop-0.20.205.0-1.i386.rpm’, and install it using ‘rpm --install hadoop-0.20.205.0-1.i386.rpm’.
  • Hadoop includes a convenient script /usr/sbin/hadoop-setup-conf.sh for generating configuration files (hadoop does not suffer from a paucity of configuration options). Run this script on the node ‘master’ to generate the configuration files. The command line I used was as follows: ‘/usr/sbin/hadoop-setup-conf.sh --namenode-host=master --jobtracker-host=master --conf-dir=/etc/hadoop --hdfs-dir=/var/lib/hadoop/hdfs --namenode-dir=/var/lib/hadoop/hdfs/namenode --mapred-dir=/var/lib/hadoop/mapred --datanode-dir=/var/lib/hadoop/hdfs/data --log-dir=/var/log/hadoop --auto --mapreduce-user=mapred --dfs-support-append=true’
  • At this point, log out of the shell, and then log in again (as root). This is necessary because a file /etc/profile.d/hadoop-env.sh is created with critical environment variables. Without these env variables sourced, subsequent operations will fail.
  • Now, format the HDFS using the following command: ‘/usr/sbin/hadoop-setup-hdfs.sh --format’
  • Start the namenode using ‘/etc/init.d/hadoop-namenode start’
  • Start the jobtracker using ‘/etc/init.d/hadoop-jobtracker start’
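
For the java.sh file mentioned above, here is a minimal sketch. The JDK path is an assumption based on where the Oracle JDK 1.6.0 Update 29 rpm installs by default; adjust it to match your install:

    # /etc/profile.d/java.sh -- JDK path is assumed; adjust to your install
    export JAVA_HOME=/usr/java/jdk1.6.0_29
    export PATH=$JAVA_HOME/bin:$PATH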

At this point, your ‘master’ is ready. Next, we set up the slaves.

  • Log in as root to the node ‘slave1’.
  • Download and install the JDK. See the instructions for the master above.
  • Download and install the Hadoop RPM. See instructions for master above.
  • Run the same ‘/usr/sbin/hadoop-setup-conf.sh’ command as you did on the master to generate config files. Note that the config files for the slaves are exactly the same as for the master.
  • Finally, run ‘/etc/init.d/hadoop-datanode start’ and ‘/etc/init.d/hadoop-tasktracker start’.
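
As a quick sanity check, jps (which ships with the JDK) lists the running Java daemons on each node:

    jps
    # on 'master': expect NameNode and JobTracker in the listing
    # on each slave: expect DataNode and TaskTracker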

Once the slaves are set up, browse over to http://master:50070/ to get to the NameNode web UI. Ensure that there are three ‘Live Nodes’ listed. Also, browse over to http://master:50030/ to get to the JobTracker web UI. Ensure that the jobtracker can see three nodes.
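
If you prefer the command line to the web UIs, the datanode count is also available from the dfsadmin report:

    hadoop dfsadmin -report
    # the report includes the number of live datanodes; it should show 3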

As the final step, run the wordcount example. I did so, not as root, but as the user ‘jagane’.

  • First, I created a home directory on HDFS for the user ‘jagane’. Logged into the Linux system ‘master’ as root, I typed ‘/usr/sbin/hadoop-create-user.sh -u jagane’
  • Next, I logged into the Linux system ‘master’ as user ‘jagane’ and created an input directory on HDFS, like so: ‘hadoop fs -mkdir /user/jagane/input’
  • I am going to run word count on the linux dict, so I typed ‘hadoop fs -copyFromLocal /usr/share/dict/linux.words /user/jagane/input’ to copy the dict file over to HDFS.
  • Finally, the moment of truth. I typed ‘hadoop jar /usr/share/hadoop/hadoop-examples-0.20.205.0.jar wordcount /user/jagane/input /user/jagane/output’. That actually worked. I counted the words in the linux dict.
  • To prove that it worked, I dumped the output using ‘hadoop fs -cat /user/jagane/output/part-r-00000’.
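
The output is plain text, one word per line with a tab-separated count. Since linux.words is a dictionary in which each word appears once, every count should be 1. To spot-check without dumping the whole file:

    hadoop fs -cat /user/jagane/output/part-r-00000 | head
    # each line is: <word><TAB><count>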

Well, there you have it. Hadoop 0.20.205 from rpm in a jiffy (‘big data’ jiffy that is).

I’m sure folks have a dozen different ways (chef, puppet, pdsh) of installing and managing Hadoop. But there is something about the elegance of a well-packaged rpm and a nice configuration-generation script that is just great.

Congratulations to Eric Yang on putting together the rpm, and its configuration script.

