Recently, I got around to installing Hadoop 0.20.205 using its rpm. I also used the included configuration scripts to create a functional multi-node Hadoop configuration. I chose to use a non-secure configuration, and I discovered a couple of gotchas along the way.
Pre-requisite: My test cluster consists of 4 CentOS 5.7 VMs each with dual cores and 2GB of memory. I named these 4 VMs ‘master’, ‘slave1’, ‘slave2’, and ‘slave3’. I created a hosts file mapping these names to their IP addresses and copied it over to each of these machines. I also configured the VM ‘master’ to be able to do passwordless ssh into the three slaves.
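A rough sketch of that hosts file and the passwordless ssh setup follows. The 192.168.1.x addresses are made-up placeholders, and the function names `print_hosts_entries` and `setup_ssh_keys` are hypothetical helpers for illustration, not something that ships with CentOS or Hadoop. It also assumes root logins over ssh are allowed on the slaves.

```shell
# Sketch only: the IP addresses below are placeholders, substitute your own.

# Print the hosts entries for the four VMs.
print_hosts_entries() {
    cat <<'EOF'
192.168.1.10 master
192.168.1.11 slave1
192.168.1.12 slave2
192.168.1.13 slave3
EOF
}

# Generate an RSA key on 'master' (if none exists yet) and copy the public
# key to each slave. You are prompted for each slave's password once.
setup_ssh_keys() {
    [ -f "$HOME/.ssh/id_rsa" ] || ssh-keygen -t rsa -N "" -f "$HOME/.ssh/id_rsa"
    for host in slave1 slave2 slave3; do
        ssh-copy-id -i "$HOME/.ssh/id_rsa.pub" "root@$host"
    done
}

# Usage, as root on 'master':
#   print_hosts_entries >> /etc/hosts
#   setup_ssh_keys
#   ssh root@slave1 hostname    # should not ask for a password
```

The same hosts entries need to go onto every machine, not just ‘master’.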
- Log in to the node ‘master’ as root, and do the following.
- Download the JDK and install it. I am using JDK 1.6.0 Update 29. Add a file /etc/profile.d/java.sh that sets the env variable JAVA_HOME and adds $JAVA_HOME/bin to the path. Run ‘java -version’ and ensure that you are getting Oracle JDK 1.6 and not openjdk or some other such silliness.
- Download the rpm ‘hadoop-0.20.205.0-1.i386.rpm’, and install it using ‘
rpm --install hadoop-0.20.205.0-1.i386.rpm‘
- Hadoop includes a convenient script, /usr/sbin/hadoop-setup-conf.sh, for generating configuration files (Hadoop does not suffer from a paucity of configuration options). First, I need to run this script on the node ‘master’ to generate the configuration files. The command line I used was as follows: ‘
/usr/sbin/hadoop-setup-conf.sh --namenode-host=master --jobtracker-host=master --conf-dir=/etc/hadoop --hdfs-dir=/var/lib/hadoop/hdfs --namenode-dir=/var/lib/hadoop/hdfs/namenode --mapred-dir=/var/lib/hadoop/mapred --datanode-dir=/var/lib/hadoop/hdfs/data --log-dir=/var/log/hadoop --auto --mapreduce-user=mapred --dfs-support-append=true‘
- At this point, log out of the shell, and then log in again (as root). This is necessary because the setup script creates a file /etc/profile.d/hadoop-env.sh with critical environment variables. Without these env variables sourced, subsequent operations will fail.
- Now, format the HDFS.
- Start up the namenode.
- Start up the jobtracker.
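For the JDK step above, the /etc/profile.d/java.sh file could look roughly like what this sketch writes out. The path /usr/java/jdk1.6.0_29 is an assumption about where the JDK landed on your machine, and `write_java_profile` is a hypothetical helper name used only for illustration.

```shell
# Sketch: write a profile script that sets JAVA_HOME and extends PATH.
#   $1: JDK install directory (an assumption, check your own system)
#   $2: file to write
write_java_profile() {
    # The unquoted heredoc expands $1 now; the escaped \$ stays literal so
    # that JAVA_HOME and PATH are resolved at login time.
    cat > "$2" <<EOF
export JAVA_HOME=$1
export PATH=\$JAVA_HOME/bin:\$PATH
EOF
}

# Usage, as root:
#   write_java_profile /usr/java/jdk1.6.0_29 /etc/profile.d/java.sh
#   . /etc/profile.d/java.sh
#   java -version
```

Putting $JAVA_HOME/bin at the front of the path is what makes ‘java -version’ pick up this JDK ahead of any other java already on the system.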
At this point, your ‘master’ is ready. Next, we set up the slaves.
- Log in as root to the node ‘slave1’.
- Download and install the JDK. See instructions for master above.
- Download and install the Hadoop RPM. See instructions for master above.
- Run the same ‘/usr/sbin/hadoop-setup-conf.sh’ command as you did on the master to generate config files. Note that the config files for the slaves are exactly the same as for the master.
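- Finally, run ‘
/etc/init.d/hadoop-datanode start‘ to start the datanode, and then start the tasktracker as well. Repeat these steps on ‘slave2’ and ‘slave3’.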
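Since ‘master’ can already ssh into the slaves without a password, the datanode start can also be driven from ‘master’ in a loop. This is only a sketch: `datanode_start_cmd` and `start_all_datanodes` are hypothetical helper names, and it assumes the JDK, the rpm, and the config files are already in place on every slave.

```shell
# Sketch: start the datanode on each slave over ssh from 'master'.
SLAVES="slave1 slave2 slave3"

# Print the ssh command for one slave without running it (a dry run).
datanode_start_cmd() {
    echo "ssh root@$1 /etc/init.d/hadoop-datanode start"
}

# Run the init script on every slave; stop at the first failure.
start_all_datanodes() {
    for host in $SLAVES; do
        echo "Starting datanode on $host"
        ssh "root@$host" /etc/init.d/hadoop-datanode start || return 1
    done
}

# Usage, as root on 'master':
#   datanode_start_cmd slave1     # shows what would be run
#   start_all_datanodes
```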
Once the slaves are set up, browse over to http://master:50070/ to get to the NameNode web UI. Ensure that there are three ‘Live Nodes’ listed. Also, browse over to http://master:50030/ to get to the JobTracker web UI. Ensure that the jobtracker can see three nodes.
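If you have no browser handy, a quick reachability check from the command line might look like the sketch below. It assumes curl is installed, and `ui_url` and `check_ui` are hypothetical helper names. Note that it only confirms that the two web UIs answer over HTTP; it does not count live nodes, so you still need to look at the pages themselves for that.

```shell
# Sketch: check that the NameNode and JobTracker web UIs respond.

# Map a service name to the URL of its web UI.
ui_url() {
    case "$1" in
        namenode)   echo "http://master:50070/" ;;
        jobtracker) echo "http://master:50030/" ;;
        *)          return 1 ;;
    esac
}

# Fetch the UI page and report whether it could be retrieved.
check_ui() {
    url=$(ui_url "$1") || { echo "unknown service: $1" >&2; return 1; }
    # -s silent, -f fail on HTTP errors, -L follow redirects, -o discard body
    if curl -sfL -o /dev/null "$url"; then
        echo "$1 UI is up at $url"
    else
        echo "$1 UI is not reachable at $url" >&2
        return 1
    fi
}

# Usage:
#   check_ui namenode
#   check_ui jobtracker
```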
As the final step, run the wordcount example. I did so, not as root, but as the user ‘jagane’.
- First, I created a home directory on HDFS for the user ‘jagane’. Logged into the Linux system ‘master’ as root, I typed ‘
/usr/sbin/hadoop-create-user.sh -u jagane‘
- Next, I logged into the Linux system ‘master’ as user ‘jagane’ and created an input directory on HDFS, like so: ‘
hadoop fs -mkdir /user/jagane/input‘
- I wanted to run word count on the Linux dict, so I typed in ‘
hadoop fs -copyFromLocal /usr/share/dict/linux.words /user/jagane/input‘ to copy the dict file over to HDFS.
- Finally, the moment of truth. I typed in ‘
hadoop jar /usr/share/hadoop/hadoop-examples-0.20.205.0.jar wordcount /user/jagane/input /user/jagane/output‘. That actually worked. I counted the words in the linux dict.
- To prove that it worked, I dumped the output using ‘
hadoop fs -cat /user/jagane/output/part-r-00000‘
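Dumping the whole file scrolls by quickly, so it can help to summarize it instead. The wordcount example writes one word per line followed by a tab and its count; the sketch below relies on that format. `total_words` and `top_words` are hypothetical helper names, not part of Hadoop.

```shell
# Sketch: summarize wordcount output read from stdin ("word<TAB>count" lines).

# Sum the count column (the second tab-separated field).
total_words() {
    awk -F '\t' '{ total += $2 } END { print total + 0 }'
}

# Show the N most frequent words (default 10), highest count first,
# with ties broken alphabetically.
top_words() {
    sort -t "$(printf '\t')" -k2,2nr -k1,1 | head -n "${1:-10}"
}

# Usage, as user 'jagane' on 'master':
#   hadoop fs -cat /user/jagane/output/part-r-00000 | total_words
#   hadoop fs -cat /user/jagane/output/part-r-00000 | top_words 5
```

As a sanity check, the total should come out close to what ‘wc -w /usr/share/dict/linux.words’ reports on the local file, since both split the input on whitespace.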
Well, there you have it. Hadoop 0.20.205 from rpm in a jiffy (‘big data’ jiffy that is).
I’m sure folks have a dozen different ways (chef, puppet, pdsh) of installing and managing Hadoop. But there is something about the elegance of a well-packaged rpm and a nice configuration generation script that is just great.
Congratulations to Eric Yang on putting together the rpm, and its configuration script.