Tuesday, 24 March 2015
Tuesday, 3 March 2015
Fully Distributed Hadoop Federation Cluster
Federation Concepts
"HDFS Federation improves the existing HDFS architecture through a clear separation of namespace and storage, enabling generic block storage layer. It enables support for multiple namespaces in the cluster to improve scalability and isolation. Federation also opens up the architecture, expanding the applicability of HDFS cluster to new implementations and use cases.I have created two guest machines for the namenodes and datanodes to run on. Each VM is allocated 1GB of RAM and 20GB of HDD. Each machine is connected through host-only adapter. Client will be the host machine itself which is able to ssh and connect to both namenodes inside the VirtualBox virtual machines."
-- An excerpt from a Hortonworks blog by Suresh Srinivas.
Find more on Apache Hadoop Federation.
Setup Summary
I have implemented Hadoop Federation inside ESXi server to create a fully distributed cluster on Apache Hadoop 2.6. Setup lists two namenodes and two datanodes each with 1GB RAM and 20GB HDD alloted. Also I have configured a HDFS client on a separate machine with same hardware configuration. All the nodes are running on Ubuntu 14.10. It is also recommended to use latest Oracle JDK for namenodes and datanodes. This setup only demonstrates how to build a federated cluster and hence YARN is not configured.Below mentioned hostnames and ip-addresses are just an example to provide a distinctive view for understanding the cluster setup.
namenode1 fed-nn01   192.168.56.101
namenode2 fed-nn02   192.168.56.102
datanode1 fed-dn01   192.168.56.103
datanode2 fed-dn02   192.168.56.104
client    fed-client 192.168.56.105Downloads
Download the below packages and place the tarballs on all namenodes, datanodes and client machines. Apache Hadoop 2.6 Download Oracle JDK8 DownloadNote: Before moving directly towards hadoop installation and configuration - 1. Disable firewall on all namenodes, datanodes and client machines. 2. Comment 127.0.1.1 address line in /etc/hosts file on all namenodes, datanodes and client machines. 3. Update correct hostnames and their respective ip-addresses of all namenodes, datanodes and client machines in their /etc/hosts file. 4. Install ssh on all namenodes and datanodes. Not required for client machine.
INSTALLATION & CONFIGURATION
Hadoop installation and configuration includes setting up of hadoop user, installing java, passwordless ssh, and finally hadoop installation, configuration and monitoring.User Configuration
Location:fed-nn01, fed-nn02, fed-dn01, fed-dn02, fed-clientFirst we will create a group “hadoop” and a user “huser” for all hadoop administrative tasks and set password for huser.
$ sudo groupadd hadoop
$ sudo useradd -m -d /home/huser -g hadoop huser
$ sudo passwd huserJava Installation
We will go for the recommended Oracle JDK installation and configuration. At the time of writing this document JDK8 was the latest that was available.Location:
fed-nn01, fed-nn02, fed-dn01, fed-dn02Make a directory named java inside
/usrhuser:~$ sudo mkdir /usr/javaCopy the tarball.
huser:~$ sudo cp /path/to/jdk-8u25-linux-x64.tar.gz /usr/javahuser:~$ sudo tar -xzvf /usr/java/ jdk-8u25-linux-x64.tar.gzSet the environment for jdk8.
huser:~$ sudo vi /etc/profile.d/java.sh
JAVA_HOME=/usr/java/jdk1.8.0_25/
PATH=$JAVA_HOME/bin:$PATH
export PATH JAVA_HOME
export CLASSPATH=.
huser:~$ sudo chmod +x /etc/profile.d/java.sh
huser:~$ source /etc/profile.d/java.shhuser:~$ java -version
java version "1.8.0_25"
Java(TM) SE Runtime Environment (build 1.8.0_25-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)Passwordless SSH Configuration
Passwordless ssh is required by the namenode only to start the HDFS and MapReduce daemons in various nodes.Location:
fed-nn01huser@fed01:~$ ssh-keygen -t rsa
huser@fed01:~$ touch /home/huser/.ssh/authorized_keys
huser@fed01:~$ cp /home/huser/.ssh/id_rsa.pub authorized_keys
huser@fed01:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub huser@fed-nn02
huser@fed01:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub huser@fed-dn01
huser@fed01:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub huser@fed-dn02fed-nn02huser@fed02:~$ ssh-keygen -t rsa
huser@fed02:~$ touch /home/huser/.ssh/authorized_keys
huser@fed02:~$ cp /home/huser/.ssh/id_rsa.pub authorized_keys
huser@fed02:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub huser@fed-nn01
huser@fed02:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub huser@fed-dn01
huser@fed02:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub huser@fed-dn02ssh on fed-nn01 & fed-nn02Run below commands to test the passwordless logins from both fed-nn01 & fed-nn02.
huser:~$ ssh fed-nn01
huser:~$ ssh fed-nn02
huser:~$ ssh fed-dn01
huser:~$ ssh fed-dn02Whereas, client communicates with the datanode directly for I/O operations (read/write) using Data Transfer Protocol defined in DataTransferProtocol.java. For the purpose of performance this is a streaming protocol and not RPC. Before streaming the block to the datanode, the client buffers the data until a full block (64MB/128MB) has been created.
Hadoop Installation
We will go forth with the latest stable Apache Hadoop 2.6.0 release.Location:
fed-nn01, fed-nn02, fed-dn01, fed-dn02, fed-clientWe will untar the tarball and place it in
/opt directory.huser:~$ tar -xzvf hadoop-2.6.0.tar.gzChanging the ownership of the directory to huser.
huser:~$ sudo chown -R huser:hadoop /opt/hadoop-2.6.0/Setting the environment variables in .bashrc file of “huser” user.
huser:~$ sudo vi ~/.bashrc
###JAVA CONFIGURATION###
JAVA_HOME=/usr/java/jdk1.8.0_25/
PATH=$JAVA_HOME/bin:$PATH
export PATH JAVA_HOME
export CLASSPATH=.
###HADOOP CONFIGURATION###
HADOOP_PREFIX=/opt/hadoop-2.6.0/
PATH=$HADOOP_PREFIX/bin:$PATH
PATH=$HADOOP_PREFIX/sbin:$PATH
export PATHhuser:~$ exec bashTesting Hadoop Installation
huser:~$ hadoop version
Hadoop 2.6.0
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r e3496499ecb8d220fba99dc5ed4c99c8f9e33bb1
Compiled by jenkins on 2014-11-13T21:10Z
Compiled with protoc 2.5.0
From source with checksum 18e43357c8f927c0695f1e9522859d6a
This command was run using /opt/hadoop-2.6.0/share/hadoop/common/hadoop-common-2.6.0.jarHadoop Configuration
There are a few files to be configured which will make hadoop up and running. For us these configuration files reside in /opt/hadoop-2.6.0/etc/hadoop/ directory.hadoop-env.sh Location: fed-nn01, fed-nn02, fed-dn01, fed-dn02, fed-clienthuser:~$ vi /opt/hadoop-2.6.0/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/java/jdk1.8.0_25/
export HADOOP_LOG_DIR=/var/log/hadoop/$ sudo mkdir /var/log/hadoop
$ sudo chown -R huser:hadoop /var/log/hadoopcore-site.xmlLocation:
fed-nn01huser@fed-nn01:~$ sudo vi /opt/hadoop-2.6.0/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://fed-nn01</value>
</property>
</configuration>fed-nn02huser@fed-nn02:~$ sudo vi /opt/hadoop-2.6.0/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://fed-nn02</value>
</property>
</configuration>fed-dn01, fed-dn02huser@fed-dn01:~$ sudo vi /opt/hadoop-2.6.0/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://fed-nn01,hdfs://fed-nn02</value>
</property>
</configuration>fed-clienthuser@fed-client:~$ sudo vi /opt/hadoop-2.6.0/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>viewfs://fedcluster/</value>
</property>
<property>
<name>fs.viewfs.mounttable.fedcluster.link./ns1</name>
<value>hdfs://fed-nn01:8020</value>
</property>
<property>
<name>fs.viewfs.mounttable.fedcluster.link./ns2</name>
<value>hdfs://fed-nn02:8020</value>
</property>
</configuration>hdfs-site.xmlLocation:
fed-nn01, fed-nn02hsuer:~$ sudo vi /opt/hadoop-2.6.0/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///hdfs/name</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.nameservices</name>
<value>ns1,ns2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.ns1</name>
<value>fed01:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.ns2</name>
<value>fed02:8020</value>
</property>
</configuration>$ sudo mkdir -p /hdfs/name
$ sudo chown -R huser:hadoop /hdfs/fed-dn01, fed-dn02hsuer:~$ sudo vi /opt/hadoop-2.6.0/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///hdfs/data</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.nameservices</name>
<value>ns1,ns2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.ns1</name>
<value>fed01:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.ns2</name>
<value>fed02:8020</value>
</property>
</configuration>huser:~$ sudo mkdir -p /hdfs/data
huser:~$ sudo chown -R huser:hadoop /hdfs/fed-client<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.nameservices</name>
<value>ns1,ns2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.ns1</name>
<value>fed-nn01:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.ns2</name>
<value>fed-nn02:8020</value>
</property>
</configuration>Location:
fed-nn01, fed-nn02huser:~$ vi /opt/hadoop-2.6.0/etc/hadoop/slaves
fed-dn01
fed-dn02Formatting HDFS Filesystem
Location:fed-nn01, fed-nn02huser:~$ hdfs namenode -format -clusterID fedclusterNote: ClusterID if not specified is randomly chosen by the system.
Starting HDFS
Run the below command either on fed-nn01 or fed-nn02 namenode to start the dfs.$ start-dfs.shAt this point, listing of files directly through fed-nn01 and fed-nn02 will show the files that are present only in their respective configured namespaces i.e. ns1 and ns2. Whereas, listing of files on client machine will show both namespaces mountpoints mounted under /. Though for list or copy/retrieve file operations from one namenode to other can be done by specifying full HDFS URI in the commands.
Examples:
Listing files in
fed-nn01 namenodehuser@fed-nn01:~$ hdfs dfs -ls -h /
Found 2 items
-rw-r--r-- 2 huser hadoop 128 M 2015-01-02 16:23 /512mb-junk
drwxr-xr-x - huser hadoop 0 2015-01-02 16:28 /userfed-nn02 namenode from fed-nn01 namenodehuser@fed-nn01:~$ hdfs dfs -ls hdfs://fed-nn02:8020/user
Found 4 items
-rw-r--r-- 2 huser hadoop 128 M 2015-01-02 16:26 hdfs://hdfs-fed02:8020/user/512mb-junk
drwxr-xr-x - huser hadoop 0 2015-01-03 02:32 hdfs://hdfs-fed02:8020/user/dir1
-rw-r--r-- 2 huser hadoop 1 G 2015-01-02 15:25 hdfs://hdfs-fed02:8020/user/lgb-junk
-rw-r--r-- 2 huser hadoop 1 G 2015-01-02 15:31 hdfs://hdfs-fed02:8020/user/lgb-junk.1fed-client machineshuser@fed-client:~$ hdfs dfs -ls /
Found 2 items
-r-xr-xr-x - ibm ibm 0 2015-01-03 02:55 /ns1
-r-xr-xr-x - ibm ibm 0 2015-01-03 02:55 /ns2fed-nn02 from fed-client$ hdfs dfs -ls -h /b/user
Found 4 items
-rw-r--r-- 2 huser hadoop 128 M 2015-01-02 16:26 hdfs://hdfs-fed02:8020/user/512mb-junk
drwxr-xr-x - huser hadoop 0 2015-01-03 02:32 hdfs://hdfs-fed02:8020/user/dir1
-rw-r--r-- 2 huser hadoop 1 G 2015-01-02 15:25 hdfs://hdfs-fed02:8020/user/lgb-junk
-rw-r--r-- 2 huser hadoop 1 G 2015-01-02 15:31 hdfs://hdfs-fed02:8020/user/lgb-junk.1Monitoring
Below http links are also helpful in monitoring and browsing HDFS filesystem using a web-browser on any of the namenodes or client machines.http://fed-nn01:50070/dfshealth.jsp
http://fed-nn02:50070/dfshealth.jsp
http://fed-nn01:50070/dfsclusterhealth.jsp
http://fed-nn02:50070/dfsclusterhealth.jsp
Related Links
http://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/ViewFs.html http://hashprompt.blogspot.pt/2015/01/fully-distributed-hadoop-federation.html http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/Federation.html https://www.usenix.org/legacy/event/hotos09/tech/full_papers/ko/ko_html/ http://hortonworks.com/blog/webhdfs-%E2%80%93-http-rest-access-to-hdfs/ http://hadoop.apache.org/docs/r1.0.4/webhdfs.html#File+and+Directory+Operations
Subscribe to:
Comments (Atom)
