Hadoop

Running Hadoop in Terminal Rooms

To set up a Hadoop cluster in a terminal room, make sure you have booted some PCs with Ubuntu Linux 14.04. Write down the names of the PCs you want to use as slave nodes. Then run tkhadoop.sh:

/usr/local/bin/tkhadoop.sh [slaves]

For example, if you're physically working at hg137pc01 while hg137pc02 and hg137pc03 are available and running Linux, use the latter two as slave nodes. Before setting up the Hadoop cluster, test whether you can log in to the slave nodes:

$ ssh hg137pc02
$ exit

and

$ ssh hg137pc03
$ exit

You should be able to log in to these computers without entering your password. If that does not succeed, type:

$ kinit

You'll be prompted for your password. After that, you should be able to log in to the slave nodes without providing your password.
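
As a quick sanity check (assuming the terminal-room machines use the standard MIT Kerberos client tools, which the use of kinit suggests), you can inspect your ticket cache before or after running kinit:

$ klist

An empty or expired ticket cache means passwordless ssh will fail; run kinit again in that case.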


To set up the temporary Hadoop environment, use the following command:

$ tkhadoop.sh hg137pc02.science.ru.nl hg137pc03.science.ru.nl
  • Be sure to use fully qualified domain names for the slave host names. This is required for the Kerberos-based ssh authentication used in this script and in the scripts that are bundled with Hadoop.
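
If you're unsure of a machine's fully qualified name, the standard hostname -f command prints it; for example, run over ssh from the master (the expected output here matches the names used above):

$ ssh hg137pc02 hostname -f
hg137pc02.science.ru.nl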

This will set up the files needed to run a three-node Hadoop cluster. The host on which you execute tkhadoop.sh will be the master node and will be used as a slave as well. You'll find your Hadoop installation in:

/tmp/username/hadoop
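
A quick listing should show a regular Hadoop 1.x distribution; a rough sketch of what to expect (the exact contents and the version of the examples jar depend on what tkhadoop.sh installs):

$ ls /tmp/$USER/hadoop
bin  conf  hadoop-examples-1.1.1.jar  ...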

From within this directory, you can test the examples as documented on the Apache Hadoop website (http://hadoop.apache.org/docs/stable/single_node_setup.html):

$ cd /tmp/$USER/hadoop
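
The steps below assume the HDFS and MapReduce daemons are running; whether tkhadoop.sh starts them for you is an assumption you should verify (the web interfaces below make a convenient check). If they are not running, the stock Hadoop 1.x scripts start them; only format the namenode on a fresh installation that has never been formatted:

$ bin/hadoop namenode -format
$ bin/start-all.sh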

Browse the web interface for the NameNode and the JobTracker; by default they are available at:

NameNode   - http://localhost:50070/
JobTracker - http://localhost:50030/

Copy the input files into the distributed filesystem:

$ bin/hadoop fs -put conf input
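
To verify the upload, list the directory on the distributed filesystem:

$ bin/hadoop fs -ls input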

Run some of the examples provided:

$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
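
This job searches the uploaded configuration files for strings matching the regular expression dfs[a-z.]+ and writes each match with its count to output. The examples jar bundles other demo jobs as well; for instance wordcount, part of the stock Hadoop examples, run here with a hypothetical output-wc target directory:

$ bin/hadoop jar hadoop-examples-*.jar wordcount input output-wc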

Examine the output files. Either copy them from the distributed filesystem to the local filesystem and examine them there:

$ bin/hadoop fs -get output output 
$ cat output/*

or

View the output files on the distributed filesystem:

$ bin/hadoop fs -cat output/*

When you're done, stop the daemons with:

$ bin/stop-all.sh

Make sure to clean up the files in /tmp/$USER on the master and the slave nodes.
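
A minimal sketch of that cleanup, assuming the three machines from the example above (double-check that nothing else of yours lives under /tmp/$USER first; note that $USER expands locally before ssh, which is what you want since the username is the same on all nodes):

$ rm -rf /tmp/$USER
$ for h in hg137pc02.science.ru.nl hg137pc03.science.ru.nl; do ssh $h rm -rf /tmp/$USER; done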

tkhadoop.sh

tkhadoop.sh overwrites the following configuration files:

hadoop-env.sh
core-site.xml
hdfs-site.xml
mapred-site.xml
masters
slaves

The file masters contains the hostname of the machine on which tkhadoop.sh is executed. The file slaves contains the master node, as well as the hosts specified on the command line.
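
As an illustration: with tkhadoop.sh executed on hg137pc01 and the two slaves from the example above, the generated files would look like this (assuming they live in the installation's conf directory, the usual Hadoop 1.x location):

$ cat /tmp/$USER/hadoop/conf/masters
hg137pc01.science.ru.nl
$ cat /tmp/$USER/hadoop/conf/slaves
hg137pc01.science.ru.nl
hg137pc02.science.ru.nl
hg137pc03.science.ru.nl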