CyStorm - Using Hadoop

The Hadoop portion of the cluster is used by submitting a job to the Torque queue hadoop2. The hadoop2 queue is a serial queue: only one job runs at a time, which ensures a consistent environment for users.

Students taking the course CprE/SE 419 should submit their jobs to the Torque queue studenthadoop.

Jobs should be submitted from the headnode (cystorm.its). The job scheduler Torque is typically used with a submission script, but this is not required. Jobs can be submitted to the hadoop2 queue with the following command:

   qsub -q hadoop2 -j oe -o <path> -v VARIABLE=VALUE {<script> | <command>}
where

   -q hadoop2            selects the hadoop2 queue
   -j oe                 merges the job's stderr into its stdout stream
   -o <path>             writes the job output to <path>
   -v VARIABLE=VALUE     passes environment variables to the job
   <script> | <command>  is the submission script or command to run
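
For example, a script named run_job.sh (a placeholder name, not a file provided on the cluster) could be submitted to hadoop2 with its output log written to the home directory:

   qsub -q hadoop2 -j oe -o $HOME/run_job.log run_job.sh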

Similarly, jobs can be submitted to the studenthadoop queue with the following command:

   qsub -q studenthadoop -j oe -o <path> -v VARIABLE=VALUE {<script> | <command>}

Issue "man qsub" to learn more about qsub. This information is also available at the Cluster Resources website.

Scripts for Hadoop job submission typically use the form

   yarn jar <jar> [mainClass] args...
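
As a sketch, a minimal submission script might run the word count job bundled with Hadoop. The script name run_job.sh, the examples jar location under /hadoop/share, and the HDFS paths are illustrative assumptions; the exact jar name depends on the installed Hadoop version:

   #!/bin/bash
   # run_job.sh - hypothetical example script; runs the bundled word count job
   yarn jar /hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
       wordcount /user/$USER/input /user/$USER/output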

Add

   PATH="/hadoop/bin:$PATH"
to your .bashrc so that yarn can be found.
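
After reloading the shell configuration, you can check that the command is found (the expected path assumes the PATH addition above):

   source ~/.bashrc
   which yarn     # should print /hadoop/bin/yarn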

More information can be found at the Hadoop website.

The node namenode2 should be used to upload files to and download files from the Hadoop Distributed File System (HDFS). To load files into HDFS, first copy them to cystorm as described in How to logon and transfer files, then log in to namenode2 and use the hadoop fs -copyFromLocal command. Both home and working directories are mounted on the headnode and on namenode2. The Hadoop FS commands are described at the Hadoop website.
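
For example, assuming a file data.txt has already been copied to your home directory (the HDFS directory names below are illustrative, not fixed conventions on CyStorm):

   hadoop fs -mkdir -p /user/$USER/input
   hadoop fs -copyFromLocal ~/data.txt /user/$USER/input/
   hadoop fs -ls /user/$USER/input
   hadoop fs -copyToLocal /user/$USER/output/part-r-00000 ~/results/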

The "qstat" command will show a brief status of your job in torque. "showq" command displays a more user friendly list of the queue content.

Example

The Hadoop installation includes a word count example. Detailed instructions on how to run this example on cystorm are at Word Count Example.

Notes about hadoop queue