Monitoring cluster usage and troubleshooting

You can monitor the cluster usage and also monitor and control your jobs using the following examples.

qhost -F will show the resources available for each node.

qmon and ganglia(http://141.80.186.22/ganglia/) will give you visual interface to see the load of the cluster. qmon works when you ssh -X to the login node. Can bu also used to submit jobs other than monitoring.

monitor your jobs

To see a list of all your pending and currently-running jobs, use the qstat command: qstat

To see details about a particular job, use qstat with the -j option. The jobid is the ID number of the job reported by qsub and qstat: qstat -j <jobid>

To see why a pending job has not been scheduled yet, look at the output from qstat -j <jobid>.

By default qstat only shows your own jobs. To see all the jobs on the cluster: qstat -u \*

To check memory used by a job qstat -f -j <jobid>| grep vmem # -f means qstat in full mode. Gives details

delete your jobs

qdel <jobid> will delete the jobs from the queue.

##Troubleshooting:##

###Why my job doesn’t start:### Usually, this is because the requested memory or CPU resources are not available. Sometimes, nodes are taken out of the cluster for maintanence, which reduces the available resources.

###Why my job has failed:### Here are a couple of steps that might help you understand why your job failed. More often than not, you probably exceed the resources you asked for.

  • Read at the error file (default .e)
  • Look at the qacct -j <jobid>
    • Runtime above limit
    • Maxvmem above requested limit?
    • Check the exit status ? (137 means SIGKILL, force killed )
    • record what host the process was running
  • ssh to the host node and grep the jobid in /var/spool/sge/nodeXXX/messages
$ ssh nodeXXX
                    $ grep <jobid> /var/spool/sge/nodeXXX/messages
  • request assistance from helpdesk with all the above information

results matching ""

    No results matching ""