You can monitor cluster usage, and monitor and control your own jobs, with the commands below.
qhost -F will show the resources available for each node.
qmon and ganglia (http://126.96.36.199/ganglia/) give you a visual interface for viewing the load on the cluster. qmon works when you ssh -X to the login node. Besides monitoring, it can also be used to submit jobs.
To see a list of all your pending and currently running jobs, use the qstat command:
qstat
To see details about a particular job, use qstat with the -j option. The jobid is the ID number of the job reported by qsub and qstat:
qstat -j <jobid>
To see why a pending job has not been scheduled yet, look at the output from
qstat -j <jobid>.
By default qstat only shows your own jobs. To see all the jobs on the cluster:
qstat -u \*
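When the job list is long, a per-state count can be easier to read than the full listing. The following is a minimal sketch, not part of the cluster's tooling: count_job_states is a hypothetical helper name, and it assumes the default qstat layout (two header lines, then one row per job with the state, such as r or qw, in the fifth column).

```shell
# Hypothetical helper: count jobs per state from qstat output read on stdin.
# Assumption: default qstat layout, i.e. two header lines followed by one
# row per job, with the job state (r, qw, Eqw, ...) in the fifth column.
count_job_states() {
    awk 'NR > 2 && NF >= 5 { n[$5]++ }          # skip the two header lines
         END { for (s in n) print s, n[s] }' |  # one "state count" line per state
        sort                                    # stable, alphabetical order
}
```

It could be used as qstat | count_job_states for your own jobs, or qstat -u \* | count_job_states for the whole cluster.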
To check the memory used by a job:
qstat -f -j <jobid> | grep vmem   # -f runs qstat in full mode, which gives more detail
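A plain grep for vmem can also match resource requests such as h_vmem. The sketch below narrows the match to the usage line only. job_vmem is a hypothetical helper name, and the code assumes the usage line looks like "usage 1: cpu=..., mem=..., io=..., vmem=1.2G, maxvmem=1.5G"; check the output on your own cluster before relying on it.

```shell
# Hypothetical helper: print the vmem and maxvmem fields of a running job
# from "qstat -j <jobid>" output read on stdin.
# Assumption: usage is reported on a line starting with "usage", as
# comma-separated key=value pairs that include vmem= and maxvmem=.
job_vmem() {
    grep '^usage' |                             # ignore request lines such as h_vmem=...
        grep -oE '(max)?vmem=[^,[:space:]]+'    # one key=value pair per output line
}
```

A possible invocation is qstat -j <jobid> | job_vmem.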
qdel <jobid> deletes the job from the queue.
If a job stays pending, it is usually because the requested memory or CPU resources are not available. Sometimes nodes are taken out of the cluster for maintenance, which reduces the available resources.
Here are a couple of steps that might help you understand why your job failed. More often than not, the job exceeded the resources you requested.
qacct -j <jobid>
$ ssh nodeXXX
$ grep <jobid> /var/spool/sge/nodeXXX/messages
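The qacct report is long, and only a few fields are usually needed to see whether a job ran out of resources. The sketch below keeps just those fields. summarize_qacct is a hypothetical helper name, and the choice of fields (failed, exit_status, ru_wallclock, maxvmem) is an assumption about what is most useful here, not a site recommendation.

```shell
# Hypothetical helper: reduce "qacct -j <jobid>" output (read on stdin)
# to the fields most often needed when diagnosing a failed job.
# Assumption: each record line is "<field name> <value...>".
summarize_qacct() {
    awk '$1 == "failed" || $1 == "exit_status" ||
         $1 == "ru_wallclock" || $1 == "maxvmem" {
             key = $1
             $1 = ""               # drop the field name, keep the value text
             sub(/^ +/, "")        # trim the separator left behind
             print key ": " $0
         }'
}
```

A possible invocation is qacct -j <jobid> | summarize_qacct. Comparing maxvmem with the memory you requested shows whether the job went over its limit, and an exit_status above 128 conventionally means the job was ended by a signal (the status minus 128).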