| layout | title | permalink |
|---|---|---|
| page | HGCC | /hgcc/ |
- Revision: 2017-01-03
- Original by: Viren Patel
- Edited by: TS Wingo
- Original markdown file
- Requirements:
- Familiarity with Linux command line
- Secure Shell (SSH) setup on your computer.
- HGCC consists of one head node and 9 compute nodes.
- The compute nodes have varying amounts of RAM, CPU cores, and local scratch space (/tmp)
- The head node is called node00
- Compute nodes are called node01, node02, …
- HGCC uses Sun Grid Engine (SGE) to schedule and run jobs on the cluster.
- The main command for submitting jobs is qsub.
- This tutorial will demonstrate how to use qsub effectively.
- There are two queues defined on HGCC: b.q and i.q

b.q

- For batch (non-interactive) jobs
- Restricted to node01 – node06
- Job defaults:
  - 1 core / 8 GB RAM
  - 240 hours maximum run time
- Requestable resources:
  - Cores
  - Run time
  - Memory is not requestable per se; you get 8 GB per core requested (see the notes on requesting additional cores below).
i.q

- For interactive jobs, e.g. to run a program with a GUI or one that requires command-line access
- Restricted to node07 – node09
- Job defaults:
  - 1 core / 8 GB RAM
  - 24 hours maximum run time
- Requestable resources:
  - Cores
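As a quick orientation, the two queues are typically used as follows (a minimal sketch; myscript.sh is a placeholder, and each command is covered in detail below):

```sh
# Batch work: submit a job script to b.q with qsub
qsub -q b.q -cwd -j y myscript.sh

# Interactive work: open a shell on a compute node via i.q with qlogin,
# or run a single command there with qrsh
qlogin -q i.q
qrsh -q i.q 'hostname'
```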
- Use local scratch space
- Use shell scripts
- Use modules
- Use common data sets already available
Common data sets, e.g. the hg38 reference genome, are available in /sw/hgcc/Data. Check there first before downloading data to your home directory.

Help reduce data duplication: if you already have these data in your home directory, please delete them; if you need a data set that is not yet there, make a request.
- This reduces the network traffic to the /home volume
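For example (the hg38 path below is illustrative; list the directory to see what is actually installed):

```sh
# See which shared data sets already exist before downloading your own copy
ls /sw/hgcc/Data

# Point your scripts at the shared copy instead of duplicating it under /home
# (illustrative path; use whatever `ls` shows)
REF=/sw/hgcc/Data/hg38/hg38.fa
```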
- Each compute node has a 1TB /scratch partition
- Note: /scratch on node01 is distinct from /scratch on node02
- Recommended mode of operation:
  - Create a unique folder in /scratch (e.g. /scratch/x6d3es)
  - Copy your data to the unique folder
    - Do not copy your entire data set; copy only the data file(s) you need, e.g. the two fastq files for the one sample you’re mapping
  - Process and write results to the unique folder
  - Copy results from the unique folder to /home/<your username>/<Project Name>/
  - Delete the unique folder
Sometimes the input data are large (e.g. WGS data). In that case, do not copy them to the unique folder; read them directly from /home, but still use the unique folder to process the data (and write output), then copy the results back to /home.
- Create an SGE job script to run your program/pipeline and copy data to and from /scratch
  - Using a job script allows running multiple commands (a pipeline) as one job
- Use qsub to submit your script to SGE. SGE will schedule the script to run on a compute node.
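A minimal sketch of such a job script, condensing the /scratch workflow above (paths follow the recommended project layout described below; the full FastQC script in Example 1 follows the same pattern):

```sh
#!/bin/sh
# Skeleton job script: stage data to local /scratch, process it there, copy results back.
# Expects the sample name (the file name without .fastq.gz) as its only argument.

TMPDIR=$(mktemp -d -p /scratch/) || exit 1        # unique folder on this compute node
cp ${HOME}/project/data/$1.fastq.gz ${TMPDIR}/    # copy only the file(s) this job needs

# ... run your program here, writing its output into ${TMPDIR} ...

rm ${TMPDIR}/$1.fastq.gz                          # remove the input so it is not copied back
rsync -av ${TMPDIR}/ ${HOME}/project/output/$1/   # copy results back to /home
rm -rf ${TMPDIR}                                  # clean up local scratch
```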
- Commands (square brackets indicate optional information)
```sh
module avail                    # Display available modules
module load <name[/version]>    # Load a module
module list                     # List loaded modules
module unload <name[/version]>  # Unload a module
module purge                    # Unload all loaded modules
```
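For example (the version number shown is illustrative; `module avail` lists what is actually installed):

```sh
module avail                 # see which modules and versions are installed
module load R/3.6.1          # load a specific version (version shown is illustrative)
module list                  # confirm what is loaded
module purge                 # unload everything when finished
```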
- Create a folder to hold all files related to the task/project
- Recommended folder structure
```
${HOME}/project
${HOME}/project/data
${HOME}/project/refs
${HOME}/project/logs
${HOME}/project/output
${HOME}/project/sge
```
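A one-line way to create this layout (brace expansion, as supported by the cluster's bash shell):

```sh
mkdir -p ${HOME}/project/{data,refs,logs,output,sge}
```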
- Create the job submission script in ${HOME}/project/sge
- Recommended: create a separate script for each step, e.g. FastQC, mapping, calling, etc.
- Give your scripts descriptive names, e.g. step01_fastqc.sh:
```sh
#!/bin/sh
# This script requires a single parameter when called – the portion of the
# file name preceding '.fastq.gz' or '.bam'. This is usually the <sample_name>.
# The output directory (OUTDIR) needs to exist.

# load the FastQC module, which gives you the fastqc program
module load FastQC

# set directory variable names
PRJDIR="${HOME}/project"
DATADIR="${PRJDIR}/data"
OUTDIR="${PRJDIR}/output/FastQC"

# create a unique folder on the local compute node's /scratch drive
if [ -e /bin/mktemp ]; then
    TMPDIR=$(/bin/mktemp -d -p /scratch/) || exit 1
elif [ -e /usr/bin/mktemp ]; then
    TMPDIR=$(/usr/bin/mktemp -d -p /scratch/) || exit 1
else
    echo "Error. Cannot find mktemp to create tmp directory"
    exit 1
fi

# copy the data to the unique folder
cp ${DATADIR}/$1.fastq.gz ${TMPDIR}

# run fastqc on the data
fastqc -o ${TMPDIR} --noextract ${TMPDIR}/$1.fastq.gz

# remove the original fastq file so it is not copied back with the results
/bin/rm ${TMPDIR}/$1.fastq.gz

# copy the results from local scratch back to your user directory
rsync -av ${TMPDIR}/ ${OUTDIR}/$1

# remove the temp directory
/bin/rm -fr ${TMPDIR}

# unload the FastQC module
module unload FastQC
```
- Submit your job:

```sh
# change to the log folder
cd ${HOME}/project/logs
# submit the job
qsub -q b.q -cwd -j y ../sge/step01_fastqc.sh <sample_name>
```
- This command will run your job, generate logs in the current directory, and merge the .o and .e files into one.
- You may include other SGE options in the above command line or in your script.
- One useful option is to have SGE email you when the job completes:
qsub -q b.q -cwd -j y -m e -M <your_email> ../sge/step01_fastqc.sh <sample_name>
- SGE options may also be included in the job script instead of specifying them on the command line.
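For example, the options used above could instead sit at the top of the script as `#$` directives (a sketch; keep or drop whichever options you need):

```sh
#!/bin/sh
#$ -q b.q              # queue to use
#$ -cwd                # run in (and write logs to) the directory you submit from
#$ -j y                # merge the .o and .e log files
#$ -m e                # send mail when the job ends
#$ -M <your_email>

# ... rest of the job script as before ...
```

With the directives embedded, the job can be submitted from the logs folder with just `qsub ../sge/step01_fastqc.sh <sample_name>`.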
- Assumption: you have many fastq files in your project/data folder.
- Complete steps 1 and 2 as shown for Example 1.
- Here's a shell loop to submit a job for each of your fastq files (note the relative path to the data folder):

```sh
# change to the log directory
cd ${HOME}/project/logs
# loop through your gzipped fastq files
for F in $(find ../data -name '*.fastq.gz' -print); do
    S=$(basename $F | sed 's/\.fastq\.gz$//')
    qsub -q b.q -cwd -j y ../sge/step01_fastqc.sh $S
done
```
Note: the qsub limit is 500 jobs. For larger numbers, use array jobs (a sketch follows below).
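A hedged sketch of that alternative: build a file list once, submit a single array job with `-t`, and let each task pick its sample using the `$SGE_TASK_ID` variable that SGE sets (the step01_fastqc_array.sh script name is hypothetical):

```sh
# build the sample list once (one sample name per line)
find ../data -name '*.fastq.gz' | sed 's#.*/##; s/\.fastq\.gz$//' > samples.txt
N=$(wc -l < samples.txt)

# submit one array job with N tasks instead of N separate jobs
qsub -q b.q -cwd -j y -t 1-${N} ../sge/step01_fastqc_array.sh

# inside step01_fastqc_array.sh, each task selects its own sample:
#   SAMPLE=$(sed -n "${SGE_TASK_ID}p" samples.txt)
```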
- Use qstat to check the status of your jobs
  - qstat by itself will list only your jobs
  - To list all currently running and scheduled jobs: qstat -u '*'
- Use qdel to delete a job
  - qdel takes the job ID from qstat

```sh
# usage: qdel <job Id>
qdel 37788
```

- Of course, you can only delete your own jobs:

```
qdel 37788
vpatel - you do not have the necessary privileges to delete the job "37788"
```
- To request additional cores, use the smp parallel environment:

qsub -q b.q -pe smp 4 …

- Notes:
  - Requesting additional cores also provides additional memory
    - 1 core = 8 GB, 2 cores = 16 GB, 4 cores = 32 GB, …
  - Your program(s) must be able to take advantage of multiple cores or additional memory.
    - You may have to specify this via the program's command-line options, e.g. the -p option for bowtie2. See the bowtie2 manual and the sketch after this list.
  - The smp parallel environment requires that the requested number of cores be free/available on a single node; otherwise your job will not run.
  - Using more cores/memory may not result in a dramatic performance improvement. Think about breaking your analysis into multiple jobs/steps and running those concurrently on multiple nodes.
  - Multiple small jobs may be more efficient than a single large job; this approach is also more user-friendly.
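For example, a sketch of matching a program's thread count to the cores you requested: SGE sets `$NSLOTS` to the number of granted slots, which can be passed straight to the program (the step02_bowtie2.sh script name and the bowtie2 arguments are illustrative):

```sh
# request 4 cores (and therefore 32 GB of memory) on b.q
qsub -q b.q -cwd -j y -pe smp 4 ../sge/step02_bowtie2.sh <sample_name>

# inside the script, pass the granted core count to bowtie2:
#   bowtie2 -p ${NSLOTS} -x <index> -1 $1_1.fastq.gz -2 $1_2.fastq.gz -S $1.sam
```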
- To request a maximum run time, use the h_rt resource (hh = hours, mm = minutes, ss = seconds):

qsub -q b.q -l h_rt=hh:mm:ss ...

- Notes:
  - The default run time for batch jobs is 240 hours.
    - This is sufficient for 99.9% of jobs on HGCC. If your job is taking more than 240 hours to run, it is probably stuck and should be terminated.
  - You can also request a shorter run time, e.g. for testing purposes: qsub -q b.q -l h_rt=1:00:00 ...
    - This will allow your job to run for at most one hour and then terminate it automatically.
  - Interactive jobs have a maximum run time of 24 hours.
- Use qrsh to run a command on the interactive queue:

qrsh -q i.q 'hostname'

- To run an interactive program like R:

qrsh -q i.q 'module load R && R --no-save && module unload R'

- Note: --no-save, --save, or --vanilla is required to run R via the interactive queue
- To run GUI programs, OS X/Windows users will need to install X server software
  - OS X: install XQuartz
  - Windows: install Xming
- Example setup:

```sh
ssh -X <username>@<HGCC head node>  # use the -X option when you ssh to HGCC
qlogin -q i.q                       # use qlogin to establish a session on a compute node
xterm                               # launch the GUI program
```