Common mistakes to avoid
Mistakes happen, but with a little know-how, you can steer clear of common pitfalls that might trip you up on your RCC journey. Here's what to watch out for:
- Don't Overstuff Your Storage: Do not go over your storage quota.
Going over your storage quota can spell trouble, leading to job failures, cryptic error messages, and X11 forwarding woes. Keep tabs on your usage with the accounts quota command and avoid unnecessary headaches. Learn more about what to do if you find yourself over quota here.
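A quick way to keep an eye on usage from a login node (the exact report format varies by system, and the largest-directories check below is just standard shell, not an RCC-specific tool):

```bash
# Report your usage against the home, project, and scratch quotas (RCC's accounts tool).
accounts quota

# If you are over quota, find out where the space is going, e.g. the
# largest directories directly under your home:
du -h --max-depth=1 ~ 2>/dev/null | sort -h | tail
```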
- Beware of Conda Environments: Do not install conda environments using conda create --name=<env_name>.
Installing Anaconda environments in the wrong directory can quickly eat up precious storage space. By default, this command installs a virtual conda environment into the /home directory, which has a low quota. Because Anaconda environments often require a large amount of storage and a large number of files, this can easily result in exceeding your quota. Instead, create virtual environments inside the /project or /project2 directories by running conda create --prefix=/project/<PI_CnetID>/<CNetID>/anaconda/<env_name>, as in the sketch below.
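A minimal sketch of the recommended pattern, assuming the Anaconda module name shown is the one available on your cluster and using the same placeholders as above:

```bash
# Load a full-named Anaconda module first (check `module avail python` for the exact name).
module load python/anaconda-2022.05

# Create the environment under /project so it does not consume your /home quota.
conda create --prefix=/project/<PI_CnetID>/<CNetID>/anaconda/<env_name> python=3.11

# Activate it by path rather than by name.
conda activate /project/<PI_CnetID>/<CNetID>/anaconda/<env_name>
```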
- Do not run jobs on the login nodes.
The login nodes are for quick tasks, not heavy lifting. Save the computational heavyweights for the Slurm job scheduler, or risk getting your account benched. When you connect to a cluster via SSH, you land on the login node, which all users share. The login node is reserved for submitting jobs, compiling codes, installing software, and running short tests that use only a few CPU cores and finish within a few minutes. Anything more intensive must be submitted to the Slurm job scheduler as either a batch or interactive job. Failure to comply with this rule may result in your account being temporarily suspended.
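For anything heavier than a quick test, put the work in a batch script and hand it to Slurm; a minimal sketch, with the partition, account, and program names as placeholders for whatever your allocation actually uses:

```bash
#!/bin/bash
#SBATCH --job-name=my-analysis
#SBATCH --partition=<partition_name>   # the compute partition your allocation gives you
#SBATCH --account=<PI_account>
#SBATCH --ntasks=1
#SBATCH --time=01:00:00

# The heavy lifting happens on a compute node, not the login node.
./run_analysis
```

Submit it with `sbatch my_job.sbatch`; for hands-on work, request an interactive job instead of running the computation directly on the login node.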
- Stay Offline in Jobs: Do not try to access the Internet from batch or interactive jobs.
Jobs running under Slurm have no Internet access, so plan ahead and make sure any necessary downloads or installations are done on the login node before you hit submit. All jobs submitted to the Slurm job scheduler run on compute nodes without Internet access, including ThinLinc sessions. Because of this, a running job cannot download files, install packages, or connect to GitHub. You will need to perform these operations on the login node before submitting the job.
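A typical workflow sketch: fetch everything while you are on the login node, then submit a job that only touches files already on disk (the repository URL and file names are placeholders):

```bash
# On the login node (which has Internet access):
git clone https://github.com/<user>/<repo>.git
pip install --user -r <repo>/requirements.txt

# Now submit the job; the compute node it lands on cannot reach the network.
sbatch run_job.sbatch
```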
- One Core is Enough for Serial Jobs: Do not allocate more than one CPU core for serial jobs.
Don't waste precious service units by allocating more CPU cores than you need: for a serial job, keep it efficient and stick to one core. Serial codes cannot run in parallel, so using more than one CPU core will not cause the job to run faster. Instead, doing so will waste SUs allocated to your PI. Visit the Batch jobs page of the user guide to learn about job arrays, which are a convenient way to submit a large number of independent processing jobs, and about running parallel batch jobs.
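As a sketch, a job array keeps each independent serial task on a single core; the array size, file names, and program are illustrative only:

```bash
#!/bin/bash
#SBATCH --job-name=serial-array
#SBATCH --ntasks=1              # a serial code cannot use more than one core
#SBATCH --cpus-per-task=1
#SBATCH --time=00:30:00
#SBATCH --array=1-100           # 100 independent tasks, one core each

# Each array task processes its own input file.
./process_input input_${SLURM_ARRAY_TASK_ID}.dat
```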
- Scale Up Responsibly: Do not run jobs with a parallel code without conducting a scaling analysis.
Before diving into parallel computing, make sure you've done your scaling analysis; it's the smart move for maximizing efficiency and resource utilization. If your code runs in parallel, you need to determine the optimal number of nodes and CPU cores, and the same is true if it can use multiple GPUs. To do this, perform a scaling analysis.
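A scaling analysis can be as simple as timing short, representative runs at increasing core counts and noting where the speedup stops being proportional; the sketch below assumes an MPI code, and the program name and core counts are placeholders:

```bash
# Submit the same short benchmark at several core counts and compare wall times
# (e.g. from the job output or `sacct`) once the jobs finish.
for n in 1 2 4 8 16 32; do
    sbatch --job-name=scale_$n --ntasks=$n \
        --wrap="mpirun -np $n ./my_parallel_code --benchmark"
done
```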
- GPU for GPU Codes Only: Do not request a GPU for a code that can only use CPUs. Only codes explicitly written to use GPUs can take advantage of GPUs.
Don't hog the GPUs unless your code can actually put them to good use; allocating resources you don't need only slows things down for everyone else. Requesting a GPU for a CPU-only code will not speed up the execution time, but it will increase your queue time, waste resources, and lower the priority of your next job submission. Furthermore, some codes are written to use only a single GPU, so requesting more than one does not help them either.
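If, and only if, your code is GPU-enabled, request the GPU explicitly; a sketch, with the partition, module, and program names as placeholders:

```bash
#!/bin/bash
#SBATCH --partition=gpu        # the GPU partition name varies by cluster
#SBATCH --gres=gpu:1           # request a single GPU; many codes cannot use more than one
#SBATCH --ntasks=1
#SBATCH --time=02:00:00

module load cuda               # only needed for codes built against CUDA
./my_gpu_code
```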
- Keep Your GCC Up-to-Date: Do not use the system GCC when a newer version is needed.
Don't let outdated tools hold you back: if you need a newer version of GCC, load it up and keep your development environment current. The GNU Compiler Collection (GCC) provides a suite of compilers (gcc, g++, gfortran) and related tools. You can determine the version of the system GCC by running the command "gcc --version". If the system version is insufficient, load the appropriate environment module to make a newer version available; run the command "module avail gcc" to see what is installed. Learn more about environment modules in the Software section of the user guide.
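In practice the check-and-fix looks like this; the version string loaded below is only an example of what `module avail gcc` might list on your system:

```bash
# See which GCC the system provides by default.
gcc --version

# List the newer GCC versions available as environment modules.
module avail gcc

# Load one explicitly (example version) and confirm the change.
module load gcc/10.2.0
gcc --version
```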
- Name That Module: Do not load environment modules using only the partial name.
When loading environment modules, be specific: don't leave it to chance, and use the full name to avoid any surprises down the line. A common practice for Python users is to issue the "module load python" command. You should always specify the full name of the environment module (e.g., module load python/anaconda-2022.05) to ensure the default version hasn't changed. Also, avoid loading environment modules in your ~/.bashrc file; instead, do this in Slurm scripts and on the command line when needed.
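For example (the version string is illustrative; check `module avail python` for what your cluster actually provides):

```bash
# Ambiguous: loads whatever the current default happens to be, which can change.
module load python

# Better: pin the exact version so your jobs keep getting the same environment.
module load python/anaconda-2022.05
```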
- Mind Your Scratch: Do not write temporary files to /scratch if your job has high-throughput I/O.
If your job deals with lots of small files, think twice about dumping them in /scratch; consider your options and choose wisely to optimize performance. Temporary files may accumulate and exceed your quota (in size and/or number of files) unless they are removed in time. If your job is not distributed across multiple nodes and has high-throughput I/O over many small files (size < 4 MB), it may perform faster if you write temporary files to local rather than global scratch, as in the sketch below. You can learn about the different types of scratch and when to use them in the Storage section of the user guide.
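A sketch of the local-scratch pattern, assuming node-local scratch is exposed through a directory such as $TMPDIR; check the Storage section for the actual local scratch path on your cluster, and treat the program and file names as placeholders:

```bash
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=01:00:00

# Assumption: $TMPDIR (or /tmp) is node-local scratch; substitute your cluster's documented path.
WORKDIR=${TMPDIR:-/tmp}/$SLURM_JOB_ID
mkdir -p "$WORKDIR"

# Write the many small temporary files to local scratch, and keep only the results.
./generate_many_small_files --outdir "$WORKDIR"
cp "$WORKDIR"/results.tar.gz /project/<PI_CnetID>/<CNetID>/
rm -rf "$WORKDIR"
```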
- Build Smart: Do not compile and install heavy software using login nodes.
Heavy software installations belong in the build partition, not on the login nodes; keep the system humming along smoothly by following the rules of the road. Software installation time can sometimes be dramatically reduced by using multiple cores. However, compute-intensive jobs are not permitted on login nodes and may be killed to provide equal opportunities for other connected users. Instead of the login nodes, use the build partition dedicated to software installation.
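A sketch of a build job, assuming the dedicated partition is literally named build (check `sinfo` for the real name on your cluster) and a make-based project; paths are placeholders:

```bash
#!/bin/bash
#SBATCH --job-name=build-mycode
#SBATCH --partition=build       # partition dedicated to compiling and installing software
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8       # a parallel make can put several cores to work
#SBATCH --time=01:00:00

cd /project/<PI_CnetID>/<CNetID>/src/mycode
make -j "$SLURM_CPUS_PER_TASK"
```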
By sidestepping these common missteps, you'll be on the fast track to success in no time. Happy computing!