Grid Job Submission and Common Errors

Overview

Teaching: 130 min
Exercises: 0 min

Questions

How to submit grid jobs?

Objectives

Submit a job and understand what’s happening behind the scenes

Monitor the job and look at its outputs

Review best practices for submitting jobs (including what NOT to do)

Extension; submit a small job with POMS

Video Session

The session was captured for your asynchronous review.

Submit a job

Note that job submission requires FNAL account but can be done from a CERN machine, or any other with CVMFS access.

First, log in to a dunegpvm machine (should work from lxplus too with a minor extra step of getting a Fermilab Kerberos ticket on lxplus via kinit). Then you will need to set up the job submission tools (jobsub). If you set up dunetpc it will be included, but if not, you need to do

source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh
setup jobsub_client
mkdir -p /pnfs/dune/scratch/users/${USER}/DUNE_tutorial_May2022 # if you have not done this before

Having done that, let us submit a prepared script:

jobsub_submit -G dune -M -N 1 --memory=1000MB --disk=1GB --cpu=1 --expected-lifetime=1h --resource-provides=usage_model=DEDICATED,OPPORTUNISTIC,OFFSITE -l '+SingularityImage=\"/cvmfs/singularity.opensciencegrid.org/fermilab/fnal-wn-sl7:latest\"' --append_condor_requirements='(TARGET.HAS_Singularity==true&&TARGET.HAS_CVMFS_dune_opensciencegrid_org==true&&TARGET.HAS_CVMFS_larsoft_opensciencegrid_org==true&&TARGET.CVMFS_dune_opensciencegrid_org_REVISION>=1105)' file:///dune/app/users/kherner/submission_test_singularity.sh

If all goes well you should see something like this:

/fife/local/scratch/uploads/dune/kherner/2022-05-11_151253.446116_5339
/fife/local/scratch/uploads/dune/kherner/2022-05-11_151253.446116_5339/submission_test_singularity.sh_20220511_151254_1116939_0_1_.cmd
submitting....
Submitting job(s).
1 job(s) submitted to cluster 32496605.
JobsubJobId of first job: 32496605.0@jobsub03.fnal.gov
Use job id 32496605.0@jobsub03.fnal.gov to retrieve output

Quiz

What is your job ID?

Now, let’s look at some of these options in more detail.

-M sends mail after the job completes whether it was successful for not. The default is email only on error. To disable all emails, use --mail_never.
-N controls the number of identical jobs submitted with each cluster. Also called the process ID, the number ranges from 0 to N-1 and forms the part of the job ID number after the period, e.g. 12345678.N.
--memory, --disk, --cpu, --expected-lifetime request this much memory, disk, number of cpus, and max run time. Jobs that exceed the requested amounts will go into a held state. Defaults are 2000 MB, 10 GB, 1, and 8h, respectively. Note that jobs are charged against the DUNE FermiGrid quota according to the greater of memory/2000 MB and number of CPUs, with fractional values possible. For example, a 3000 MB request is charged 1.5 “slots”, and 4000 MB would be charged 2. You are charged for the amount requested, not what is actually used, so you should not request any more than you actually need (your jobs will also take longer to start the more resources you request). Note also that jobs that run offsite do NOT count against the FermiGrid quota. In general, aim for memory and run time requests that will cover 90-95% of your jobs and use the autorelease feature to deal with the remainder.
--resource-provides=usage_model This controls where jobs are allowed to run. DEDICATED means use the DUNE FermiGrid quota, OPPORTUNISTIC means use idle FermiGrid resources beyond the DUNE quota if they are available, and OFFSITE means use non-Fermilab resources. You can combine them in a comma-separated list. In nearly all cases you should be setting this to DEDICATED,OPPORTUNISTIC,OFFSITE. This ensures maximum resource availability and will get your jobs started the fastest. Note that because of Singularity, there is absolutely no difference between the environment on Fermilab worker nodes and any other place. Depending on where your input data are (if any), you might see slight differences in network latency, but that’s it.
-l (or --lines=) allows you to pass additional arbitrary HTCondor-style classad variables into the job. In this case, we’re specifying exactly what Singularity image we want to use in the job. It will be automatically set up for us when the job starts. Any other valid HTCondor classad is possible. In practice you don’t have to do much beyond the Singularity image. Here, pay particular attention to the quotes and backslashes.
--append_condor_requirements allows you to pass additional HTCondor-style requirements to your job. This helps ensure that your jobs don’t start on a worker node that might be missing something you need (a corrupt or out of date CVMFS repository, for example). Some checks run at startup for a variety of CVMFS repositories. Here, we check that Singularity invocation is working and that the CVMFS repos we need ( [dune.opensciencegrid.org][dune-openscience-grid-org] and [larsoft.opensciencegrid.org][larsoft-openscience-grid-org] ) are in working order. Optionally you can also place version requirements on CVMFS repos (as we did here as an example), useful in case you want to use software that was published very recently and may not have rolled out everywhere yet.

Job Output

This particular test writes a file to /pnfs/dune/scratch/users/<username>/job_output_<id number>.log. Verify that the file exists and is non-zero size after the job completes. You can delete it after that; it just prints out some information about the environment.

More information about jobsub is available here and here.

Manipulating submitted jobs

If you want to remove existing jobs, you can do

jobsub_rm -G dune --jobid=12345678.9@jobsub0N.fnal.gov

to remove all jobs in a given submission (i.e. if you used -N <some number greater than 1>) you can do

jobsub_rm -G dune --jobid=12345678@jobsub0N.fnal.gov

To remove all of your jobs, you can do

jobsub_rm -G dune --user=username

If you want to manipulate only a certian subset of jobs, you can use a HTCondor-style constraint. For example, if I want to remove only held jobs asking for more than say 8 GB of memory that went held because they went over their request, I could do something like

jobsub_rm -G dune --constraint='Owner=="username"&&JobStatus==5&&RequestMemory>=8000&&(HoldReasonCode==34||(HoldReasonCode==26&&HoldReasonSubCode==1))'

To hold jobs, it’s the same procedure as jobsub_rm; just replace that with jobsub_hold. To release a held job (which will restart from the beginning), it’s the same commands as above, only use jobsub_release in place of rm or hold.

if you get tired of typing -G dune all the time, you can set the JOBSUB_GROUP environment variable to dune, and then omit the -G option.

Submit a job using the tarball containing custom code

First off, a very important point: for running analysis jobs, you may not actually need to pass an input tarball, especially if you are just using code from the base release and you don’t actually modify any of it. All you need to do is set up any required software from CVMFS (e.g. dunetpc and/or protoduneana), and you are ready to go. If you’re just modifying a fcl file, for example, but no code, it’s actually more efficient to copy just the fcl(s) your changing to the scratch directory within the job, and edit them as part of your job script (copies of a fcl file in the current working directory have priority over others by default).

Sometimes, though, we need to run some custom code that isn’t in a release. We need a way to efficiently get code into jobs without overwhelming our data transfer systems. We have to make a few minor changes to the scripts you made in the previous tutorial section, generate a tarball, and invoke the proper jobsub options to get that into your job. There are many ways of doing this but by far the best is to use the Rapid Code Distribution Service (RCDS), as shown in our example.

If you have finished up the LArSoft follow-up and want to use your own code for this next attempt, feel free to tar it up (you don’t need anything besides the localProducts* and work directories) and use your own tar ball in lieu of the one in this example. You will have to change the last line with your own submit file instead of the pre-made one.

First, we should make a tarball. Here is what we can do (assuming you are starting from /dune/app/users/username/):

cp /dune/app/users/kherner/setupMay2022Tutorial-grid.sh /dune/app/users/${USER}/
cp /dune/app/users/kherner/may2022tutorial/localProducts_larsoft_v09_48_01_e20_prof/setup-grid /dune/app/users/${USER}/may2022tutorial/localProducts_larsoft_v09_48_01_e20_prof/setup-grid

Before we continue, let’s examine these files a bit. We will source the first one in our job script, and it will set up the environment for us.

#!/bin/bash                                                                                                                                                                                                      

DIRECTORY=may2022tutorial
# we cannot rely on "whoami" in a grid job. We have no idea what the local username will be.
# Use the GRID_USER environment variable instead (set automatically by jobsub). 
USERNAME=${GRID_USER}

source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh
export WORKDIR=${_CONDOR_JOB_IWD} # if we use the RCDS the our tarball will be placed in $INPUT_TAR_DIR_LOCAL.
if [ ! -d "$WORKDIR" ]; then
  export WORKDIR=`echo .`
fi

source ${INPUT_TAR_DIR_LOCAL}/${DIRECTORY}/localProducts*/setup-grid 
mrbslp

Now let’s look at the difference between the setup-grid script and the plain setup script. Assuming you are currently in the /dune/app/users/username directory:

diff may2022tutorial/localProducts_larsoft_v09_48_01_e20_prof/setup may2022tutorial/localProducts_larsoft_v09_48_01_e20_prof/setup-grid

< setenv MRB_TOP "/dune/app/users/<username>/may2022tutorial"
< setenv MRB_TOP_BUILD "/dune/app/users/<username>/may2022tutorial"
< setenv MRB_SOURCE "/dune/app/users/<username>/may2022tutorial/srcs"
< setenv MRB_INSTALL "/dune/app/users/<username>/may2022tutorial/localProducts_larsoft_v09_48_01_e20_prof"
---
> setenv MRB_TOP "${INPUT_TAR_DIR_LOCAL}/may2022tutorial"
> setenv MRB_TOP_BUILD "${INPUT_TAR_DIR_LOCAL}/may2022tutorial"
> setenv MRB_SOURCE "${INPUT_TAR_DIR_LOCAL}/may2022tutorial/srcs"
> setenv MRB_INSTALL "${INPUT_TAR_DIR_LOCAL}/may2022tutorial/localProducts_larsoft_v09_48_01_e20_prof"

As you can see, we have switched from the hard-coded directories to directories defined by environment variables; the INPUT_TAR_DIR_LOCAL variable will be set for us (see below). Now, let’s actually create our tar file. Again assuming you are in /dune/app/users/kherner/may2022tutorial/:

tar --exclude '.git' -czf may2022tutorial.tar.gz may2022tutorial/localProducts_larsoft_v09_48_01_e20_prof may2022tutorial/work setupMay2022Tutorial-grid.sh

Then submit another job (in the following we keep the same submit file as above):

jobsub_submit -G dune -M -N 1 --memory=2500MB --disk=2GB --expected-lifetime=3h --cpu=1 --resource-provides=usage_model=DEDICATED,OPPORTUNISTIC,OFFSITE --tar_file_name=dropbox:///dune/app/users/<username>/may2022tutorial.tar.gz -l '+SingularityImage=\"/cvmfs/singularity.opensciencegrid.org/fermilab/fnal-wn-sl7:latest\"' --append_condor_requirements='(TARGET.HAS_Singularity==true&&TARGET.HAS_CVMFS_dune_opensciencegrid_org==true&&TARGET.HAS_CVMFS_larsoft_opensciencegrid_org==true&&TARGET.CVMFS_dune_opensciencegrid_org_REVISION>=1105&&TARGET.HAS_CVMFS_fifeuser1_opensciencegrid_org==true&&TARGET.HAS_CVMFS_fifeuser2_opensciencegrid_org==true&&TARGET.HAS_CVMFS_fifeuser3_opensciencegrid_org==true&&TARGET.HAS_CVMFS_fifeuser4_opensciencegrid_org==true)' file:///dune/app/users/kherner/run_may2022tutorial.sh

You’ll see this is very similar to the previous case, but there are some new options:

--tar_file_name=dropbox:// automatically copies and untars the given tarball into a directory on the worker node, accessed via the INPUT_TAR_DIR_LOCAL environment variable in the job. As of now, only one such tarball can be specified. If you need to copy additional files into your job that are not in the main tarball you can use the -f option (see the jobsub manual for details). The value of INPUT_TAR_DIR_LOCAL is by default $CONDOR_DIR_INPUT/name_of_tar_file, so if you have a tar file named e.g. may2022tutorial.tar.gz, it would be $CONDOR_DIR_INPUT/may2022tutorial.
Notice that the --append_condor_requirements line is longer now, because we also check for the fifeuser[1-4]. opensciencegrid.org CVMFS repositories.

Now, there’s a very small gotcha when using the RCDS, and that is when your job runs, the files in the unzipped tarball are actually placed in your work area as symlinks from the CVMFS version of the file (which is what you want since the whole point is not to have N different copies of everything). The catch is that if your job script expected to be able to edit one or more of those files within the job, it won’t work because the link is to a read-only area. Fortunately there’s a very simple trick you can do in your script before trying to edit any such files:

cp ${INPUT_TAR_DIR_LOCAL}/file_I_want_to_edit mytmpfile  # do a cp, not mv
rm ${INPUT_TAR_DIR_LOCAL}file_I_want_to_edit # This really just removes the link
mv mytmpfile file_I_want_to_edit # now it's available as an editable regular file.

You certainly don’t want to do this for every file, but for a handful of small text files this is perfectly acceptable and the overall benefits of copying in code via the RCDS far outweigh this small cost. This can get a little complicated when trying to do it for things several directories down, so it’s easiest to have such files in the top level of your tar file.

Monitor your jobs

For all links below, log in with your FNAL Services credentials (FNAL email, not Kerberos password).

What DUNE is doing overall:
https://fifemon.fnal.gov/monitor/d/000000053/experiment-batch-details?orgId=1&var-experiment=dune
What’s going on with only your jobs:
Remember to change the url with your own username and adjust the time range to cover the region of interest. https://fifemon.fnal.gov/monitor/d/000000116/user-batch-details?orgId=1&var-cluster=fifebatch&var-user=kherner
Why your jobs are held:
Remember to choose your username in the upper left.
https://fifemon.fnal.gov/monitor/d/000000146/why-are-my-jobs-held?orgId=1

View the stdout/stderr of our jobs

Here’s the link for the history page of the example job: link.

Feel free to sub in the link for your own jobs.

Once there, click “View Sandbox files (job logs)”. In general you want the .out and .err files for stdout and stderr. The .cmd file can sometimes be useful to see exactly what got passed in to your job.

Kibana can also provide a lot of information.

You can also download the job logs from the command line with jobsub_fetchlog:

jobsub_fetchlog --jobid=12345678.0@jobsub0N.fnal.gov --unzipdir=some_appropriately_named_directory

That will download them as a tarball and unzip it into the directory specified by the –unzipdir option. Of course replace 12345678.0@jobsub0N.fnal.gov with your own job ID.

Quiz

Download the log of your last submission via jobsub_fetchlog or look it up on the monitoring pages. Then answer the following questions (all should be available in the .out or .err files):

On what site did your job run?

How much memory did it use?

Did it exit abnormally? If so, what was the exit code?

Brief review of best practices in grid jobs (and a bit on the interactive machines)

When creating a new workflow or making changes to an existing one, ALWAYS test with a single job first. Then go up to 10, etc. Don’t submit thousands of jobs immediately and expect things to work.
ALWAYS be sure to prestage your input datasets before launching large sets of jobs.
Use RCDS; do not copy tarballs from places like scratch dCache. There’s a finite amount of transfer bandwidth available from each dCache pool. If you absolutely cannot use RCDS for a given file, it’s better to put it in resilient (but be sure to remove it when you’re done!). The same goes for copying files from within your own job script: if you have a large number of jobs looking for a same file, get it from resilient. Remove the copy when no longer needed. Files in resilient dCache that go unaccessed for 45 days are automatically removed.
Be careful about placing your output files. NEVER place more than a few thousand files into any one directory inside dCache. That goes for all type of dCache (scratch, persistent, resilient, etc).
Avoid commands like ifdh ls /path/with/wildcards/*/ inside grid jobs. That is a VERY expensive operation and can cause a lot of pain for many users.
Use xrootd when opening files interactively; this is much more stable than simply doing root /pnfs/dune/...
NEVER copy job outputs to a directory in resilient dCache. Remember that they are replicated by a factor of 20! Any such files are subject to deletion without warning.
NEVER do hadd on files in /pnfs areas unless you’re using xrootd. I.e. do NOT do hadd out.root /pnfs/dune/file1 /pnfs/dune/file2 ... This can cause severe performance degradations.

Submitting with POMS, Part I

POMS is the recommended (i.e., supported) way of submitting large workflows. It offers several advantages over other systems, such as

Fully configurable. Any executables can be run, not necessarily only lar or art
Automatic monitoring and campaign management options
Multi-stage workflow dependencies, automatic dataset creation between stages
Automated recovery options

At its core, in POMS one makes a “campaign”, which has one or more “stages”. In our example there is only a single stage.

For analysis use: main POMS page
An example campaign.

Typical POMS use centers around a configuration file (often more like a template which can be reused for many campaigns) and various campaign-specific settings for overriding the defaults in the config file. An example config file designed to do more or less what we did in the previous submission is here: /dune/app/users/kherner/may2022tutorial/work/pomsdemo.cfg

You can find more about POMS here: POMS User Documentation
Helpful ideas for structuring your config files are here: Fife launch Reference

When you start using POMS you must upload an x509 proxy to the sever before submitting and you need to periodically repeat this as the proxy gets close to expoiration (typically once every few days). If uploading it manually, it must be named x509up_voms_dune_Analysis_yourusername when you upload it. To upload, look for the User Data item in the left-hand menu on the POMS site, choose Uploaded Files, and follow the instructions.

By far the easiest way to upload the proxy, however, is to use the upload_file command from the fife_utils package. To do that, simply first set up fife_utils if not already done:

setup -g analysis fife_utils

And then run the upload file command:

upload_file --experiment=dune --proxy

Typical output will look something like

Fetching options from https://fifebatch.fnal.gov/cigetcertopts.txt
Checking if /tmp/x509up_uNNNNN has at least 1.0 hours left
117.19 hours remaining, enough to reuse
Checking if myproxy.fnal.gov has at least 503.0 hours left
663.62 hours remaining, enough to reuse
uploaded: /tmp/x509up_voms_dune_Analysis_kherner to POMS server

where NNNNN will be your UID. As you see, the utility will automatically upload a proxy and give it the proper name for you.

Here is an example of a campaign that does the same thing as the previous one, using some MC reco files from ProtoDUNE SP Prod4a, but does it via making a SAM dataset using that as the input: POMS campaign stage information. Of course, before running any SAM project, we should prestage our input definition(s). The way most people do that is to do

samweb prestage-dataset kherner-may2022tutorial-mc

replacing the above definition with your own definition as appropriate. However, this does NOT reset the clock on the LRU algorithm, because if prestage-dataset sees a file is already chached, it goes on to the next one; it does no lifetime or last access checking, nor does it read the file. A better way to prestage is to instead do

unsetup curl # necessary as of May 2022 because there's an odd interaction with the UPS version of curl, so we need to turn it off
samweb run-project --defname=kherner-may2022tutorial-mc --schema https 'echo %fileurl && curl -L --cert $X509_USER_PROXY --key $X509_USER_PROXY --cacert $X509_USER_PROXY --capath /etc/grid-security/certificates -H "Range: bytes=0-3" %fileurl && echo'

This reads the first four bytes of each file, which will reset the LRU clock. Note you will need to have the X509_USER_PROXY environment variable set. Most of the time that will simply be set as

export X509_USER_PROXY=/tmp/x509up_u$(id -u)

if you ever find yourself doing work under a shared account (dunepro for example) you should NOT manually set X509_USER_PROXY in this way.

Side note: Some people will pass file lists to their jobs instead of using a SAM dataset. We do not recommend that for two reasons: 1) Lists do not protect you from cases where files fall out of cache at the location(s) in your list. When that happens your jobs sit idle waiting for the files to be fetched from tape, which kills your efficiency and blocks resources for others. 2) You miss out on cases where there might be a local copy of the file at the site you’re running on, or at least at closer one to your list. So you may end up unecessarily streaming across oceans, whereas using SAM (or later Rucio) will find you closer, local copies when they exist.

Another important side note: If you are used to using other programs for your work such as project.py (which is NOT officially supported by DUNE or the Fermilab Scientific Computing Division), there is a helpful tool called Project-py that you can use to convert existing xml into POMS configs, so you don’t need to start from scratch! Then you can just switch to using POMS from that point forward. As a reminder, if you use unsupported tools, you are own your own and will receive NO SUPPORT WHATSOEVER. You are still responsible for making sure that your jobs satisfy Fermilab’s policy for job efficiency: https://cd-docdb.fnal.gov/cgi-bin/sso/RetrieveFile?docid=7045&filename=FIFE_User_activity_mitigation_policy_20200625.pdf&version=1

previous episode

DUNE Computing Training May 2022 edition

next episode