DUNE Computing Training December 2021 edition

Workshop Welcome and Introduction


  What should I expect in participating in this workshop?

  • Introduce instructors and mentors.

  • Provide an overview of the schedule.

  • Spotlight the helpful network provided by the Slack channel.

DUNE Computing Consortium

The goal of the DUNE Computing Consortium is to establish a global computing network that can handle the massive data dumps DUNE will produce by distributing them across the grid. It coordinates all DUNE computing activities and provides new members with the documentation and training they need to become acquainted with DUNE-specific software and resources.

Coordinator: Heidi Schellman (Oregon State University)

Tutorial Instructors


Lecturers (in order of appearance in the schedule):



This workshop is a half-day version of a workshop that normally spans three days.

Opening Slides

The slides for the introduction of this tutorial can be found here, or as a PDF on the Indico site.

You can join DUNE's Slack: We created a special channel computing_training_dec2021 for technical support.

  • This workshop is brought to you by the DUNE Computing Consortium.

  • The goal is to give you the computing basics you need to work on DUNE.

Storage Spaces


  • What are the types and roles of DUNE’s data volumes?

  • What are the commands and tools to handle data?

  • Understanding the data volumes and their properties

  • Displaying volume information (total size, available size, mount point, device location)

  • Differentiating the commands to handle data between grid accessible and interactive volumes

There are three types of storage volumes that you will encounter at Fermilab: local hard drives, network attached storage, and large-scale, distributed storage. Each has its own advantages and limitations, and knowing which one to use when isn't always straightforward or obvious. But with some foresight, you can avoid the common pitfalls that have caught out other users.


What is immutable? Describing a file as immutable means that once that file is written to the volume it cannot be modified. It can only be read, moved, or deleted. A volume that only supports immutable files is not a good choice for code or other files you want to change or edit often.

What is POSIX access? On interactive nodes, some volumes have POSIX (Portable Operating System Interface) access, which allows users to directly read, write and modify files using standard commands, e.g. vi, emacs, sed, or from within the bash scripting language.

What is meant by 'grid accessible'? Volumes that are only grid accessible require specific tool suites to enable access to files stored there or to copy files to the storage volume. This will be explained in the following sections.

Interactive POSIX storage volumes (General Purpose Virtual Machines)

Home area is similar to the user's local hard drive but network mounted

Locally mounted volumes are local physical disks, mounted directly on interactive node

Network Attached Storage (NAS) behaves similarly to a locally mounted volume

Grid-accessible storage volumes

At Fermilab, an instance of dCache+Enstore is used for large-scale, distributed storage with capacity for more than 100 PB of storage and O(10000) connections. Whenever possible, these storage elements should be accessed over xrootd (see next section) as the mount points on interactive nodes are slow and unstable. Here are the different dCache volumes:

Persistent dCache: the data in the file is actively available for reads at any time and will not be removed until manually deleted by the user. DO NOT USE THIS VOLUME TO DISTRIBUTE CODE TARBALLS!!! Quotas will be established in the near future.

Scratch dCache: large volume shared across all experiments. When a new file is written to scratch space, older files are removed in order to make room for the newer file. Removal is based on Least Recently Used policy.

Resilient dCache: handles custom user code for grid jobs, often in the form of a tarball. It is inappropriate to store any other files here. This volume is deprecated; use RCDS via CVMFS instead.

Tape-backed dCache: disk based storage areas that have their contents mirrored to permanent storage on Enstore tape.
Files are not always available for immediate read from disk, but may need to be ‘staged’ from tape first. Checking file status before access is critical.

Summary on storage spaces

Full documentation: Understanding Storage Volumes

In the following table, <exp> stands for the experiment (uboone, nova, dune, etc…)

Volume | Quota/Space | Retention Policy | Tape Backed? | Retention Lifetime on disk | Use for | Path | Grid Accessible
Persistent dCache | No/~100 TB/exp | Managed by Experiment | No | Until manually deleted | immutable files w/ long lifetime (NO CODE TARBALLS!!!) | /pnfs/<exp>/persistent | Yes
Scratch dCache | No/no limit | LRU eviction - least recently used file deleted | No | Varies, ~30 days (NOT guaranteed) | immutable files w/ short lifetime | /pnfs/<exp>/scratch | Yes
Resilient dCache | No/no limit | Periodic eviction if file not accessed | No | Approx 30 days (your experiment may have an active clean up policy) | input tarballs with custom code for grid jobs (do NOT use for grid job outputs) | /pnfs/<exp>/resilient | Yes
Tape backed dCache | No/O(10) PB | LRU eviction (from disk) | Yes | Approx 30 days | Long-term archive | /pnfs/dune/… | Yes
NAS Data | Yes (~1 TB)/ 32+30 TB total | Managed by Experiment | No | Until manually deleted | Storing final analysis samples | /dune/data | No
NAS App | Yes (~100 GB)/ ~15 TB total | Managed by Experiment | No | Until manually deleted | Storing and compiling software | /dune/app | No
Home Area (NFS mount) | Yes (~10 GB) | Centrally Managed by CCD | No | Until manually deleted | Storing global environment scripts (All FNAL Exp) | /nashome/<letter>/<uid> | No

Commands and tools

This section will teach you the main tools and commands to display storage information and access data.

The df command

To find out what types of volumes are available on a node, use the df command. The -h option requests human-readable sizes. The output lists a lot of information about each volume (total size, available size, mount point, device location).

df -h
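
df also accepts one or more paths, in which case it reports only the volume that holds each path. The following is a minimal sketch of that usage; show_volume is a hypothetical helper name, not a DUNE tool, and the two paths are ordinary examples that exist on most nodes.

```bash
#!/bin/sh
# Print the df summary for the volume that holds a given path.
# show_volume is an illustrative helper, not part of any DUNE tool suite.
show_volume() {
    if [ -e "$1" ]; then
        # -h: human-readable sizes (K, M, G, T)
        df -h "$1"
    else
        echo "$1: not mounted on this node"
    fi
}

for path in "$HOME" /tmp; do
    show_volume "$path"
done
```

The "Mounted on" column of the output shows where the volume is attached, which is the column to look at when working out which kind of storage a directory lives on.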

Exercise 1

From the output of the df -h command, identify:

  1. the home area
  2. the NAS storage spaces
  3. the different dCache volumes


Another useful data handling command you will soon come across is ifdh. This stands for Intensity Frontier Data Handling. It is a tool suite that facilitates selecting the appropriate data transfer method from many possibilities while protecting shared resources from overload. You may see ifdhc, where c refers to client.

Here is an example of copying a file. Refer to the Mission Setup for how to set DUNETPC_VERSION.

source /cvmfs/
setup dunetpc $DUNETPC_VERSION -q e19:prof #use DUNETPC_VERSION v09_22_02
setup_fnal_security /pnfs/dune/tape_backed/dunepro/physics/full-reconstructed/2019/mc/out1/PDSPProd2/22/60/37/10/PDSPProd2_protoDUNE_sp_reco_35ms_sce_off_23473772_0_452d9f89-a2a1-4680-ab72-853a3261da5d.root
ifdh cp root:// /dev/null

Resource: ifdh commands

Exercise 2

Using the ifdh command, complete the following tasks:

  • create a directory in your dCache scratch area (/pnfs/dune/scratch/users/${USER}/) called “DUNE_tutorial_Dec2021”
  • copy your ~/.bashrc file to that directory.
  • copy the .bashrc file from your dCache scratch directory DUNE_tutorial_Dec2021 to /dev/null
  • remove the directory DUNE_tutorial_Dec2021 using “ifdh rmdir /pnfs/dune/scratch/users/${USER}/DUNE_tutorial_Dec2021”

Note: if the destination for an ifdh cp command is a directory instead of a filename with full path, you have to add the “-D” option to the command line. Also, a directory must be empty before it can be deleted.


The eXtended ROOT daemon (XRootD) is a software framework designed for accessing data from various architectures in a fully scalable way (in size and performance).

XRootD is most suitable for read-only data access. XRootD Man pages

Issue the following commands and try to understand how the first command enables completing the parameters for the second command.

pnfs2xrootd /pnfs/dune/scratch/users/${USER}/
xrdfs root:// ls /pnfs/${USER}/
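
The first command turns a POSIX-style /pnfs path into an xrootd URI, and the second command needs the two halves of that URI separately: the server before the ls, the path after it. The sketch below imitates the translation in plain shell so that both pieces are visible. It is not the real pnfs2xrootd: the server name door.example.org, the port and the /pnfs/fnal.gov/usr prefix are placeholder assumptions chosen for illustration, so always rely on the output of pnfs2xrootd itself.

```bash
#!/bin/sh
# Illustrative stand-in for pnfs2xrootd, NOT the real tool.
# XROOTD_DOOR and XROOTD_PREFIX are hypothetical placeholder values.
XROOTD_DOOR="${XROOTD_DOOR:-door.example.org:1094}"
XROOTD_PREFIX="${XROOTD_PREFIX:-/pnfs/fnal.gov/usr}"

pnfs_to_uri() {
    case "$1" in
        /pnfs/*)
            # Drop the leading /pnfs/ and graft the rest onto the server URI.
            echo "root://${XROOTD_DOOR}/${XROOTD_PREFIX}/${1#/pnfs/}"
            ;;
        *)
            echo "not a /pnfs path: $1" >&2
            return 1
            ;;
    esac
}

uri=$(pnfs_to_uri "/pnfs/dune/scratch/users/${USER:-someuser}/")
server=${uri%%/pnfs/*}          # everything before the path part
path="/pnfs/${uri#*/pnfs/}"     # the path as the server sees it
echo "URI:    $uri"
echo "server: $server"
echo "path:   $path"
```

With the real tool, the server part of its output is what goes between xrdfs and ls, and the path part is what follows ls.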

Let's practice

Exercise 3

Using a combination of ifdh and xrootd commands discussed previously:

  • Use ifdh locateFile to find the directory for this file PDSPProd4a_protoDUNE_sp_reco_stage1_p1GeV_35ms_sce_off_43352322_0_20210427T162252Z.root
  • Use pnfs2xrootd to get the xrootd URI for that file.
  • Use xrdcp to copy that file to /dev/null
  • Using xrdfs and the ls option, count the number of files in the same directory as PDSPProd4a_protoDUNE_sp_reco_stage1_p1GeV_35ms_sce_off_43352322_0_20210427T162252Z.root

Note that redirecting the standard output of a command into the command wc -l will count the number of lines in the output text. e.g. ls -alrth ~/ | wc -l
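
Here is a small self-contained illustration of the wc -l idiom, using a throwaway directory so that the expected count is known. Keep in mind that a long listing (ls -l) adds a "total" line and that -a adds the . and .. entries, so plain ls is the safer form when an exact file count is wanted. count_entries is just an illustrative helper name.

```bash
#!/bin/sh
# Count directory entries by piping a listing into wc -l.
count_entries() {
    # ls prints one name per line when its output goes to a pipe;
    # tr strips the padding that some wc implementations add.
    ls "$1" | wc -l | tr -d ' '
}

demo_dir=$(mktemp -d)
touch "$demo_dir/a.root" "$demo_dir/b.root" "$demo_dir/c.root"
count_entries "$demo_dir"    # prints 3
rm -r "$demo_dir"
```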

  • Home directories are centrally managed by Computing Division and meant to store setup scripts and text files.

  • Home directories are NOT for storage of certificates or tokens.

  • Network attached storage (NAS) /dune/app is primarily for code development.

  • The NAS /dune/data is for storing ntuples and small datasets.

  • dCache volumes (tape, resilient, scratch, persistent) offer large storage with various retention lifetimes.

  • The tool suites ifdh and XRootD allow for accessing data with the appropriate transfer method and in a scalable way.

Data Management


  • What are the data management tools and software for DUNE?

  • How are different software versions handled?

  • What are the best data management practices?

  • Learn how to access data from DUNE Data Catalog

  • Understand the roles of the tools UPS, mrb and CVMFS

DUNE data is stored around the world and the storage elements are not always organized in a way that they can be easily inspected. For this purpose we use the SAM web client.

What is SAM?

Sequential Access via Metadata (SAM) is a data catalog originally designed for the D0 and CDF high energy physics experiments at Fermilab. It is now used by most of the Intensity Frontier experiments at Fermilab. The most important objects cataloged in SAM are individual files and collections of files called datasets.

Data files themselves are not stored in SAM; their metadata and physical locations are. Via the metadata, you can search for and locate collections of files. SAM also provides mechanisms for initiating and tracking file delivery through projects.

This lecture will show you how to access data files that have been defined to the DUNE Data Catalog. Execute the following commands after logging in to the DUNE interactive node, and sourcing the main dune setups.

What is Rucio?

Rucio is the next-generation Data Replica service and is part of DUNE’s new Distributed Data Management (DDM) system that is currently in deployment. Rucio has two functions:

  1. A rule-based system to get files to Rucio Storage Elements (RSEs) around the world and keep them there for the lifetime of the file.
  2. To return the “nearest” replica of any data file for use either in interactive or batch file use. It is expected that most DUNE users will not be regularly using direct Rucio commands, but other wrapper scripts that call them indirectly.

As of the date of this Dec 2021 tutorial:

rucio list-file-replicas protodune-sp:np04_raw_run005801_0001_dl1.root

rucio download protodune-sp:np04_raw_run005801_0001_dl1.root

rucio list-rses

Back to SAM

samweb client

samweb is the command-line and Python API that allows queries of the SAM metadata, creation of datasets, and tools to track and deliver information to batch jobs.

samweb can be acquired from ups via:

source /cvmfs/
setup dunetpc $DUNETPC_VERSION -q e19:prof #use DUNETPC_VERSION v09_22_02

samweb allows you to select on a lot of parameters which are documented here:

This exercise will start you accessing data files that have been defined to the DUNE Data Catalog.

Creating a dataset

Check to see if a file is on tape or disk

No ONLINE_AND_NEARLINE means you need to prestage that file. Unfortunately, prestaging requires a definition.

The official ProtoDUNE dataset definitions are here.

Resource: Using the SAM Data Catalog.

What is UPS and why do we need it?

An important requirement for making valid physics results is computational reproducibility. You need to be able to repeat the same calculations on the data and MC and get the same answers every time. You may be asked to produce a slightly different version of a plot for example, and the data that goes into it has to be the same every time you run the program.

This requirement is in tension with a rapidly-developing software environment, where many collaborators are constantly improving software and adding new features. We therefore require strict version control; the workflows must be stable and not constantly changing due to updates.

DUNE must provide installed binaries and associated files for every version of the software that anyone could be using. Users must then specify which version they want to run before they run it. All software dependencies must be set up with consistent versions in order for the whole stack to run and run reproducibly.

The Unix Product Setup (UPS) is a tool to handle the software product setup operation.

UPS is set up when you set up DUNE:

 source /cvmfs/

Sourcing this script defines the UPS setup command. DUNE’s LArSoft-based software is then set up with:

 setup dunetpc $DUNETPC_VERSION -q e19:prof

dunetpc: product name
$DUNETPC_VERSION: version tag
e19:prof are “qualifiers”. Qualifiers are separated with colons and may be specified in any order. The “e19” qualifier refers to a specific version of the gcc compiler suite, and “prof” means select the installed product that has been compiled with optimizations turned on. An alternative to “prof” is the “debug” qualifier. All builds of LArSoft and dunetpc are compiled with debug symbols turned on, but the “debug” builds are made with optimizations turned off. Both kinds of software can be debugged, but it is easier to debug the debug builds (code executes in the proper order and variables aren’t optimized away so they can be inspected).
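
Because qualifiers may be given in any order, e19:prof and prof:e19 describe the same build. The sketch below shows one way to compare two qualifier strings in plain shell by sorting their colon-separated parts. normalize_quals is a hypothetical helper written for this illustration and is not part of UPS.

```bash
#!/bin/sh
# Put a colon-separated qualifier string into a canonical (sorted) order.
normalize_quals() {
    # Split on colons, sort the parts, then join them with colons again.
    echo "$1" | tr ':' '\n' | sort | paste -sd ':' -
}

a=$(normalize_quals "e19:prof")
b=$(normalize_quals "prof:e19")
if [ "$a" = "$b" ]; then
    echo "same qualifier set: $a"
fi
```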

Another specifier of a product install is the “flavor”. This refers to the operating system the program was compiled for. These days we only support SL7, but in the past we also supported SL6 and various versions of macOS. The flavor is automatically selected when you set up a product using setup (unless you override it, which is usually a bad idea). Some products are “unflavored” because they do not contain anything that depends on the operating system. Examples are products that only contain data files or text files.

Setting up a UPS product defines many environment variables. Most products have an environment variable of the form <productname>_DIR, where <productname> is the name of the UPS product in all capital letters. This is the top-level directory and can be used when searching for installed source code or fcl files for example. <productname>_FQ_DIR is the one that specifies a particular qualifier and flavor.
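
To illustrate the naming rule, the sketch below builds the variable name from a product name and looks it up in the environment. product_dir is a hypothetical helper, and both the product name myproduct and the path assigned to MYPRODUCT_DIR are made up so that the example runs without UPS.

```bash
#!/bin/sh
# Look up <PRODUCTNAME>_DIR for a given product name.
product_dir() {
    # Upper-case the product name and append _DIR,
    # e.g. myproduct -> MYPRODUCT_DIR.
    var="$(echo "$1" | tr '[:lower:]' '[:upper:]')_DIR"
    # printenv prints the value of an exported variable,
    # and returns non-zero if the variable is not set.
    printenv "$var"
}

# Stand-in for what a UPS setup would export (hypothetical path).
export MYPRODUCT_DIR="/tmp/products/myproduct/v0_0_0"
product_dir myproduct
```

After a real setup, the same lookup with an actual product name gives the top-level directory in which to search for source code or fcl files.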

Exercise 3

  • show all the versions of dunetpc that are currently available by using the “ups list -aK+ dunetpc” command
  • pick one version and substitute that for DUNETPC_VERSION above and set up dunetpc

Many products modify the following search path variables, prepending their pieces when set up. These search paths are needed by art jobs.

PATH: colon-separated list of directories the shell uses when searching for programs to execute when you type their names at the command line. The command “which” tells you which version of a program is found first in the PATH search list. Example:

which lar

will tell you where the lar command you would execute is if you were to type “lar” at the command prompt. The other paths are needed by art for finding plug-in libraries, fcl files, and other components, like gdml files.
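
The effect of prepending to PATH can be reproduced without UPS. The sketch below creates a throwaway directory holding a dummy program and reports which program the shell would pick if that directory were placed at the front of PATH. The helper resolve_with_prefix, the directory and the program name demo_tool are all invented for this illustration; command -v is the POSIX counterpart of which.

```bash
#!/bin/sh
# resolve_with_prefix DIR NAME
# Report which NAME the shell would run if DIR were prepended to PATH.
resolve_with_prefix() {
    # The subshell keeps the caller's PATH unchanged.
    ( PATH="$1:$PATH"; command -v "$2" )
}

demo_bin=$(mktemp -d)
printf '#!/bin/sh\necho "demo"\n' > "$demo_bin/demo_tool"
chmod +x "$demo_bin/demo_tool"

resolve_with_prefix "$demo_bin" demo_tool   # full path inside $demo_bin
echo "$PATH" | tr ':' '\n' | head -n 3      # first few search directories
rm -r "$demo_bin"
```

Printing PATH one directory per line, as in the second-to-last command, is a quick way to see what a setup has prepended.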

Also the PYTHONPATH describes where Python modules will be loaded from.

What is mrb and why do we need it?
Early on, the LArSoft team chose git and cmake as the software version manager and the build language, respectively, to keep up with industry standards and to take advantage of their new features. When we clone a git repository to a local copy and check out the code, we end up building it all. We would like LArSoft and DUNE code to be more modular, or at least the builds should reflect some of the inherent modularity of the code.

Ideally, we would like to only have to recompile a fraction of the software stack when we make a change. The granularity of the build in LArSoft and other art-based projects is the repository. So LArSoft and DUNE have divided code up into multiple repositories (DUNE ought to divide more than it has, but there are a few repositories already with different purposes). Sometimes one needs to modify code in multiple repositories at the same time for a particular project. This is where mrb comes in.

mrb stands for “multi-repository build”. mrb has features for cloning git repositories, setting up build and local products environments, building code, and checking for consistency (i.e. there are not two modules with the same name or two fcl files with the same name). mrb builds UPS products – when it installs the built code into the localProducts directory, it also makes the necessary UPS table files and .version directories. mrb also has a tool for making a tarball of a build product for distribution to the grid. The software build example later in this tutorial exercises some of the features of mrb.

Link to the mrb reference guide

What is CVMFS and why do we need it?
DUNE has a need to distribute precompiled code to many different computers that collaborators may use. Installed products are needed for four things:

  1. Running programs interactively
  2. Running programs on grid nodes
  3. Linking programs to installed libraries
  4. Inspection of source code and data files

Results must be reproducible, so identical code and associated files must be distributed everywhere. DUNE does not own any batch resources – we use CPU time on computers that participating institutions donate to the Open Science Grid. We are not allowed to install our software on these computers and must return them to their original state when our programs finish running so they are ready for the next job from another collaboration.

CVMFS is a perfect tool for distributing software and related files. It stands for CernVM File System (VM is Virtual Machine). Local caches are provided on each target computer, and files are accessed via the /cvmfs mount point. DUNE software is in the directory /cvmfs/, and LArSoft code is in /cvmfs/ These directories are auto-mounted and need to be visible when one executes ls /cvmfs for the first time. Some software is also in /cvmfs/

CVMFS also provides a de-duplication feature. If a given file is the same in all 100 releases of dunetpc, it is only cached and transmitted once, not independently for every release. So it considerably decreases the size of code that has to be transferred.
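
The idea behind de-duplication can be illustrated with ordinary tools: files with identical content have identical checksums, so only one copy per checksum needs to be kept. The following is a toy illustration using cksum on a scratch directory, not a description of how CVMFS is implemented; the release directories v1 and v2 and the file names are made up.

```bash
#!/bin/sh
# Count how many distinct file contents exist under a directory.
count_unique_contents() {
    # cksum prints "checksum size name"; keep checksum and size only,
    # then count the distinct pairs.
    find "$1" -type f -exec cksum {} + | awk '{print $1 " " $2}' | sort -u | wc -l | tr -d ' '
}

demo=$(mktemp -d)
mkdir "$demo/v1" "$demo/v2"
echo "shared geometry" > "$demo/v1/geom.txt"
echo "shared geometry" > "$demo/v2/geom.txt"   # identical in both releases
echo "changed in v2"   > "$demo/v2/new.txt"

echo "files present:     $(find "$demo" -type f | wc -l | tr -d ' ')"
echo "distinct contents: $(count_unique_contents "$demo")"
rm -r "$demo"
```

The gap between the two numbers grows with every release that leaves most files unchanged, which is where the saving comes from.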

When a file is accessed in /cvmfs, a daemon on the target computer wakes up and determines if the file is in the local cache, and delivers it if it is. If not, the daemon contacts the CVMFS repository server responsible for the directory, and fetches the file into local cache. In this sense, it works a lot like AFS. But it is a read-only filesystem on the target computers, and files must be published on special CVMFS publishing servers. Files may also be cached in a layer between the CVMFS host and the target node in a squid server, which helps facilities with many batch workers reduce the network load in fetching many copies of the same file, possibly over an international connection.

CVMFS also has a feature known as “Stashcache” or “xCache”. Files that are in /cvmfs/ are not actually transmitted in their entirety, only pointers to them are, and then they are fetched from one of several regional cache servers or in the case of DUNE from Fermilab dCache directly. DUNE uses this to distribute photon library files, for instance.

CVMFS is by its nature readable by all, so code stored there can be read by anyone in the world with a CVMFS client. CVMFS clients are available for download to desktops or laptops. Sensitive code cannot be stored in CVMFS.

More information on CVMFS is available here

Exercise 6

  • cd /cvmfs and do an ls at top level
  • What do you see? Do you see the four expected subdirectories?
  • cd

  • SAM and Rucio are data handling systems used by the DUNE collaboration to retrieve data.

  • Staging is a necessary step to make sure files are on disk in dCache (as opposed to only on tape).

  • XRootD allows users to stream data files.

  • The Unix Product Setup (UPS) is a tool to ensure consistency between different software versions and reproducibility.

  • The multi-repository build (mrb) tool allows code modification in multiple repositories, which is relevant for a large project like LArSoft with different cases (end user and developers) demanding consistency between the builds.

  • CVMFS distributes software and related files without installing them on the target computer (the VM in its name stands for Virtual Machine).

Submit a job

Note that job submission requires a FNAL account but can be done from a CERN machine, or any other machine with CVMFS access.

First, log in to a dunegpvm machine (should work from lxplus too with a minor extra step of getting a Fermilab Kerberos ticket on lxplus via kinit). Then you will need to set up the job submission tools (jobsub). If you set up dunetpc it will be included, but if not, you need to do

source /cvmfs/
setup jobsub_client

Having done that, let us submit a prepared script:

jobsub_submit -G dune -M -N 1 --memory=1000MB --disk=1GB --cpu=1 --expected-lifetime=1h --resource-provides=usage_model=DEDICATED,OPPORTUNISTIC,OFFSITE -l '+SingularityImage=\"/cvmfs/\"' --append_condor_requirements='(TARGET.HAS_CVMFS_dune_opensciencegrid_org==true&&TARGET.HAS_CVMFS_larsoft_opensciencegrid_org==true&&TARGET.CVMFS_dune_opensciencegrid_org_REVISION>=1105)' file:///dune/app/users/kherner/

If all goes well you should see something like this:

Submitting job(s).
1 job(s) submitted to cluster 40351757.
JobsubJobId of first job:
Use job id to retrieve output


  1. What is your job ID?

Now, let’s look at some of these options in more detail.

Job Output

This particular test writes a file to /pnfs/dune/scratch/users/<username>/job_output_<id number>.log. Verify that the file exists and is non-zero size after the job completes. You can delete it after that; it just prints out some information about the environment.
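
The check for a non-empty output file can be scripted with the shell's -s test. This is a minimal sketch; check_output is a hypothetical helper, and the path in the usage comment has to be completed with your own username and job ID.

```bash
#!/bin/sh
# Report whether a job output file exists and contains data.
check_output() {
    # -s is true only if the file exists and its size is greater than zero.
    if [ -s "$1" ]; then
        echo "OK: $1 exists and is non-empty"
    else
        echo "MISSING or EMPTY: $1"
        return 1
    fi
}

# Usage on an interactive node (fill in the placeholders first):
#   check_output /pnfs/dune/scratch/users/<username>/job_output_<id number>.log
```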

More information about jobsub is available here and here.

Submit a job using the tarball containing custom code (left as an exercise)

First off, a very important point: for running analysis jobs, you may not actually need to pass an input tarball, especially if you are just using code from the base release and you don’t actually modify any of it. All you need to do is set up any required software from CVMFS (e.g. dunetpc and/or protoduneana), and you are ready to go. If you’re just modifying a fcl file, for example, but no code, it’s actually more efficient to copy just the fcl(s) you’re changing to the scratch directory within the job, and edit them as part of your job script (copies of a fcl file in the current working directory have priority over others by default).
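
The copy-and-edit approach for a fcl file might look like the following inside a job script. The helper override_fcl_param, the file name myjob.fcl and the parameter max_events are all invented for this sketch, and the sed expression is only one way to make such an edit.

```bash
#!/bin/sh
# override_fcl_param SRC KEY VALUE
# Writes an edited copy of SRC into the current directory, with the line
# "KEY: ..." replaced by "KEY: VALUE". SRC must not already be in the
# current directory, or it would be overwritten while being read.
override_fcl_param() {
    src=$1; key=$2; value=$3
    sed "s|^\([[:space:]]*${key}:\).*|\1 ${value}|" "$src" > "./$(basename "$src")"
}

# Demo with a made-up fcl fragment in scratch directories.
release_dir=$(mktemp -d)
work_dir=$(mktemp -d)
printf 'max_events: 10\n' > "$release_dir/myjob.fcl"
( cd "$work_dir" && override_fcl_param "$release_dir/myjob.fcl" max_events 5 && cat myjob.fcl )
rm -r "$release_dir" "$work_dir"
```

The edited copy ends up in the working directory, which is where the job looks first, so the release copy never has to be touched.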

Sometimes, though, we need to run some custom code that isn’t in a release. We need a way to efficiently get code into jobs without overwhelming our data transfer systems. We have to make a few minor changes to the scripts you made in the previous tutorial section, generate a tarball, and invoke the proper jobsub options to get that into your job. There are many ways of doing this but by far the best is to use the Rapid Code Distribution Service (RCDS), as shown in our example.

If you have finished up the LArSoft follow-up and want to use your own code for this next attempt, feel free to tar it up (you don’t need anything besides the localProducts* and work directories) and use your own tarball in lieu of the one in this example. You will have to change the last line to use your own submit file instead of the pre-made one.

First, we should make a tarball. Here is what we can do (assuming you are starting from /dune/app/users/username/):

cp /dune/app/users/kherner/ /dune/app/users/username/
cp /dune/app/users/kherner/dec2021tutorial/localProducts_larsoft__e19_prof/setup-grid /dune/app/users/username/dec2021tutorial/localProducts_larsoft__e19_prof/setup-grid

Before we continue, let’s examine these files a bit. We will source the first one in our job script, and it will set up the environment for us.


# we cannot rely on "whoami" in a grid job. We have no idea what the local username will be.
# Use the GRID_USER environment variable instead (set automatically by jobsub). 

source /cvmfs/
export WORKDIR=${_CONDOR_JOB_IWD} # if we use the RCDS then our tarball will be placed in $INPUT_TAR_DIR_LOCAL.
if [ ! -d "$WORKDIR" ]; then
  export WORKDIR=`echo .`
fi

source ${INPUT_TAR_DIR_LOCAL}/${DIRECTORY}/localProducts*/setup-grid 

Now let’s look at the difference between the setup-grid script and the plain setup script. Assuming you are currently in the /dune/app/users/username directory:

diff may2021tutorial/localProducts_larsoft__e19_prof/setup may2021tutorial/localProducts_larsoft__e19_prof/setup-grid
< setenv MRB_TOP "/dune/app/users/<username>/may2021tutorial"
< setenv MRB_TOP_BUILD "/dune/app/users/<username>/may2021tutorial"
< setenv MRB_SOURCE "/dune/app/users/<username>/may2021tutorial/srcs"
< setenv MRB_INSTALL "/dune/app/users/<username>/may2021tutorial/localProducts_larsoft__e19_prof"
> setenv MRB_TOP "${INPUT_TAR_DIR_LOCAL}/may2021tutorial"
> setenv MRB_TOP_BUILD "${INPUT_TAR_DIR_LOCAL}/may2021tutorial"
> setenv MRB_SOURCE "${INPUT_TAR_DIR_LOCAL}/may2021tutorial/srcs"
> setenv MRB_INSTALL "${INPUT_TAR_DIR_LOCAL}/may2021tutorial/localProducts_larsoft__e19_prof"

As you can see, we have switched from the hard-coded directories to directories defined by environment variables; the INPUT_TAR_DIR_LOCAL variable will be set for us (see below). Now, let’s actually create our tar file. Again assuming you are in /dune/app/users/kherner/may2021tutorial/:

tar --exclude '.git' -czf may2021tutorial.tar.gz may2021tutorial/localProducts_larsoft__e19_prof may2021tutorial/work
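
Before submitting, it can be worth listing the archive to confirm that nothing unwanted slipped in. The sketch below builds a tiny tarball from a scratch directory with the same --exclude option and then inspects it with tar -tzf. The helper tarball_has_git and the directory and file names are made up for the demonstration.

```bash
#!/bin/sh
# Succeeds if the gzipped tarball contains any .git path component.
tarball_has_git() {
    tar -tzf "$1" | grep -Eq '(^|/)\.git(/|$)'
}

demo=$(mktemp -d)
mkdir -p "$demo/proj/work" "$demo/proj/.git"
echo "data" > "$demo/proj/work/job.fcl"
echo "ref"  > "$demo/proj/.git/HEAD"

# -C changes into the scratch directory before archiving "proj".
tar -C "$demo" --exclude '.git' -czf "$demo/proj.tar.gz" proj
if tarball_has_git "$demo/proj.tar.gz"; then
    echo "warning: .git directories are in the tarball"
else
    echo "tarball is clean"
fi
rm -r "$demo"
```

The same listing command, tar -tzf on your own tarball, also shows at a glance whether log files or temporary files have been included.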

Then submit another job (in the following we keep the same submit file as above):

jobsub_submit -G dune -M -N 1 --memory=1800MB --disk=2GB --expected-lifetime=3h --cpu=1 --resource-provides=usage_model=DEDICATED,OPPORTUNISTIC,OFFSITE --tar_file_name=dropbox:///dune/app/users/<username>/dec2021tutorial.tar.gz --use-cvmfs-dropbox -l '+SingularityImage=\"/cvmfs/\"' --append_condor_requirements='(TARGET.HAS_Singularity==true&&TARGET.HAS_CVMFS_fifeuser4_opensciencegrid_org==true)' file:///dune/app/users/kherner/

You’ll see this is very similar to the previous case, but there are some new options:

Now, there’s a very small gotcha when using the RCDS, and that is when your job runs, the files in the unzipped tarball are actually placed in your work area as symlinks from the CVMFS version of the file (which is what you want since the whole point is not to have N different copies of everything). The catch is that if your job script expected to be able to edit one or more of those files within the job, it won’t work because the link is to a read-only area. Fortunately there’s a very simple trick you can do in your script before trying to edit any such files:

cp ${INPUT_TAR_DIR_LOCAL}/file_I_want_to_edit mytmpfile  # do a cp, not mv
rm ${INPUT_TAR_DIR_LOCAL}/file_I_want_to_edit # This really just removes the link
mv mytmpfile file_I_want_to_edit # now it's available as an editable regular file.

You certainly don’t want to do this for every file, but for a handful of small text files this is perfectly acceptable and the overall benefits of copying in code via the RCDS far outweigh this small cost. This can get a little complicated when trying to do it for things several directories down, so it’s easiest to have such files in the top level of your tar file.
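
The trick can be tried out safely with an ordinary symlink standing in for the RCDS link. In this sketch the read-only CVMFS copy is imitated by a file with write permission removed; the helper make_editable and all file names are invented for the illustration.

```bash
#!/bin/sh
# Replace a symlink with a private, writable copy of its target.
make_editable() {
    cp "$1" "$1.tmp"      # cp follows the link and copies the content
    rm -f "$1"            # removes only the link, not the target
    mv "$1.tmp" "$1"
    chmod u+w "$1"        # the copy inherits the read-only mode
}

demo=$(mktemp -d)
echo "original" > "$demo/readonly_source.fcl"
chmod a-w "$demo/readonly_source.fcl"
ln -s "$demo/readonly_source.fcl" "$demo/job.fcl"

make_editable "$demo/job.fcl"
echo "edited in job" >> "$demo/job.fcl"    # appending now works
if [ ! -L "$demo/job.fcl" ]; then
    echo "job.fcl is now a regular file"
fi
cat "$demo/readonly_source.fcl"            # the source is untouched
rm -rf "$demo"
```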

View the stdout/stderr of our jobs

Here’s the link for the history page of the example job: link.

Feel free to sub in the link for your own jobs.

Once there, click “View Sandbox files (job logs)”. In general you want the .out and .err files for stdout and stderr. The .cmd file can sometimes be useful to see exactly what got passed in to your job.

Kibana can also provide a lot of information.

You can also download the job logs from the command line with jobsub_fetchlog:

jobsub_fetchlog --unzipdir=some_appropriately_named_directory

That will download them as a tarball and unzip it into the directory specified by the --unzipdir option. Of course, replace the job ID with your own.


Download the log of your last submission via jobsub_fetchlog or look it up on the monitoring pages. Then answer the following questions (all should be available in the .out or .err files):

  1. On what site did your job run?
  2. How much memory did it use?
  3. Did it exit abnormally? If so, what was the exit code?

(Time permitting) submit with POMS

POMS is the recommended way of submitting large workflows. It offers several advantages over other systems.

At its core, in POMS one makes a “campaign”, which has one or more “stages”. In our example there is only a single stage.

For analysis use: main POMS page
An example campaign.

Typical POMS use centers around a configuration file (often more like a template which can be reused for many campaigns) and various campaign-specific settings for overriding the defaults in the config file. An example config file designed to do more or less what we did in the previous submission is here: /dune/app/users/kherner/may2021tutorial/work/pomsdemo.cfg

You can find more about POMS here: POMS User Documentation
Helpful ideas for structuring your config files are here: Fife launch Reference

When you start using POMS you must upload an x509 proxy to the server before submitting. The best way to do that is to set up the poms_client UPS product and then use the upload_file command after you have generated your proxy:

voms-proxy-init -rfc -noregen -voms dune:/dune/Role=Analysis -valid 120:00
upload_file --experiment dune --proxy

Finally, here is an example of a campaign that does the same thing as the previous one, using our usual MC reco file from Prod2, but does it via making a SAM dataset using that as the input: POMS campaign stage information. Of course, before running any SAM project, we should prestage our input definition(s):

samweb prestage-dataset kherner-may2021tutorial-mc

replacing the above definition with your own definition as appropriate.

If you are used to using other programs for your work, there is a helpful tool called Project-py that you can use to convert existing xml into POMS configs, so you don’t need to start from scratch! Then you can just switch to using POMS from that point forward.

  • When in doubt, ask! Understand that policies and procedures that seem annoying, overly complicated, or unnecessary (especially when compared to running an interactive test) are there to ensure efficient operation and scalability. They are also often the result of someone breaking something in the past, or of simpler approaches not scaling well.

  • Send test jobs after creating new workflows or making changes to existing ones. If things don’t work, don’t blindly resubmit and expect things to magically work the next time.

  • Only copy what you need in input tar files. In particular, avoid copying log files, .git directories, temporary files, etc. from interactive areas.

  • Take care to follow best practices when setting up input and output file locations.

  • Always, always, always prestage input datasets. No exceptions.

Three Days of Training Collapsed into One Half Day

The instruction in this half day workshop is provided by several experienced physicists and is based on years of collaborative experience.

The secure access to Fermilab computing systems and a familiarity with data storage are key components.

Data management and event processing tools were described and modeled.

Protocols for efficient job submission and monitoring have been demonstrated.

We are thankful for the instructors’ hard work, and for the numerous participants who joined.

