This lesson is being piloted (Beta version)

LArSoft Basics for DUNE - 2025 edition

Online Tutorial Welcome and Introduction

Overview

Teaching: 5 min
Exercises: 0 min
Questions
  • What should I expect when participating in this tutorial?

Objectives
  • Introduce instructors and mentors.

  • Provide an overview of the modules.

  • Spotlight the helpful support network provided by the Slack channel.

DUNE Computing Consortium

The DUNE Computing Consortium works to establish a global computing network that can handle the massive data streams produced by DUNE by distributing them across the computing grid. Selected consortium members coordinate DUNE computing activities and train new members to acquaint them with DUNE-specific software and resources.

DUNE Computing Consortium Coordinator: Michael Kirby (Brookhaven National Laboratory)

This is a short 3-hour version of the basics. We will be adding and offering additional tutorials. An important one coming soon is:

The LArSoft tutorial at CERN, February 3-7, 2025 (password on the tutorials page)

Also check out the longer list of DUNE computing tutorials (collaborators only)

Workshop Introduction Video from December 2024

Basic setup reminder

You should have gone through the setup sequence

As a reminder you need to choose between running on sl7 in a container or al9. You do NOT want to mix them.

You also need to be starting in a clean terminal session. We recommend not having a .profile or .login at all and deliberately creating setup scripts that you source whenever you start using DUNE code.

source mysetup7.sh

Here are some example scripts that do most of the setups explained in this tutorial. You need to store these in your home area, source them every time you log in, and possibly update them as code versions evolve.
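
As an illustration, a minimal personal setup script (here called mysetup7.sh, to match the command above) might look like the sketch below. It uses the dunesw version set up later in this tutorial; treat it as a starting point and update the version and qualifier as releases evolve.

# mysetup7.sh -- source this after starting the SL7 Apptainer
source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh
export DUNELAR_VERSION=v10_07_00d00
export DUNELAR_QUALIFIER=e26:prof
setup dunesw $DUNELAR_VERSION -q $DUNELAR_QUALIFIER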

If you run into problems, check out the Common Error Messages page and the FAQ page

If that doesn’t help, use Slack to ask us about the problem - there is always a new one cropping up.

Instructional Crew

Organizers:

Module Authors (in order of appearance in the schedule):

Mentors

Support

You must be on the DUNE Collaboration member list and have a valid FNAL or CERN account. See the old Indico Requirement page for more information. Windows users are invited to review the Windows Setup page.

You should join the DUNE Slack instance and look in #computing-training-basics for help with this tutorial.

Go to https://atwork.dunescience.org/tools/, scroll down to Slack, and request an invite. Please do not do this if you are already in DUNE Slack.

The livedoc is here livedoc

Key Points

  • This tutorial is brought to you by the DUNE Computing Consortium.

  • The goal is to give you the computing basics needed to work on DUNE.


Introduction to art and LArSoft (2025 - Apptainer version)

Overview

Teaching: 50 min
Exercises: 0 min
Questions
  • Why do we need a complicated software framework? Can’t I just write standalone code?

Objectives
  • Learn what services the art framework provides.

  • Learn how the LArSoft toolkit is organized and how to use it.

Session Video

The session on December 10, 2024 was captured on video for your asynchronous review.

https://indico.cern.ch/event/1461779/overview

This page is protected by a password. Dom Brailsford sent this password in an e-mail to the DUNE Collaboration on November 6, 2024.

Introduction to art

Art is the framework used for the offline software that processes LArTPC data from the far detector and the ProtoDUNEs. It was chosen not only because of the features it provides, but also because it allows DUNE to use and share algorithms developed for other LArTPC experiments, such as ArgoNeuT, LArIAT, MicroBooNE and ICARUS. The section below describes LArSoft, a shared software toolkit. Art is also used by the NOvA and mu2e experiments. The primary language for art and experiment-specific plug-ins is C++.

The art wiki page is here: https://cdcvs.fnal.gov/redmine/projects/art/wiki. It contains important information on command-line utilities, how to configure an art job, how to define, read in and write out data products, how and when to use art modules, services, and tools.

Art features:

  1. Defines the event loop
  2. Manages event data storage memory and prevents unintended overwrites
  3. Input file interface – allows ganging together input files
  4. Schedules module execution
  5. Defines a standard way to store data products in art-formatted ROOT files
  6. Defines a format for associations between data products (for example, tracks have hits, and associations between tracks and hits can be made via art’s association mechanism).
  7. Provides a uniform job configuration interface
  8. Stores job configuration information in art-formatted ROOT files.
  9. Output file control – lets you define output filenames based on parts of the input filename.
  10. Message handling
  11. Random number control
  12. Exception handling

The configuration storage is particularly useful if you receive a data file from a colleague, or find one in a data repository and you want to know more about how it was produced, with what settings.

Getting set up to try the tools - use SL7 for now!

Log in to a dunegpvm*.fnal.gov or lxplus.cern.ch machine and set up your environment (This script is defined in Exercise 5 of https://dune.github.io/computing-basics/setup.html)

Note

For now do this in the Apptainer. Because the container has to be set up differently at CERN, on the build nodes, and on the gpvms (the /pnfs mounts differ), and because you want to keep your environment clean for use on other experiments, it is best to define aliases in your .profile, .bashrc, or whichever login script you use. A set of convenient aliases is

alias dunesl7="/cvmfs/oasis.opensciencegrid.org/mis/apptainer/current/bin/apptainer shell --shell=/bin/bash -B /cvmfs,/exp,/nashome,/pnfs/dune,/opt,/run/user,/etc/hostname,/etc/hosts,/etc/krb5.conf --ipc --pid /cvmfs/singularity.opensciencegrid.org/fermilab/fnal-dev-sl7:latest"

alias dunesl7build="/cvmfs/oasis.opensciencegrid.org/mis/apptainer/current/bin/apptainer shell --shell=/bin/bash -B /cvmfs,/exp,/build,/nashome,/opt,/run/user,/etc/hostname,/etc/hosts,/etc/krb5.conf --ipc --pid /cvmfs/singularity.opensciencegrid.org/fermilab/fnal-dev-sl7:latest"

alias dunesl7CERN="/cvmfs/oasis.opensciencegrid.org/mis/apptainer/current/bin/apptainer shell --shell=/bin/bash -B /cvmfs,/afs,/opt,/run/user,/etc/hostname --ipc --pid /cvmfs/singularity.opensciencegrid.org/fermilab/fnal-dev-sl7:latest"

alias dunesetups="source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh"

Then you can use the appropriate alias to start the SL7 container on either the build node or the gpvms or lxplus. Starting a container gives you a very bare environment – it does not source your .profile for you; you have to do that yourself. The examples below assume you put the aliases above in your .profile or in a script sourced by your .profile. I always set the prompt variable PS1 in my profile so I can tell that I’ve sourced it.

PS1="<`hostname`> "; export PS1

Then when you log in, you can type these commands to set up your environment in a container:

dunesl7
source .profile
dunesetups

export DUNELAR_VERSION=v10_07_00d00
export DUNELAR_QUALIFIER=e26:prof
setup dunesw $DUNELAR_VERSION -q $DUNELAR_QUALIFIER

setup_fnal_security
# define a sample file
export SAMPLE_FILE=root://fndca1.fnal.gov:1094//pnfs/fnal.gov/usr/dune/persistent/users/schellma/tutorial_2025/NNBarAtm_hA_BR_dune10kt_1x2x6_54053565_607_20220331T192335Z_gen_g4_detsim_reco_65751406_0_20230125T150414Z_reReco.root

The examples below will refer to files in dCache at Fermilab which can best be accessed via xrootd.

For those with no access to Fermilab computing resources but with a CERN account:
Copies are stored in /afs/cern.ch/work/t/tjunk/public/jan2023tutorialfiles/.

The follow-up of this tutorial provides help on how to find data and MC files in storage.

You can list available versions of dunesw installed in CVMFS with this command:

ups list -aK+ dunesw

The output is not sorted, although portions of it may look sorted. Do not depend on it being sorted. The string indicating the version is called the version tag (for example, v10_07_00d00, the version set up above). The qualifiers are e26 and prof. Qualifiers can be entered in any order and are separated by colons. “e26” corresponds to a specific version of the GNU compiler (g++ v12.1.0). We also compile with clang – the compiler qualifier for that is “c7”.

“prof” means “compiled with optimizations turned on.” “debug” means “compiled with optimizations turned off”. More information on qualifiers is here.

In addition to the version and qualifiers, UPS products have “flavors”. This refers to the operating system type and version. Older versions of DUNE software supported SL6 and some versions of macOS. Currently only SL7 and the compatible CentOS 7 are supported. The flavor of a product is automatically selected to match your current operating system when you set up a product. If a product does not have a compatible flavor, you will get an error message. “Unflavored” products are ones that do not depend on the operating-system libraries. They are listed with a flavor of “NULL”.

There is a setup command provided by the operating system – you usually don’t want to use it (at least not when developing DUNE software). If you haven’t yet sourced the setup_dune.sh script in CVMFS above but type setup xyz anyway, you will get the system setup command, which will ask you for the root password. Just control-C out of it, source the setup_dune.sh script, and try again. On AL9 and the SL7 container, there is no system setup command so you will get “command not found” if you haven’t yet set up UPS.

UPS’s setup command (find out where it lives with this command):

type setup

will not only set up the product you specify (in the instructions above, dunesw), but also all dependent products with corresponding versions so that you get a consistent software environment. You can get a list of everything that’s set up with this command

 ups active

It is often useful to pipe the output through grep to find a particular product.

 ups active | grep geant4

for example, to see what version of geant4 you have set up.

To learn more about UPS, there is more documentation here.

Art command-line tools

All of these command-line tools have online help. Invoke the help feature with the --help command-line option. Example:

config_dumper --help

Documentation on art command-line tools is available on the art wiki page.

config_dumper

Configuration information for a file can be printed with config_dumper.

config_dumper -P <artrootfile>

Try it out:

config_dumper -P $SAMPLE_FILE

The output is an executable fcl file, sent to stdout. We recommend redirecting the output to a file that you can look at in a text editor:

Try it out:

config_dumper -P $SAMPLE_FILE > tmp.fcl

Your shell may be configured with noclobber, meaning that if you already have a file called tmp.fcl, the shell will refuse to overwrite it. Just rm tmp.fcl and try again.

The -P option to config_dumper is needed to tell config_dumper to print out all processing configuration fcl parameters. The default behavior of config_dumper prints out only a subset of the configuration parameters, and is most notably missing art services configuration.

Quiz

Quiz questions from the output of the above run of config_dumper:

  1. What generators were used? What physics processes are simulated in this file?
  2. What geometry is used? (hint: look for “GDML” or “gdml”)
  3. What electron lifetime was assumed?
  4. What is the readout window size?

fhicl-dump

You can parse a FCL file with fhicl-dump.

Try it out:

fhicl-dump protoDUNE_refactored_g4_stage2.fcl

See the section below on FCL files for more information on what you’re looking at.

count_events

Try it out:

count_events $SAMPLE_FILE

product_sizes_dumper

You can get a peek at what’s inside an artROOT file with product_sizes_dumper.

Try it out:

product_sizes_dumper -f 0 $SAMPLE_FILE

It is also useful to redirect the output of this command to a file so you can look at it with a text editor and search for items of interest. This command lists the sizes of the TBranches in the Events TTree in the artROOT file. There is one TBranch per data product. The TBranch name is the data product name with an “s” appended (even if the plural of the data product name doesn’t make sense with just an “s” on the end), then an underscore, the module label that made the data product, an underscore, the instance name, an underscore, the process name, and a period.

Quiz questions, looking at the output from above.

Quiz

Questions:

  1. What is the name of the data product that takes up the most space in the file?
  2. What is the module label for this data product?
  3. What is the module instance name for this data product? (This question is tricky. You have to count underscores here).
  4. How many different modules produced simb::MCTruth data products? What are their module labels?
  5. How many different modules produced recob::Hit data products? What are their module labels?

You can open up an artROOT file with ROOT and browse the TTrees in it with a TBrowser. Not all TBranches and leaves can be inspected easily this way, but enough can that it can save a lot of time programming if you just want to know something simple about a file such as whether it contains a particular data product and how many there are.

Try it out

root $SAMPLE_FILE

then at the root prompt, type:

new TBrowser

This will be faster with VNC. Navigate to the Events TTree in the file that is automatically opened, navigate to the TBranch with the Argon 39 MCTruths (it’s near the bottom), click on the branch icon simb::MCTruths_ar39__SinglesGen.obj, and click on the NParticles() leaf (It’s near the bottom. Yes, it has a red exclamation point on it, but go ahead and click on it). How many events are there? How many 39Ar decays are there per event on average?

Header files for many data products are in lardataobj and some are in nusimdata.

Art is not constrained to using ROOT files – we use HDF5-formatted files for some purposes. ROOT has nice browsing features for inspecting ROOT-formatted files; some HDF5 data visualization tools exist, but they assume that data are in particular formats. ROOT has the ability to display more general kinds of data (C++ classes), but it needs dictionaries for some of the more complicated ones.

The art main executable program is a very short stub that interprets command-line options, reads in the configuration document (a FHiCL file which usually includes other FHiCL files), loads shared libraries, initializes software components, and schedules execution of modules. Most code we are interested in is in the form of art plug-ins – modules, services, and tools. The generic executable for invoking art is called art, but a LArSoft-customized one is called lar. No additional customization has yet been applied, so in fact the lar executable has identical functionality to the art executable.

There is online help:

 lar --help

All programs in the art suite have a --help command-line option.

Most art job invocations take the form

lar -n <nevents> -c fclfile.fcl artrootfile.root

where the input file specification is just on the command line without a command-line option. Explicit examples follow below. The -n <nevents> is optional – it specifies the number of events to process. If omitted, or if <nevents> is bigger than the number of events in the input file, the job processes all of the events in the input file. -n <nevents> is important for the generator stage. There’s also a handy --nskip <nevents_to_skip> argument if you’d like the job to start processing partway through the input file. You can steer the output with

lar -c fclfile.fcl artrootfile.root -o outputartrootfile.root -T outputhistofile.root

The outputhistofile.root file contains ROOT objects that have been declared with the TFileService service in user-supplied art plug-in code (i.e. your code).

Job configuration with FHiCL

The Fermilab Hierarchical Configuration Language, FHiCL, is described here: https://cdcvs.fnal.gov/redmine/documents/327.

FHiCL is not a Turing-complete language: you cannot write an executable program in it. It is meant to declare values for named parameters to steer job execution and adjust algorithm parameters (such as the electron lifetime in the simulation and reconstruction). Look at .fcl files in installed job directories, like $DUNESW_DIR/fcl, for examples. Fcl files are sought in the directory search path FHICL_FILE_PATH when art starts up and when #include statements are processed. A fully-expanded fcl file with all the #include statements executed is referred to as a fhicl “document”.

Parameters may be defined more than once. The last instance of a parameter definition wins out over previous ones. This makes for a common idiom in changing one or two parameters in a fhicl document. The generic pattern for making a short fcl file that modifies a parameter is:

#include "fcl_file_that_does_almost_what_I_want.fcl"
block.subblock.parameter: new_value

To see what block and subblock a parameter is in, use fhicl-dump on the parent fcl file and look for the curly brackets. You can also use

lar -c fclfile.fcl --debug-config tmp.txt --annotate

which is equivalent to running fhicl-dump with the --annotate option and redirecting the output to tmp.txt.

Entire blocks of parameters can be substituted in using @local and @table idioms. See the examples and documentation for guidance on how to use these. Generally they are defined in the PROLOG sections of fcl files. PROLOGs must precede all non-PROLOG definitions and if their symbols are not subsequently used they do not get put in the final job configuration document (that gets stored with the data and thus may bloat it). This is useful if there are many alternate configurations for some module and only one is chosen at a time.
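
As a minimal illustration of the PROLOG/@local/@table pattern (the parameter-set and module names here are made up; the real ones live in the installed fcl files):

BEGIN_PROLOG
standard_myreco:                 # hypothetical parameter set defined in a PROLOG
{
  module_type: "MyRecoModule"
  SomeCut:     5.0
}
END_PROLOG

physics.producers.myreco: @local::standard_myreco   # copy the whole block in
physics.producers.myreco.SomeCut: 7.5                # then override one value
services: {
  @table::common_services   # splice in the contents of another (hypothetical) table
}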

Try it out:

fhicl-dump protoDUNE_refactored_g4_stage2.fcl > tmp.txt

Look for the parameter ModBoxA. It is one of the Modified Box Model ionization parameters. See what block it is in. Here are the contents of a modified g4 stage 2 fcl file that modifies just that parameter:

#include "protoDUNE_refactored_g4_stage2.fcl"
services.LArG4Parameters.ModBoxA: 7.7E-1

Exercise

Do a similar thing – modify the stage 2 g4 fcl configuration to change the drift field from 486.7 V/cm to 500 V/cm. Hint – you will find the drift field in an array of fields which also has the fields between wire planes listed.
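
If you get stuck, one possible solution is sketched below. It assumes the drift field lives in the DetectorPropertiesService Efield array (values in kV/cm, with the first entry being the main drift field); confirm the block name and the values of the other entries (the fields between wire planes) in your own fhicl-dump output before relying on this.

#include "protoDUNE_refactored_g4_stage2.fcl"
# first entry: drift field changed from 0.4867 to 0.5 kV/cm; keep the other
# entries at whatever your fhicl-dump output shows
services.DetectorPropertiesService.Efield: [0.5, 0.666, 0.8]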

Types of Plug-Ins

Plug-ins each have their own .so library which gets dynamically loaded by art when referenced by name in the fcl configuration.

Producer Modules
A producer module is a software component that writes data products to the event memory. It is characterized by produces<> and consumes<> statements in the class constructor, and art::Event::put() calls in the produce() method. A producer must produce the data product collection it says it produces, even if it is empty, or art will throw an exception at runtime. art::Event::put() transfers ownership of memory (use std::move so as not to copy the data) from the module to the art event memory. Data in the art event memory will be written to the output file unless output commands in the fcl file tell art not to do that. Documentation on output commands can be found in the LArSoft wiki here. Producer modules have methods that are called on begin job, begin run, begin subrun, and on each event, as well as at the end of processing, so you can initialize counters or histograms, and finish up summaries at the end. Source code must be in files of the form: modulename_module.cc, where modulename does not have any underscores in it.
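
As a minimal sketch of the produce() pattern just described (the class name MyProducer and the choice of recob::Hit are illustrative, not taken from any real module):

#include "art/Framework/Principal/Event.h"
#include "lardataobj/RecoBase/Hit.h"
#include <memory>
#include <utility>
#include <vector>

void MyProducer::produce(art::Event& e)
{
  auto hits = std::make_unique<std::vector<recob::Hit>>();
  // ... fill *hits; leave it empty if nothing was found, but always put it ...
  e.put(std::move(hits));   // transfer ownership to the art event memory
}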

Analyzer Modules
Analyzer modules read data products from the event memory and produce histograms or TTrees, or other output. They are typically scheduled after the producer modules have been run. Analyzer modules have methods that are called on begin job, begin run, begin subrun, and on each event, as well as at the end of processing, so you can initialize counters or histograms, and finish up summaries at the end. Source code must be in files of the form: modulename_module.cc, where modulename does not have any underscores in it.

Source Modules
Source modules read data from input files and reformat it as needed in order to put the data into the art event data store. Most jobs use the art-provided RootInput source module, which reads in art-formatted ROOT files. RootInput interacts well with the rest of the framework in that it provides lazy reading of TTree branches. When using the RootInput source, data are not actually fetched from the file into memory when the source executes, but only when getHandle or getValidHandle or other product get methods are called. This is useful for art jobs that only read a subset of the TBranches in an input file. Code for sources must be in files of the form: modulename_source.cc, where modulename does not have any underscores in it. Monte Carlo generator jobs use the input source called EmptyEvent.

Services
These are singleton classes that are globally visible within an art job. They can be FHiCL configured like modules, and they can schedule methods to be called on begin job, begin run, begin event, etc. They are meant to help supply configuration parameters like the drift velocity, or more complicated things like geometry functions, to modules that need them. Please do not use services as a back door for storing event data outside of the art event store. Source code must be in files of the form: servicename_service.cc, where servicename does not have any underscores in it.

Tools
Tools are FHiCL-configurable software components that, unlike services, are not singletons. They are meant to be swappable via FHiCL parameters, which tell art which .so libraries to load, configure, and call from user code. See the Art Wiki Page for more information on tools and other plug-ins.

You can use cetskelgen to make empty skeletons of art plug-ins. See the art wiki for documentation, or use

cetskelgen --help

for instructions on how to invoke it.

Ordering of Plug-in Execution

The constructors for each plug-in are called at job-start time, after the shared object libraries are loaded by the image activator, their names having been discovered from the fcl configuration. Producer, analyzer and service plug-ins have BeginJob, BeginRun, BeginSubRun, EndSubRun, EndRun, EndJob methods where they can do things like book histograms, write out summary information, or clean up memory.

When processing data, the input source always gets executed first, and it defines the run, subrun and event number of the trigger record being processed. The producers and filters in trigger_paths then get executed for each event. The analyzers and filters in end_paths then get executed. Analyzers cannot be added to trigger_paths, and producers cannot be added to end_paths. This ordering ensures that data products are all produced by the time they are needed to be analyzed. But it also forces high memory usage for the same reason.

Services and tools are visible to other plug-ins at any stage of processing. They are loaded dynamically from names in the fcl configurations, so a common error is to use in code a service that hasn’t been mentioned in the job configuration. You will get an error asking you to configure the service, even if it is just an empty configuration with the service name and no parameters set.

Non-Plug-In Code

You are welcome to write standard C++ code – classes and C-style functions are no problem. In fact, to enhance the portability of code, the art team encourages the separation of algorithm code into non-framework-specific source files, and to call these functions or class methods from the art plug-ins. Typically, source files for standalone algorithm code have the extension .cxx while art plug-ins have .cc extensions. Most directories have a CMakeLists.txt file which has instructions for building the plug-ins, each of which is built into a .so library, and all other code gets built and put in a separate .so library.

Retrieving Data Products

In a producer or analyzer module, data products can be retrieved from the art event store with getHandle() or getValidHandle() calls, or more rarely getManyByType or other calls. The arguments to these calls specify the module label and the instance of the data product. A typical TBranch name in the Events tree in an artROOT file is

simb::MCParticles_largeant__G4Stage1.

Here, simb::MCParticle is the name of the class that defines the data product. The “s” after the data product name is added by art – you have no choice in this even if the plural of your noun ought not to just add an “s”. The underscore separates the data product name from the module label, “largeant”. Another underscore separates the module label and the instance name, which in this example is the empty string – there are two underscores together there. The last string is the process name, which usually does not need to be specified in data product retrieval. You can find the TBranch names by browsing an artroot file with ROOT and using a TBrowser, or by using product_sizes_dumper -f 0.
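
As a minimal sketch (not taken from the tutorial code itself), here is how an analyzer module might retrieve the collection corresponding to the TBranch named above, i.e. the simb::MCParticle product made by the module labeled "largeant" (the class name MyAnalyzer is illustrative):

#include "art/Framework/Principal/Event.h"
#include "nusimdata/SimulationBase/MCParticle.h"
#include <vector>

void MyAnalyzer::analyze(art::Event const& e)
{
  // module label "largeant", empty instance name
  auto particles = e.getValidHandle<std::vector<simb::MCParticle>>("largeant");
  for (auto const& part : *particles) {
    // use part.PdgCode(), part.E(), part.TrackId(), ...
  }
}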

Art documentation

There is a mailing list – art-users@fnal.gov where users can ask questions and get help.

There is a workbook for art available at https://art.fnal.gov/art-workbook/ – look for the “versions” link in the menu on the left for the actual document. It is a few years old and is missing some pieces, like how to write a producer module, but it does answer some questions. I recommend keeping a copy of it on your computer and using it to search for answers.

There was an art/LArSoft course in 2015. While it, too, is a few years old, the examples are quite good and it serves as a useful reference.

Gallery is a lightweight tool that lets users read art-formatted root files and make plots without having to write and build art modules. It works well with interpreted and compiled ROOT macros, and is thus ideally suited for data exploration and fast turnaround of making plots. It lacks the ability to use art services, however, though some LArSoft services have been split into services and service providers. The service provider code is intended to be able to run outside of the art framework and linked into separate programs.

Gallery also lacks the ability to write data products to an output file. You are of course free to open and write files of your own devising in your gallery programs. There are example gallery ROOT scripts in duneexamples/duneexamples/GalleryScripts. They are only in the git repository but do not get installed in the UPS product.

More documentation: https://art.fnal.gov/gallery/

LArSoft

Introductory Documentation

LArSoft’s home page: larsoft.org

The LArSoft wiki is here: larsoft-wiki.

Software structure

The LArSoft toolkit is a set of software components that simulate and reconstruct LArTPC data; it also provides tools for accessing raw data from the experiments. LArSoft contains an interface to GEANT4 (art does not list GEANT4 as a dependency) and to the GENIE generator. It contains geometry tools that are adapted for wire-based LArTPC detectors.

LArSoft provides a collection of shared simulation, reconstruction, and analysis tools, with art interfaces. Often a useful algorithm is developed by one experimental collaboration, which then wants to share it with other LArTPC collaborations; this is how much of the software in LArSoft came to be. Interfaces and services have to be standardized for shared use. Things like the detector geometry and the dead channel list, for example, are detector-specific, but shared simulation and reconstruction algorithms need to be able to access information from these services, which are not defined until an experiment’s software stack is set up and the lar program is invoked. LArSoft therefore uses plug-ins and class inheritance extensively to deal with these situations.

A recent graph (v10_00) of the UPS products in a full stack starting with dunesw is available here (dunesw). You can see the LArSoft pieces under dunesw, as well as GEANT4, GENIE, ROOT, and a few others.

LArSoft Data Products

A very good introduction to data products such as raw digits, calibrated waveforms, hits and tracks, that are created and used by LArSoft modules and usable by analyzers was given by Tingjun Yang at the 2019 ProtoDUNE analysis workshop (larsoft-data-products).

There are a number of data product dumper fcl files. A non-exhaustive list of useful examples is given below:

 dump_mctruth.fcl
 dump_mcparticles.fcl
 dump_simenergydeposits.fcl
 dump_simchannels.fcl
 dump_simphotons.fcl
 dump_rawdigits.fcl
 dump_wires.fcl
 dump_hits.fcl
 dump_clusters.fcl
 dump_tracks.fcl
 dump_pfparticles.fcl
 eventdump.fcl
 dump_lartpcdetector_channelmap.fcl
 dump_lartpcdetector_geometry.fcl

Some of these may require some configuration of input module labels so they can find the data products of interest. Try one of these yourself:

lar -n 1 -c dump_mctruth.fcl $SAMPLE_FILE

This command will make a file called DumpMCTruth.log which you can open in a text editor. Reminder: MCTruths are particles made by the generator(s), and MCParticles are those made by GEANT4, except for those owned by the MCTruth data products. Due to the showering nature of particle interactions in liquid argon, there are usually many more MCParticles than MCTruths.

Examples and current workflows

The page with instructions on how to find and look at ProtoDUNE data has links to standard fcl configurations for simulating and reconstructing ProtoDUNE data: https://wiki.dunescience.org/wiki/Look_at_ProtoDUNE_SP_data.

Try it yourself! The workflow for ProtoDUNE-SP MC is given in the Simulation Task Force web page.

Running on a dunegpvm machine at Fermilab

Warning - this takes time and has high peak memory use.

 export USER=`whoami`
 mkdir -p /exp/dune/data/users/$USER/tutorialtest
 cd /exp/dune/data/users/$USER/tutorialtest
 source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh

 export DUNELAR_VERSION=v10_07_00d00
 export DUNELAR_QUALIFIER=e26:prof
 setup dunesw $DUNELAR_VERSION -q $DUNELAR_QUALIFIER

 TMPDIR=/tmp lar -n 1 -c mcc12_gen_protoDune_beam_cosmics_p1GeV.fcl -o gen.root
 lar -n 1 -c protoDUNE_refactored_g4_stage1.fcl gen.root -o g4_stage1.root
 lar -n 1 -c protoDUNE_refactored_g4_stage2_sce_datadriven.fcl g4_stage1.root -o g4_stage2.root
 lar -n 1 -c protoDUNE_refactored_detsim_stage1.fcl g4_stage2.root -o detsim_stage1.root
 lar -n 1 -c protoDUNE_refactored_detsim_stage2.fcl detsim_stage1.root -o detsim_stage2.root
 lar -n 1 -c protoDUNE_refactored_reco_35ms_sce_datadriven_stage1.fcl detsim_stage2.root -o reco_stage1.root
 lar -c eventdump.fcl reco_stage1.root >& eventdump_output.txt
 config_dumper -P reco_stage1.root >& config_output.txt
 product_sizes_dumper -f 0 reco_stage1.root >& productsizes.txt

Note added November 22, 2023: The construct “TMPDIR=/tmp lar …” defines the environment variable TMPDIR only for the duration of the subsequent command on the line. This is needed for the tutorial example because the mcc12 gen stage copies a 2.9 GB file (see below – it’s the one we had to copy over to CERN) to /var/tmp using ifdh’s default temporary location. But the dunegpvm machines as of November 2023 seem to rarely have 2.9 GB of space in /var/tmp and you get a “no space left on device” error. The newer prod4 versions of the fcls point to a newer version of the beam particle generator that can stream this file using XRootD instead of copying it with ifdh. But the streaming flag is turned off by default in the prod4 fcl for the version of dunesw used in this tutorial, and so this is the minimal solution. Note for the next iteration: the Prod4 fcls are here: https://wiki.dunescience.org/wiki/ProtoDUNE-SP_Production_IV

Run the event display on your new Monte Carlo event

 lar -c evd_protoDUNE_data.fcl reco_stage1.root

and push the “Reconstructed” radio button at the bottom of the display.

Display decoded raw digits

To look at some raw digits in the event display, you need to decode a DAQ file or find one that’s already been decoded. The decoder fcl for ProtoDUNE-HD data taken in 2024 is run_pdhd_wibeth3_tpc_decoder.fcl. An event display of an example decoded file is

 lar -c evd_protoDUNE_data.fcl /exp/dune/data/users/trj/nov2024tutorial/np04hd_raw_run028707_0075_dataflow5_datawriter_0_20240815T154544_decode.root

which is a file taken in August 2024.

Running on HDF5 raw data

You have to preload a special XRootD library to stream HDF5-formatted data from vd-protodune and hd-protodune.

The prefix 'LD_PRELOAD=$XROOTD_LIB/libXrdPosixPreload.so' has to be on the same line as your 'lar' command.

In your Apptainer:

export DATA=root://ccxrootdegee.in2p3.fr:1094/pnfs/in2p3.fr/data/dune/disk/hd-protodune/d1/a6/np04hd_raw_run029147_0032_dataflow4_datawriter_0_20240912T110618.hdf5
LD_PRELOAD=$XROOTD_LIB/libXrdPosixPreload.so lar -c standard_reco_protodunehd_keepup.fcl $DATA -n 1

Running at CERN

This example puts all files in a subdirectory of your home directory. There is an input file for the ProtoDUNE-SP beamline simulation that is copied over, and you need to point the generation job at it. The above sequence of commands will work at CERN if you have a Fermilab grid proxy, but not everyone signed up for the tutorial can get one of these yet, so we copied the necessary file over and adjusted a fcl file to point at it. It also runs faster with the local copy of the input file than the above workflow, which copies it.

The apptainer command is slightly different as the mounts are different. Here we assume you are logged into an lxplus node running Alma9.

Note

CERN Apptainer variant

/cvmfs/oasis.opensciencegrid.org/mis/apptainer/current/bin/apptainer shell --shell=/bin/bash -B /cvmfs,/afs,/opt,/run/user,/etc/hostname,/etc/krb5.conf --ipc --pid  /cvmfs/singularity.opensciencegrid.org/fermilab/fnal-dev-sl7:latest

Make a fcl file and call it tmpgen.fcl

#include "mcc12_gen_protoDune_beam_cosmics_p1GeV.fcl"
physics.producers.generator.FileName: "/afs/cern.ch/work/t/tjunk/public/may2023tutorialfiles/H4_v34b_1GeV_-27.7_10M_1.root"

If you have difficulties opening an editor, you can create the same file from the command line:

echo '#include "mcc12_gen_protoDune_beam_cosmics_p1GeV.fcl"' > tmpgen.fcl
echo 'physics.producers.generator.FileName: "/afs/cern.ch/work/t/tjunk/public/may2023tutorialfiles/H4_v34b_1GeV_-27.7_10M_1.root"' >> tmpgen.fcl

then do some setup

 cd ~
 mkdir 2024Tutorial
 cd 2024Tutorial
 source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh

 export DUNELAR_VERSION=v10_07_00d00
 export LARSOFT_VERSION=${DUNELAR_VERSION}
 export DUNELAR_QUALIFIER=e26:prof
 setup dunesw $DUNELAR_VERSION -q $DUNELAR_QUALIFIER

Now you can run a sequence of lar steps to generate and reconstruct a file.

 lar -n 1 -c tmpgen.fcl -o gen.root
 lar -n 1 -c protoDUNE_refactored_g4_stage1.fcl gen.root -o g4_stage1.root
 lar -n 1 -c protoDUNE_refactored_g4_stage2_sce_datadriven.fcl g4_stage1.root -o g4_stage2.root
 lar -n 1 -c protoDUNE_refactored_detsim_stage1.fcl g4_stage2.root -o detsim_stage1.root
 lar -n 1 -c protoDUNE_refactored_detsim_stage2.fcl detsim_stage1.root -o detsim_stage2.root
 lar -n 1 -c protoDUNE_refactored_reco_35ms_sce_datadriven_stage1.fcl detsim_stage2.root -o reco_stage1.root
 lar -c eventdump.fcl reco_stage1.root >& eventdump_output.txt
 config_dumper -P reco_stage1.root >& config_output.txt
 product_sizes_dumper -f 0 reco_stage1.root >& productsizes.txt

You can also browse the root files with a TBrowser or run other dumper fcl files on them. The dump example commands above redirect their outputs to text files which you can edit with a text editor or run grep on to look for things.

You can run the event display with

lar -c evd_protoDUNE.fcl reco_stage1.root

but it will run very slowly over a tunneled X connection. A VNC session will be much faster. Tips: select the “Reconstructed” radio button at the bottom and click on “Unzoom Interest” on the left to see the reconstructed objects in the three views.

DUNE software documentation and how-to’s

The following legacy wiki page provides information on how to check out, build, and contribute to DUNE-specific LArSoft plug-in code.

https://cdcvs.fnal.gov/redmine/projects/dunetpc/wiki

The follow-up part of this tutorial gives hands-on exercises for doing these things.

Contributing to LArSoft

The LArSoft git repositories are hosted on GitHub and use a pull-request model. LArSoft’s github link is https://github.com/larsoft. DUNE repositories, such as the dunesw stack, protoduneana and garsoft, are also on GitHub; at the moment (though not for much longer) they allow users to push code directly.

To work with pull requests, see the documentation at this link: https://larsoft.github.io/LArSoftWiki/Developing_With_LArSoft

There are bi-weekly LArSoft coordination meetings https://indico.fnal.gov/category/405/ at which stakeholders, managers, and users discuss upcoming releases, plans, and new features to be added to LArSoft.

Useful tip: check out an inspection copy of larsoft

A good old-fashioned grep -r or a find command can be effective if you are looking for an example of how to call something but do not know where such an example might live. The copies of the LArSoft source in CVMFS lack the CMakeLists.txt files, so if that’s what you’re looking for to find examples, it’s good to have a copy checked out. Here’s a script that checks out all the LArSoft source and DUNE LArSoft code but does not compile it. Warning: it deletes a directory called “inspect” in your app area. Make sure /exp/dune/app/users/<yourusername> exists first:

Note

Remember the Apptainer! You can use your dunesl7 alias defined at the top of this page.

 #!/bin/bash
 USERNAME=`whoami`
 source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh
 cd /exp/dune/app/users/${USERNAME}
 rm -rf inspect
 mkdir inspect
 cd inspect
 mrb newDev
 source /exp/dune/app/users/${USERNAME}/inspect/localProducts*/setup
 cd srcs
 mrb g larsoft_suite
 mrb g larsoftobj_suite
 mrb g larutils
 mrb g larbatch
 mrb g dune_suite
 mrb g -d dune_raw_data dune-raw-data

Putting it to use: A very common workflow in developing software is to look for an example of how to do something similar to what you want to do. Let’s say you want to find some examples of how to use FindManyP – it’s the art class for retrieving associations between data products, and the art documentation isn’t as good as the examples for learning how to use it. You can use a recursive grep through your checked-out version, or you can even look through the installed source in CVMFS. This example looks through the duneprototypes product’s source files for FindManyP:

 cd $DUNEPROTOTYPES_DIR/source/duneprototypes
 grep -r -i findmanyp *

It is good to use the -i option to grep which tells it to ignore the difference between uppercase and lowercase string matches, in case you misremembered the case of what you are looking for. The list of matches is quite long – you may want to pipe the output of that grep into another grep

 grep -r -i findmanyp * | grep recob::Hit

The checked-out versions of the software have the advantage of providing some files that don’t get installed in CVMFS, notably CMakeLists.txt files and the UPS product_deps files, which you may want to examine when looking for examples of how to do things.

GArSoft

GArSoft is another art-based software package, designed to simulate the ND-GAr near detector. Many components were copied from LArSoft and modified for the pixel-based TPC with an ECAL. You can find installed versions in CVMFS with the following command:

ups list -aK+ garsoft

and you can check out the source and build it by following the instructions on the GArSoft wiki.

Key Points

  • Art provides the tools physicists in a large collaboration need in order to contribute software to a large, shared effort without getting in each other's way.

  • Art helps us keep track of our data and job configuration, reducing the chances of producing mystery data whose origin no one knows.

  • LArSoft is a set of simulation and reconstruction tools shared among the liquid-argon TPC collaborations.


Bonus episode -- Code-makeover on how to code for better efficiency

Overview

Teaching: 50 min
Exercises: 0 min
Questions
  • How do I write efficient code?

Objectives
  • Learn good tips and tools to improve your code.

Session Video

The session will be captured on video and placed here after the workshop for asynchronous study.

Live Notes

Code Make-over

How to improve your code for better efficiency

DUNE simulation, reconstruction and analysis jobs take a lot of memory and CPU time. This owes to the large size of the Far Detector modules as well as the many channels in the Near Detectors. Reading out a large volume for a long time with high granularity creates a lot of data that needs to be stored and processed.

CPU optimization:

Run with the “prof” build when launching big jobs. While both the “debug” and “prof” builds have debugging and profiling information included in the executables and shared libraries, the “prof” build has a high level of compiler optimization turned on while “debug” has optimizations disabled. Debugging with the “prof” build can be done, but it is more difficult because operations can be reordered and some variables get put in CPU registers instead of inspectable memory. The “debug” builds are generally much slower, by a factor of four or more. Often this difference is so stark that the time spent repeatedly waiting for a slow program to chug through the first trigger record in an interactive debugging session is more costly than the inconvenience of not being able to see some of the variables in the debugger. If you are not debugging, then there really is (almost) no reason to use the “debug” builds. If your program produces a different result when run with the debug build and the prof build (and it’s not just the random seed), then there is a bug to be investigated.

Compile your interactive ROOT scripts instead of running them in the interpreter. At the ROOT prompt, use .L myprogram.C++ (even though its filename is myprogram.C); this forces a compile. .x myprogram.C++ will compile and then execute it. .L myprogram.C+ will compile it only if necessary.

Run gprof or other profilers like valgrind’s callgrind: You might be surprised at what is actually taking all the time in your program. There is abundant documentation on the web, and also the valgrind online documentation. There is no reason to profile a “debug” build, and there is no need to hand-optimize something the compiler will optimize anyway; doing so may even hurt the optimality of the compiler-optimized version.

The Debugger can be used as a simple profiler: If your program is horrendously slow (and/or it used to be fast), pausing it at any time is likely to pause it while it is doing its slow thing. Run your program in the debugger, pause it when you think it is doing its slow thing (i.e. after initialization), and look at the call stack. This technique can be handy because you can then inspect the values of variables that might give a clue if there’s a bug making your program slow (e.g. looping over 10^15 wires in the Far Detector, which would indicate a bug, such as an uninitialized loop counter or an unsigned loop counter that is initialized with a negative value).

Don’t perform calculations or do file i/o that will only later be ignored. It’s just a waste of time. If you need to pre-write some code because in future versions of your program the calculation is not ignored, comment it out, or put a test around it so it doesn’t get executed when it is not needed.

Extract constant calculations out of loops.

Code Example (BAD)

double sum = 0;
for (size_t i=0; i<n_channels; ++i)
{
  sum += result.at(i)/TMath::Sqrt(2.0);
}

Code Example (GOOD)

double sum = 0;
double f = TMath::Sqrt(0.5);
for (size_t i=0; i<n_channels; ++i)
{
  sum += result.at(i)*f;
}

The example above also takes advantage of the fact that floating-point multiplies generally have significantly less latency than floating-point divides (this is still true, even with modern CPUs).

Use sqrt(): Don’t use pow() or TMath::Power when a multiplication or sqrt() function can be used.

Code Example (BAD)

double r = TMath::Power( TMath::Power(x,2) + TMath::Power(y,2), 0.5);

Code Example (GOOD)

double r = TMath::Sqrt( x*x + y*y );

The reason is that TMath::Power (or the C math library’s pow()) function must take the logarithm of one of its arguments, multiply it by the other argument, and exponentiate the result. Modern CPUs have a built-in SQRT instruction. Modern versions of pow() or Power may check the power argument for 2 and 0.5 and instead perform multiplies and SQRT, but don’t count on it.

If the things you are squaring above are complicated expressions, use TMath::Sq() to eliminate the need for typing them out twice or creating temporary variables. Or worse, evaluating slow functions twice. The optimizer cannot optimize the second call to that function because it may have side effects like printing something out to the screen or updating some internal variable and you may have intended for it to be called twice.

Code Example (BAD)

double r = TMath::Sqrt( slow_function_calculating_x()*
                        slow_function_calculating_x() +
                        slow_function_calculating_y()*
                        slow_function_calculating_y() );

Code Example (GOOD)

double r = TMath::Sqrt( TMath::Sq(slow_function_calculating_x()) +
                        TMath::Sq(slow_function_calculating_y()) );

Don’t call sqrt() if you don’t have to.

Code Example (BAD)

if (TMath::Sqrt( x*x + y*y ) < rcut )
{
  do_something();
}

Code Example (GOOD)

double rcutsq = rcut*rcut;
if (x*x + y*y < rcutsq)
{
  do_something();
}

Use binary search features in the STL rather than a step-by-step lookup.

Code Example (BAD)

std::vector<int> my_vector;
(fill my_vector with stuff)

size_t indexfound = 0;
bool found = false;
for (size_t i=0; i<my_vector.size(); ++i)
{
  if (my_vector.at(i) == desired_value)
  {
    indexfound = i;
    found = true;
  }
}

If you have to search through a list of items many times, it is best to sort it and use std::lower_bound; see the example here. std::map is sorted, and std::unordered_map uses a quicker hash table. Generally looking things up in maps is O(log(n)) and in a std::unordered_map is O(1) in CPU time, while searching for it from the beginning is O(n). The bad example above can be sped up by an average factor of 2 by putting a break statement after found=true; if you want to find the first instance of an object. If you want to find the last instance, just count backwards and stop at the first one you find; or use std::upper_bound.
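
For illustration, here is a sketch of the sort-once-then-std::lower_bound approach described above (desired_value is an illustrative placeholder; note that the resulting index refers to the sorted vector):

#include <algorithm>
#include <vector>

std::vector<int> my_vector;
// (fill my_vector with stuff)
int desired_value = 42;   // the value we are looking for

std::sort(my_vector.begin(), my_vector.end());   // pay the sort cost once
auto it = std::lower_bound(my_vector.begin(), my_vector.end(), desired_value);
bool found = (it != my_vector.end() && *it == desired_value);
size_t indexfound = found ? static_cast<size_t>(it - my_vector.begin()) : 0;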

Don’t needlessly mix floats and doubles.

Code Example (BAD)

double sum = 0;
std::vector <double> results;
(fill lots of results)
for (size_t i=0; i<results.size(); ++i)
{
  float rsq = results.at(i)*results.at(i);
  sum += rsq;
}

Code Example (GOOD)

double sum = 0;
std::vector <double> results;
(fill lots of results)
for (size_t i=0; i<results.size(); ++i)
{
  sum += TMath::Sq(results.at(i));
}

Minimize conversions between int and float or double

The up-conversion from int to float takes time, and the down-conversion from float to int loses precision and also takes time. Sometimes you want the precision loss, but sometimes it’s a mistake.

Check for NaN and Inf. While your program will still function if an intermediate result is NaN or Inf (and it may even produce valid output, especially if the NaN or Inf is irrelevant), processing NaNs and Infs is slower than processing valid numbers. Letting a NaN or an Inf propagate through your calculations is almost never the right thing to do - check functions for domain validity (square roots of negative numbers, logarithms of zero or negative numbers, divide by zero, etc.) when you execute them and decide at that point what to do. If you have a lengthy computation and the end result is NaN, it is often ambiguous at what stage the computation failed.
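
A minimal sketch of checking domain validity at the point of calculation instead of letting a NaN propagate (safeSqrt is an illustrative helper, not part of any DUNE library):

#include <cmath>

double safeSqrt(double x)
{
  if (!std::isfinite(x) || x < 0.0) {
    // decide here what to do: log a warning, return a sentinel, throw, ...
    return 0.0;
  }
  return std::sqrt(x);
}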

Pass objects by reference. Especially big ones. C and C++ call semantics specify that objects are passed by value by default, meaning that the called method gets a copy of the input. This is okay for scalar quantities like int and float, but not okay for a big vector, for example. The thing to note then is that the called method may modify the contents of the passed object, while an object passed by value can be expected not to be modified by the called method.
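
A sketch of the calling conventions described above (function names are illustrative). Passing a large container by (const) reference avoids the copy, and const documents that the callee will not modify it:

#include <vector>

double sumByValue(std::vector<double> v);            // copies the whole vector
double sumByConstRef(const std::vector<double>& v);  // no copy; cannot modify v
void   fillVector(std::vector<double>& v);           // no copy; may modify v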

Use references to receive returned objects created by methods. That way they don’t get copied. The example below is from the VD coldbox channel map. Bad, inefficient code courtesy of Tom Junk, and good code suggestion courtesy of Alessandro Thea. The infotochanmap object is a map of maps of maps: std::unordered_map<int,std::unordered_map<int,std::unordered_map<int,int> > > infotochanmap;

Code Example (BAD)

int dune::VDColdboxChannelMapService::getOfflChanFromWIBConnectorInfo(int wib, int wibconnector, int cechan)
{
  int r = -1;
  auto fm1 = infotochanmap.find(wib);
  if (fm1 == infotochanmap.end()) return r;
  auto m1 = fm1->second;
  auto fm2 = m1.find(wibconnector);
  if (fm2 == m1.end()) return r;
  auto m2 = fm2->second;
  auto fm3 = m2.find(cechan);
  if (fm3 == m2.end()) return r;
  r = fm3->second;
  return r;
}

Code Example (GOOD)

int dune::VDColdboxChannelMapService::getOfflChanFromWIBConnectorInfo(int wib, int wibconnector, int cechan)
{
  int r = -1;
  auto fm1 = infotochanmap.find(wib);
  if (fm1 == infotochanmap.end()) return r;
  auto& m1 = fm1->second;
  auto fm2 = m1.find(wibconnector);
  if (fm2 == m1.end()) return r;
  auto& m2 = fm2->second;
  auto fm3 = m2.find(cechan);
  if (fm3 == m2.end()) return r;
  r = fm3->second;
  return r;
}

Minimize cloning TH1’s. It is really slow.

Minimize formatted I/O. Formatting strings for output is CPU-consuming, even if they are never printed to the screen or output to your logfile. MF_LOG_INFO calls for example must prepare the string for printing even if it is configured not to output it.

Avoid using caught exceptions as part of normal program operation. While this isn’t an efficiency issue or even a code readability issue, it is a problem when debugging programs. Most debuggers have a feature to set a breakpoint on thrown exceptions. This is sometimes necessary in order to track down a stubborn bug. Bugs that stop program execution, like segmentation faults, are sometimes easier to track down than caught exceptions (which often aren’t even bugs, though sometimes they are). If many caught exceptions take place before the buggy one, then the breakpoint on thrown exceptions has limited value.

Use sparse matrix tools where appropriate. This also saves memory.

Minimize database access operations. Bundle the queries together in blocks if possible. Do not pull more information than is needed out of the database. Cache results so you don’t have to repeat the same data retrieval operation.

Use std::vector::reserve() in order to size your vector right if you know in advance how big it will be. If you push_back() to expand a std::vector beyond its current capacity, it will allocate twice the memory of the existing vector and copy the contents of the old vector to the new memory. This operation will be repeated each time you start with a zero-size vector and push_back a lot of data (see the sketch below).

Factorize your program into parts that do I/O and parts that compute. That way, if you don’t need to do one of them, you can switch it off without having to rewrite everything. Example: say you read data in from a file and make a histogram that you are sometimes interested in looking at but usually not. The data reader should not always make the histogram by default; it should be put in a separate module which can be steered with fcl, so the computations needed to calculate the items that fill the histogram can be saved.
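
A minimal sketch of the std::vector::reserve() advice above (n_results and compute() are illustrative placeholders):

#include <vector>

std::vector<double> results;
results.reserve(n_results);              // allocate once, up front
for (size_t i = 0; i < n_results; ++i) {
  results.push_back(compute(i));         // no repeated reallocate-and-copy cycles
}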

Memory optimization:

Use valgrind. Its default operation checks for memory leaks and invalid accesses. Search the output for the words “invalid” and “lost”. Valgrind is a UPS product you can set up along with everything else. It is set up as part of the dunesw stack.

setup valgrind
valgrind --leak-check=yes --suppressions=$ROOTSYS/etc/valgrind-root.supp myprog arg1 arg2

More information is available here. ROOT-specific suppressions are described here. You can omit them, but your output file will be cluttered up with messages about things that ROOT does routinely that are not bugs.

Use massif. massif is a heap checker, a tool provided with valgrind; see documentation here.

Free up memory after use. Don’t hoard it after your module’s exited.

Don’t constantly re-allocate memory if you know you’re going to use it again right away.

Use STL containers instead of fixed-size arrays, to allow for growth in size. Back in the bad old days (Fortran 77 and earlier), fixed-size arrays had to be declared at compile time that were as big as they possibly could be, both wasting memory on average and creating artificial cutoffs on the sizes of problems that could be handled. This behavior is very easy to replicate in C++. Don’t do it.

Be familiar with the standard containers and their access idioms. These include std::vector, std::map, std::unordered_map, std::set, and std::list.

Minimize the use of new and delete to reduce the chances of memory leaks. If your program doesn’t leak memory now, that’s great, but years from now after maintenance has been transferred, someone might introduce a memory leak.

Use move semantics to transfer data ownership without copying it.

Do not store an entire event’s worth of raw digits in memory all at once. Find some way to process the data in pieces.

Consider using more compact representations in memory. A float takes half the space of a double. A size_t is 64 bits long (usually). Often that’s needed, but sometimes it’s overkill.

Optimize the big uses and don’t spend a lot of time on things that don’t matter. If you have one instance of a loop counter that’s a size_t and it loops over a million vector entries, each of which is an int, look at the entries of the vector, not the loop counter (which ought to be on the stack anyway).

Rebin histograms. Some histograms, say binned in channels x ticks or channels x frequency bins for a 2D FFT plot, can get very memory hungry.

I/O optimization:

Do as much calculation as you can per data element read. You can spin over a TTree once per plot, or you can spin through the TTree once and make all the plots. ROOT compresses data by default on write and uncompresses it on readin, so this is both an I/O and a CPU issue, to minimize the data that are read.

Read only the data you need. ROOT’s TTree access methods are set up to give you only the requested TBranches. If you use TTree::MakeClass to write a template analysis ROOT macro script, it will generate code that reads in all TBranches and leaves. It is easy to trim out the extras to speed up your workflow.
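
A sketch of enabling only the TBranches you need in a plain ROOT macro (the file, tree, and branch names are illustrative placeholders):

TFile f("myntuple.root");
TTree* t = (TTree*)f.Get("mytree");
t->SetBranchStatus("*", 0);              // switch everything off ...
t->SetBranchStatus("trkLength", 1);      // ... then enable only what you use
double trkLength = 0;
t->SetBranchAddress("trkLength", &trkLength);
for (Long64_t i = 0; i < t->GetEntries(); ++i) {
  t->GetEntry(i);
  // fill histograms with trkLength here
}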

Saving compressed data reduces I/O time and storage needs. Even though compressing data takes CPU, with a slow disk or network your workflow can in fact be faster if you trade CPU time for reduced disk read time.

Stream data with xrootd. You will wait less for your first event than if you copy the file, put less stress on the data storage elements, and have more reliable I/O with dCache.

Build time optimization:

Minimize the number of #included files. If you don’t need an #include, don’t use it. It takes time to find these files in the search path and include them.

Break up very large source files into pieces. g++’s analysis and optimization steps take an amount of time that grows faster than linearly with the number of source lines.

Use ninja instead of make. Instructions are here.

Workflow optimization:

Pre-stage your datasets. It takes a lot of time to wait for a tape (sometimes hours!). CPUs are accounted by wall-clock time, whether you’re using them or not, so if your jobs are waiting for data, they will run slowly even if you optimized the CPU usage. Pre-stage your data!

Run a test job. If you have a bug, you will save time by not submitting large numbers of jobs that might not work.

Write out your variables in your own analysis ntuples (TTrees). You will likely have to run over the same MC and data events repeatedly, and the faster this is, the better. You will have to adjust your cuts, tune your algorithms, estimate systematic uncertainties, train your deep-learning functions, debug your program, and tweak the appearance of your plots. Ideally, if the data you need to do these operations are available interactively, you will be able to perform these tasks faster. Choose a minimal set of variables to put in your ntuples to save on storage space.

Write out histograms to ROOT files and decorate them in a separate script. You may need to experiment many times with borders, spacing, ticks, fonts, colors, line widths, shading, labels, titles, legends, axis ranges, etc. Best not to have to re-compute the contents when you’re doing this, so save the histograms to a file first and read them back in to touch them up for presentation.
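
A hedged sketch of the second step, reading a saved histogram back and decorating it; the file name, histogram name, and styling choices are placeholders:

```cpp
// Sketch only: "hists.root" and "hNClusters" are placeholder names; the styling is arbitrary.
#include "TFile.h"
#include "TH1F.h"
#include "TCanvas.h"

void decorateHisto() {
  TFile f("hists.root");
  TH1F* h = dynamic_cast<TH1F*>(f.Get("hNClusters"));
  if (!h) return;

  // Only the drawing is redone here; the expensive event loop that filled the
  // histogram does not need to be repeated.
  h->SetLineWidth(2);
  h->SetLineColor(kBlue);
  h->SetTitle("Number of clusters;N_{clusters};events");

  TCanvas c("c", "c", 800, 600);
  h->Draw("hist");
  c.SaveAs("hNClusters.pdf");
}
```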

Software readability and maintainability:

Keep the test suite up to date. dunesw and larsoft have many examples of unit tests and integration tests. A colleague’s commit to your code, to a different piece of code, or even to a data file might break your code in unexpected, difficult-to-diagnose ways. The continuous integration (CI) system is there to catch such breakage, as well as even small changes in run time, memory consumption, and data product output.

Keep your methods short. If you have loaded up a lot of functionality in a method, it may become hard to reuse the components to do similar things. A long method is probably doing a lot of different things that can be given meaningful names.

Update the comments when code changes. Not many things are more confusing than an out-of-date comment that refers to how code used to work long ago.

Update names when meaning changes. As software evolves, the meaning of the variables may shift. It may be a quick fix to change the contents of a variable without changing its name, but some variables may then contain content that is the opposite of what the variable name implies. While the code will run, future maintainers will get confused.

Use const frequently. The const keyword prevents overwriting variables unintentionally. Constness is how art protects the data in its event memory. This mechanism is exposed to the user in that pointers to const memory must be declared as pointers to const, or you will get obscure error messages from the compiler. Const can also protect you from yourself and your colleagues when you know that the contents of a variable ought not to change.
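
A small illustration of const in action (the function and variable names are invented, not from LArSoft):

```cpp
#include <vector>

// Illustrative function only.
double sumADC(const std::vector<double>& adcs) {  // const reference: no copy, no modification
  double sum = 0.0;
  for (double const a : adcs) {
    sum += a;
    // a += 1.0;          // would not compile: a is const
  }
  return sum;
}

int main() {
  const std::vector<double> waveform = {1.0, 2.5, 3.5};
  // waveform.push_back(4.0);   // would not compile: waveform is const
  return sumADC(waveform) > 0.0 ? 0 : 1;
}
```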

Use simple constructs even if they are more verbose. Sometimes very clever, terse expressions get the job done, but they can be difficult for a human to understand if and when that person must make a change. There is an obfuscated C contest if you want to see examples of difficult-to-read code (that may in fact be very efficient! But people time is important, too).

Always initialize variables when you declare them. Compilers will warn about the use of uninitialized variables, so you will get used to doing this anyway. The initialization step takes a little time and it is not needed if the first use of the memory is to set the variable, which is why compilers do not automatically initialize variables.

Minimize the scope of variables. Often a variable will only have a meaningful value inside of a loop. You can declare variables as you use them. Old languages like Fortran 77 insisted that you declare all variables at the start of a program block. This is not true in C and C++. Declaring variables inside of blocks delimited by braces means they will go out of scope when the program exits the block, both freeing the memory and preventing you from referring to the variable after the loop is done and only seeing the last value it took. Sometimes that is the desired behaviour, though, so this is not a blanket rule.
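
A small sketch combining the last two points, initializing at declaration and keeping scopes tight (the names are invented):

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
  std::vector<int> hitsPerEvent = {3, 1, 4, 1, 5};

  int total = 0;                          // initialized at declaration
  for (std::size_t i = 0; i < hitsPerEvent.size(); ++i) {
    int nhits = hitsPerEvent[i];          // declared inside the loop: smallest useful scope
    total += nhits;
  }
  // nhits is out of scope here, so it cannot be used by accident

  std::cout << "total hits: " << total << "\n";
  return 0;
}
```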

Coding for Thread Safety

Modern CPUs often have many cores available. It is not unusual for a grid worker node to have as many as 64 cores on it, and 128 GB of RAM. Making use of the available hardware to maximize throughput is an important way to optimize our time and resources. DUNE jobs tend to be “embarrassingly parallel”, in that they can be divided up into many small jobs that do not need to communicate with one another. Therefore, making use of all the cores on a grid node is usually as easy as breaking a task up into many small jobs and letting the grid schedulers work out which jobs run where. The issue, however, is effective memory usage. If several small jobs share a lot of memory whose contents do not change (code libraries loaded into RAM, geometry description, calibration constants), then one can group the work together into a single job that uses multiple threads to get the work done faster. If the memory usage of a job is dominated by per-event data, then loading multiple events’ worth of data into RAM in order to keep all the cores fed may not provide a noticeable improvement in CPU utilization relative to the extra memory used.

Sometimes multithreading has advantages within a trigger record. Data from different wires or APAs may be processed simultaneously. One thing software managers would like to keep controllable is the number of threads a program is allowed to spawn. Some grid sites do not have an automatic protection against a program that creates more threads than the CPUs it has requested. Instead, a human operator may notice that the load on a system is far greater than the number of cores, and track down and ban the offending job submitter (this has already happened on DUNE). If a program contains components, some of which manage their own threads, then it becomes hard to manage the total thread count in a program. Multithreaded art keeps track of the total thread count using TBB (Threading Building Blocks).

See this very thorough presentation by Kyle Knoepfel at the 2019 LArSoft workshop. Several other talks at the workshop also focus on multi-threaded software. In short, if data are shared between threads and they are mutable, this is a recipe for race conditions and non-reproducible behavior of programs. Giving each thread a separate instance of each object is one way to contain possible race conditions. Alternately, private and public class members which do not change or which have synchronous access methods can also help provide thread safety.
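
As a language-level illustration of the shared-versus-private distinction, here is a plain C++ sketch using std::thread (art itself schedules work through TBB; the names below are invented):

```cpp
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

int main() {
  // Incrementing a plain int from several threads is a data race;
  // std::atomic makes the shared update safe.
  std::atomic<long> sharedCount{0};

  auto work = [&sharedCount]() {
    long localCount = 0;               // per-thread data: nothing shared, no race
    for (int i = 0; i < 100000; ++i) ++localCount;
    sharedCount += localCount;         // one synchronized update per thread
  };

  std::vector<std::thread> pool;
  for (int t = 0; t < 4; ++t) pool.emplace_back(work);
  for (auto& th : pool) th.join();

  std::cout << "count = " << sharedCount << "\n";  // always 400000
  return 0;
}
```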

Key Points

  • CPU, memory, and build time optimizations are possible when good code practices are followed.


Multi Repository Build (mrb) system (2024)

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • How are different software versions handled?

Objectives
  • Understand the roles of the tool mrb

mrb

What is mrb and why do we need it?
Early on, the LArSoft team chose git and cmake as the software version manager and the build language, respectively, to keep up with industry standards and to take advantage of their new features. When we clone a git repository to a local copy and check out the code, we end up building it all. We would like LArSoft and DUNE code to be more modular, or at least the builds should reflect some of the inherent modularity of the code.

Ideally, we would like to only have to recompile a fraction of the software stack when we make a change. The granularity of the build in LArSoft and other art-based projects is the repository. So LArSoft and DUNE have divided code up into multiple repositories (DUNE ought to divide more than it has, but there are a few repositories already with different purposes). Sometimes one needs to modify code in multiple repositories at the same time for a particular project. This is where mrb comes in.

mrb stands for “multi-repository build”. mrb has features for cloning git repositories, setting up build and local products environments, building code, and checking for consistency (i.e. that there are not two modules with the same name or two fcl files with the same name). mrb builds UPS products – when it installs the built code into the localProducts directory, it also makes the necessary UPS table files and .version directories. mrb also has a tool for making a tarball of a build product for distribution to the grid. The software build example later in this tutorial exercises some of the features of mrb.

| Command | Action |
| --- | --- |
| mrb --help | prints list of all commands with brief descriptions |
| mrb \<command\> --help | displays help for that command |
| mrb gitCheckout | clone a repository into working area |
| mrbsetenv | set up build environment |
| mrb build -jN | builds local code with N cores |
| mrb b -jN | same as above |
| mrb install -jN | installs local code with N cores |
| mrb i -jN | same as above (this will do a build also) |
| mrbslp | set up all products in localProducts… |
| mrb z | get rid of everything in build area |

Link to the mrb reference guide

Exercise 1

There is no exercise for this episode. mrb example exercises will be covered in a later session, as any useful exercise with mrb takes more than 30 minutes on its own. Everyone gets 100% credit for this exercise!

Key Points

  • The multi-repository build (mrb) tool allows code modification in multiple repositories, which is relevant for a large project like LArSoft with different cases (end user and developers) demanding consistency between the builds.


Expert in the Room - LArSoft How to modify a module - in progress

Overview

Teaching: 15 min
Exercises: 0 min
Questions
  • How do I check out, modify, and build DUNE code?

Objectives
  • How to use mrb.

  • Set up your environment.

  • Download source code from DUNE’s git repository.

  • Build it.

  • Run an example program.

  • Modify the job configuration for the program.

  • Modify the example module to make a custom histogram.

  • Test the modified module.

  • Stretch goal – run the debugger.

First learn a bit about the MRB system

Link to the mrb episode

Getting set up

You will need three login sessions. These have different environments set up.

Session 1

Start up session #1, editing code, on one of the dunegpvm*.fnal.gov interactive nodes. These scripts have also been tested on the lxplus.cern.ch interactive nodes.

Note Remember the Apptainer!

see below for special Apptainers for CERN and build machines.

Create two scripts in your home directory:

newDev2024Tutorial.sh should have these contents:

#!/bin/bash
export DUNELAR_VERSION=v10_07_00d00
export PROTODUNEANA_VERSION=$DUNELAR_VERSION
DUNELAR_QUALIFIER=e26:prof
DIRECTORY=2024tutorial
USERNAME=`whoami`
export WORKDIR=/exp/dune/app/users/${USERNAME}
if [ ! -d "$WORKDIR" ]; then
  export WORKDIR=`echo ~`
fi

source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh

cd ${WORKDIR}
touch ${DIRECTORY}
rm -rf ${DIRECTORY}
mkdir ${DIRECTORY}
cd ${DIRECTORY}
mrb newDev -q ${DUNELAR_QUALIFIER}
source ${WORKDIR}/${DIRECTORY}/localProducts*/setup
mkdir work
cd srcs
mrb g -t ${PROTODUNEANA_VERSION} protoduneana

cd ${MRB_BUILDDIR}
mrbsetenv
mrb i -j16

and setup2024Tutorial.sh should have these contents:

DIRECTORY=2024tutorial
USERNAME=`whoami`
# these must match the versions used in newDev2024Tutorial.sh
export DUNELAR_VERSION=v10_07_00d00
export DUNELAR_QUALIFIER=e26:prof

source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh
export WORKDIR=/exp/dune/app/users/${USERNAME}
if [ ! -d "$WORKDIR" ]; then
  export WORKDIR=`echo ~`
fi

cd $WORKDIR/$DIRECTORY
source localProducts*/setup
cd work
setup dunesw $DUNELAR_VERSION -q $DUNELAR_QUALIFIER
mrbslp

Execute this command to make the first script executable.

  chmod +x newDev2024Tutorial.sh

It is not necessary to chmod the setup script. Problems writing to your home directory? Check to see if your Kerberos ticket has been forwarded.

  klist

Session 2

Start up session #2 by logging in to one of the build nodes, dunebuild02.fnal.gov or dunebuild03.fnal.gov. They have at least 16 cores apiece, while the dunegpvms have only four, so builds run much faster on them. If all tutorial users log on to the same one and try building all at once, the build nodes may become very slow or run out of memory. The lxplus nodes are generally big enough to build sufficiently quickly. The Fermilab build nodes should not be used to run programs (people need them to build code!).

Note – interactive computers at Fermilab will print out how much RAM and swap, and how many CPU threads, the node has when you log in. In general, a build that launches more processes than a machine has threads will not run any faster, but it will use more memory. So the command “mrb i -j16” above is intended to be run on a build node with at least 16 threads and enough memory to support 16 simultaneous invocations of the C++ compiler, which may take up to 2 GB per invocation.

Note: you need a modified container on the build machines and at CERN, as they don’t mount /pnfs.

This is done to prevent people from running interactive jobs on the dedicated build machines.

FNAL build machines

# remove /pnfs/ for build machines
/cvmfs/oasis.opensciencegrid.org/mis/apptainer/current/bin/apptainer shell --shell=/bin/bash  -B /cvmfs,/exp,/nashome,/opt,/run/user,/etc/hostname,/etc/hosts,/etc/krb5.conf --ipc --pid /cvmfs/singularity.opensciencegrid.org/fermilab/fnal-dev-sl7:latest

CERN

/cvmfs/oasis.opensciencegrid.org/mis/apptainer/current/bin/apptainer shell --shell=/bin/bash -B /cvmfs,/afs,/opt,/run/user,/etc/hostname,/etc/krb5.conf --ipc --pid /cvmfs/singularity.opensciencegrid.org/fermilab/fnal-dev-sl7:latest

Download source code and build it

On the build node, execute the newDev script:

  ./newDev2024Tutorial.sh

Note that this script deletes the directory intended to store the source code and built code and makes a new one, in order to start clean. Be careful not to execute this script again after you have worked on the code, as it will wipe out your work and start fresh.

This build script will take a few minutes to check code out and compile it.

The mrb g command does a git clone of the specified repository with an optional tag and destination name. More information is available here and here.

Some comments on the build command

  mrb i -j16

The -j16 says how many concurrent processes to run. Set the number to no more than the number of cores on the computer you’re running it on. A dunegpvm machine has four cores, and the two build nodes each have 16. Running more concurrent processes on a computer with a limited number of cores won’t make the build finish any faster, but you may run out of memory. The dunegpvms do not have enough memory to run 16 instances of the C++ compiler at a time, and you may see the word killed in your error messages if you ask to run many more concurrent compile processes than the interactive computer can handle.

You can find the number of cores a machine has with

  cat /proc/cpuinfo

The mrb system builds code in a directory distinct from the source code. Source code is in $MRB_SOURCE and built code is in $MRB_BUILDDIR. If the build succeeds (no error messages; compiler warnings are treated as errors and will stop the build, forcing you to fix the problem), then the built artifacts are put in $MRB_TOP/localProducts*. mrbslp directs ups to search in $MRB_TOP/localProducts* first for software and necessary components like fcl files. It is good to separate the build directory from the install directory, as a failed build will not prevent you from running the program from the last successful build. But you do have to look at the error messages from the build step before running a program: if you edit source code, make a mistake, and the build fails, the program may still run with the last version that compiled successfully, and you will be left wondering why your code changes are having no effect. You can look in $MRB_TOP/localProducts* to see if new code has been installed (look for the “lib” directory under the architecture-specific directory of your product).

Because you ran the newDev2024Tutorial.sh script instead of sourcing it, the environment it set up within it is not retained in the login session you ran it from. You will need to set up your environment again. You will need to do this when you log in anyway, so it is good to have that setup script. In session #2, type this:

  source setup2024Tutorial.sh
  cd $MRB_BUILDDIR
  mrbsetenv

The shell command “source” instructs the command interpreter (bash) to read commands from the file setup2024Tutorial.sh as if they were typed at the terminal. This way, environment variables set up by the script stay set up. Do the following in session #1, the source editing session:

  source setup2024Tutorial.sh
  cd $MRB_SOURCE
  mrbslp

Run your program

YouTube Lecture Part 2: Start up the session for running programs – log in to a dunegpvm interactive computer for session #3

  source setup2024Tutorial.sh
  mrbslp
  setup_fnal_security

We need to locate an input file. Here are some tips for finding input data:

https://wiki.dunescience.org/wiki/Look_at_ProtoDUNE_SP_data

Data and MC files are typically on tape, but can be cached on disk so you don’t have to wait possibly a long time for the file to be staged in. Check to see if a sample file is in dCache or only on tape:

cache_state.py PDSPProd4a_protoDUNE_sp_reco_stage1_p1GeV_35ms_sce_datadriven_18800650_2_20210414T012053Z.root

Get the xrootd URL:

samweb get-file-access-url --schema=root PDSPProd4a_protoDUNE_sp_reco_stage1_p1GeV_35ms_sce_datadriven_18800650_2_20210414T012053Z.root

which should print the following URL:

root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/tape_backed/dunepro/protodune-sp/full-reconstructed/2021/mc/out1/PDSPProd4a/18/80/06/50/PDSPProd4a_protoDUNE_sp_reco_stage1_p1GeV_35ms_sce_datadriven_18800650_2_20210414T012053Z.root

Now run the program with the input file accessed by that URL:

lar -c analyzer_job.fcl root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/tape_backed/dunepro/protodune-sp/full-reconstructed/2021/mc/out1/PDSPProd4a/18/80/06/50/PDSPProd4a_protoDUNE_sp_reco_stage1_p1GeV_35ms_sce_datadriven_18800650_2_20210414T012053Z.root

CERN Users without access to Fermilab’s dCache: – example input files for this tutorial have been copied to /afs/cern.ch/work/t/tjunk/public/2024tutorialfiles/.

After running the program, you should have an output file tutorial_hist.root. Note – please do not store large rootfiles in /exp/dune/app! The disk is rather small, and we’d like to save it for applications, not data. But this file ought to be quite small. Open it in root

  root tutorial_hist.root

and look at the histograms and trees with a TBrowser. It is empty!

Adjust the program’s job configuration

In Session #1, the code editing session,

  cd ${MRB_SOURCE}/protoduneana/protoduneana/TutorialExamples/

See that analyzer_job.fcl includes clustercounter.fcl. The module_type line in that fcl file defines the name of the module to run, and ClusterCounter_module.cc’s analyze() method just prints out a line to stdout for each event, without making any histograms or trees.

Aside on module labels and types: A module label is used to identify which modules to run in which order in a trigger path in an art job, and also to label the output data products. The “module type” is the name of the source file: moduletype_module.cc is the filename of the source code for a module with class name moduletype. The build system preserves this and makes a shared object (.so) library that art loads when it sees a particular module_type in the configuration document. The reason there are two names here is so you can run a module multiple times in a job, usually with different inputs. Underscores are not allowed in module types or module labels because they are used in contexts that separate fields with underscores.

Let’s do something more interesting than ClusterCounter_module’s print statement.

Let’s first experiment with the configuration to see if we can get some output. In Session #3 (the running session),

  fhicl-dump analyzer_job.fcl > tmp.txt

and open tmp.txt in a text editor. You will see which blocks in there contain the fcl parameters you need to adjust. Make a new fcl file in the work directory called myana.fcl with these contents:

#include "analyzer_job.fcl"

physics.analyzers.clusterana.module_type: "ClusterCounter3"

Try running it:

  lar -c myana.fcl root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/tape_backed/dunepro/protodune-sp/full-reconstructed/2021/mc/out1/PDSPProd4a/18/80/06/50/PDSPProd4a_protoDUNE_sp_reco_stage1_p1GeV_35ms_sce_datadriven_18800650_2_20210414T012053Z.root

but you will get error messages about “product not found”. Inspection of ClusterCounter3_module.cc in Session #1 shows that it is looking for input clusters. Let’s see if we have any in the input file, but with a different module label for the input data.

Look at the contents of the input file:

  product_sizes_dumper root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/tape_backed/dunepro/protodune-sp/full-reconstructed/2021/mc/out1/PDSPProd4a/18/80/06/50/PDSPProd4a_protoDUNE_sp_reco_stage1_p1GeV_35ms_sce_datadriven_18800650_2_20210414T012053Z.root | grep -i cluster

There are clusters with the module label “pandora”, but none with the label “lineclusterdc”, which is the default you can find in the tmp.txt file above. Now edit myana.fcl to say

#include "analyzer_job.fcl"

physics.analyzers.clusterana.module_type: "ClusterCounter3"
physics.analyzers.clusterana.ClusterModuleLabel: "pandora"

and run it again:

  lar -c myana.fcl root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/tape_backed/dunepro/protodune-sp/full-reconstructed/2021/mc/out1/PDSPProd4a/18/80/06/50/PDSPProd4a_protoDUNE_sp_reco_stage1_p1GeV_35ms_sce_datadriven_18800650_2_20210414T012053Z.root

Lots of information on job configuration via FHiCL is available at this link
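
As an aside, here is a hedged sketch of the typical art idiom a module like ClusterCounter3 uses to read clusters by the configured module label. It is a standalone illustration, not a copy of the module's code, and the function and parameter names are invented:

```cpp
// Sketch only: a standalone illustration of the usual art idiom.
#include "art/Framework/Principal/Event.h"
#include "lardataobj/RecoBase/Cluster.h"
#include <string>
#include <vector>

void countClusters(art::Event const& evt, std::string const& clusterModuleLabel)
{
  // Throws the "product not found" exception seen above if no
  // std::vector<recob::Cluster> with this label exists in the event.
  auto clusterHandle =
      evt.getValidHandle<std::vector<recob::Cluster>>(clusterModuleLabel);

  auto const nClusters = clusterHandle->size();
  (void)nClusters;  // count them, fill histograms, etc.
}
```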

Editing the example module and building it

YouTube Lecture Part 3: Now in session #1, edit ${MRB_SOURCE}/protoduneana/protoduneana/TutorialExamples/ClusterCounter3_module.cc

Add

#include "TH1F.h"

to the section with includes.

Add a private data member

TH1F *fTutorialHisto;

to the class. Create the histogram in the beginJob() method:

fTutorialHisto = tfs->make<TH1F>("TutorialHisto","NClus",100,0,500);

Fill the histo in the analyze() method, after the loop over clusters:

fTutorialHisto->Fill(fNClusters);
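
Putting the three edits together, the touched pieces of the module look roughly like this. This is a sketch only; the TFileService handle tfs and the member fNClusters are assumed to already exist in the module, as described above, and the TFileService include path differs between art releases:

```cpp
// Sketch only: just the pieces touched in this exercise, not the whole module.
#include "TH1F.h"

// Among the private data members of the ClusterCounter3 class:
//   TH1F *fTutorialHisto = nullptr;

// In beginJob(), using the art TFileService handle (named tfs in the existing code):
//   fTutorialHisto = tfs->make<TH1F>("TutorialHisto", "NClus", 100, 0, 500);

// In analyze(), after the loop over clusters that sets fNClusters:
//   fTutorialHisto->Fill(fNClusters);
```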

Go to session #2 and build it. The current working directory should be the build directory:

make install -j16

Note – this is the quicker way to re-build a product. The -j16 says to use 16 parallel processes, which matches the number of cores on a build node. The command

mrb i -j16

first does a cmake step – it looks through all the CMakeLists.txt files and processes them, making makefiles. If you didn’t edit a CMakeLists.txt file or add new modules or fcl files or other code, a simple make can save you some time in running the single-threaded cmake step.

Rerun your program in session #3 (the run session)

  lar -c myana.fcl root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/tape_backed/dunepro/protodune-sp/full-reconstructed/2021/mc/out1/PDSPProd4a/18/80/06/50/PDSPProd4a_protoDUNE_sp_reco_stage1_p1GeV_35ms_sce_datadriven_18800650_2_20210414T012053Z.root

Open the output file in a TBrowser:

  root tutorial_hist.root

and browse it to see your new histogram. You can also run on some data.

  lar -c myana.fcl -T dataoutputfile.root root://fndca1.fnal.gov/pnfs/fnal.gov/usr/dune/tape_backed/dunepro/protodune-sp/full-reconstructed/2020/detector/physics/PDSPProd4/00/00/53/87/np04_raw_run005387_0041_dl7_reco1_13832298_0_20201109T215042Z.root

The -T dataoutputfile.root changes the output filename for the TTrees and histograms to dataoutputfile.root so it doesn’t clobber the one you made for the MC output.

This iteration of course is rather slow – rebuilding and running on files in dCache. Far better, if you are just changing histogram binning, for example, is to use the output TTree. TTree::MakeClass is a very useful way to make a script that reads in the TBranches of a TTree on a file. The workflow in this tutorial is also useful in case you decide to add more content to the example TTree.

Run your program in the debugger

gdb and ddd

As of January 2025, the Fermilab license for the forge_tools ddt and map has expired and will not be renewed. To debug programs, we now have access to command-line gdb and ddd. Instructions for how to use both of these are available on the web. The version of gdb that comes with SL7 is quite old, but gdb gets set up with dunesw, so you get a version that can debug programs compiled with modern versions of gcc and clang. The GUI debugger ddd is installed both in the AL9 suite on the dunegpvms and in the default SL7 container. ddd uses gdb under the hood, but it provides convenience features for displaying data and setting breakpoints in the source window. There is an issue with assigning a pseudo-terminal in an SL7 container session that is fixed with a preloaded shared library.

  source /etc/profile.d/ddd.sh

defines an alias for ddd that sets LD_PRELOAD before running the debugger GUI. Some of the advice below on using the forge_tools debugger is also useful when running ddd and gdb at the command line, such as the need to find the appropriate version of the source and stepping through code to find bugs.

Old forge_tools ddt instructions

YouTube Lecture Part 4: In session #3 (the running session)

  setup forge_tools

  ddt `which lar` -c myana.fcl root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/tape_backed/dunepro/protodune-sp/full-reconstructed/2021/mc/out1/PDSPProd4a/18/80/06/50/PDSPProd4a_protoDUNE_sp_reco_stage1_p1GeV_35ms_sce_datadriven_18800650_2_20210414T012053Z.root

Click the “Run” button in the window that pops up. The which lar is needed because ddt cannot find executables in your path – you have to specify their locations explicitly.

In session #1, look at ClusterCounter3_module.cc in a text editor that lets you know what the line numbers are. Find the line number that fills your new histogram. In the debugger window, select the “Breakpoints” tab in the bottom window, and use the right mouse button (sorry Mac users – you may need to get an external mouse if you are using VNC. XQuartz emulates a three-button mouse, I believe). Make sure the “line” radio button is selected, and type ClusterCounter3_module.cc for the filename. Set the breakpoint line at the line you want, for the histogram filling or some other place you find interesting. Click Okay, and “Yes” to the dialog box that says ddt doesn’t know about the source code yet but will try to find it when it is loaded.

Click the right green arrow to start the program. Watch the program in the Input/Output section. When the breakpoint is hit, you can browse the stack, inspect values (sometimes – it is better when compiled with debug), set more breakpoints, etc.

You will need Session #1 to search for code that ddt cannot find. Shared object libraries contain information about the location of the source code when it was compiled. So debugging something you just compiled usually results in a shared object that knows the location of the source, but installed code in CVMFS points to locations on the Jenkins build nodes.

Looking for source code:

Your environment has lots of variables pointing at installed code. Look for variables like

  PROTODUNEANA_DIR

which points to a directory in CVMFS.

  ls $PROTODUNEANA_DIR/source

or $LARDATAOBJ_DIR/include

are good places to look for code.

Checking out and committing code to the git repository

For protoduneana and dunesw, this wiki page is quite good. LArSoft uses GitHub with a pull-request model. See

https://cdcvs.fnal.gov/redmine/projects/larsoft/wiki/Developing_With_LArSoft

https://cdcvs.fnal.gov/redmine/projects/larsoft/wiki/Working_with_GitHub

Some handy tools for working with search paths

Tom has written some scripts and made aliases for convenience – finding files in search paths like FHICL_FILE_PATH or FW_SEARCH_PATH, and searching within those files for content. Have a look on the dunegpvms at /exp/dune/data/users/trj/texttools. There is a list of aliases in aliases.txt that can be run in your login script (such as .profile). Put the perl scripts and tkdiff and newtkdiff somewhere in your PATH. A common place to put your favorite convenience scripts is ${HOME}/bin, but make sure to add that to your PATH. The scripts tkdiff and newtkdiff are open-source graphical diff tools that run using Tcl/Tk.

Common errors and recovery

Version mismatch between source code and installed products

When you perform an mrbsetenv or a mrbslp, sometimes you get a version mismatch. The most common reason for this is that you have set up an older version of the dependent products. Dunesw depends on protoduneana, which depends on dunecore, which depends on larsoft, which depends on art, ROOT, GEANT4, and many other products. This picture shows the software dependency tree for dunesw v09_72_01_d00. If the source code is newer than the installed products, the versions may mismatch. You can check out an older version of the source code (see the example above) with

  mrb g -t <tag> repository

Alternatively, if you have already checked out some code, you can switch to a different tag using your local clone of the git repository.

  cd $MRB_SOURCE/<product>
  git checkout <tag>

Try mrbsetenv again after checking out a consistent version.

Telling what version is the right one

The versions of dependent products for a product you’re building from source are listed in the file $MRB_SOURCE/<product>/ups/product_deps.

Sometimes you may want to know what the version number is of a product way down on the dependency tree so you can check out its source and edit it. Set up the product in a separate login session:

  source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh
  setup <product> $DUNELAR_VERSION -q $DUNELAR_QUALIFIER
  ups active

It usually is a good idea to pipe the output through grep to find a particular product version. You can get dependency information with

  ups depend <product> $DUNELAR_VERSION -q $DUNELAR_QUALIFIER

Note: not all dependencies of dependent products are listed by this command. If a product is already listed, it sometimes is not listed a second time, even if two products in the tree depend on it. Some products are listed multiple times.

There is a script in duneutil called dependency_tree.sh which makes graphical displays of dependency trees.

Inconsistent build directory

The directory $MRB_BUILDDIR contains copies of built code before it gets installed to localProducts. If you change versions of the source or delete things, sometimes the build directory will have clutter in it that has to be removed.

  mrb z

will delete the contents of $MRB_BUILDDIR and you will have to type mrbsetenv again.

  mrb zd

will also delete the contents of localProducts. This can be useful if you are removing code and want to make sure the installed version also has it gone.

Inconsistent environment

When you use UPS’s setup command, a lot of variables get defined. For each product, a variable called <product>_DIR is defined, which points to the location of the version and flavor of the product. UPS has a command “unsetup” which often succeeds in undoing what setup does, but it is not perfect. It is possible to get a polluted environment in which inconsistent versions of packages are set up and it is too hard to repair it one product at a time. Logging out and logging back in again, and setting up the session is often the best way to start fresh.

The setup command is the wrong one

If you have not sourced the DUNE software setup script

  source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh

you will find that the setup command used instead is one provided by the operating system; it requires root privilege to execute, and setup will ask you for the root password. If you get into this situation, rather than typing that password, press ctrl-c, source the setup_dune.sh script, and try again.

Compiler and linker warnings and errors

Common messages from the g++ compiler are about undeclared variables, uninitialized variables, mismatched parentheses or brackets, missing semicolons, checking unsigned variables to see if they are positive (yes, that’s a warning!) and other things. mrb is set up to tell g++ and clang to treat warnings as errors, so they will stop the build and you will have to fix them. Often messages about undeclared variables, or about methods that aren’t members of a class, result from having forgotten to include the appropriate header file.

The linker has fewer ways to fail than the compiler. Usually the error message is “Undefined symbol”. The compiler does not emit this message, so you always know this is in the link step. If you have an undefined symbol, one of three things may have gone wrong. 1) You may have mistyped it (usually this gets caught by the compiler because names are defined in header files). More likely, 2) You introduced a new dependency without updating the CMakeLists.txt file. Look in the CMakeLists.txt file that steers the building of the source code that has the problem. Look at other CMakeLists.txt files in other directories for examples of how to refer to libraries. `MODULE_LIBRARIES` are linked with modules in the ART_MAKE blocks, and `LIB_LIBRARIES` are linked when building non-module libraries (free-floating source code, for algorithms). 3) You are writing new code and just haven’t gotten around to finishing writing something you called.

Out of disk quota

Do not store data files on the app disk! Sometimes the app disk fills up nonetheless, and there is a quota of 100 GB per user on it. If you need more than that for several builds, you have some options. 1) Use /exp/dune/data/users/<username>. You have a 400 GB quota on this volume. The data volumes are slower than the app disk and can get even slower if many users are accessing them simultaneously or transferring large amounts of data to or from them. 2) Clean up some space on app. You may want to tar up an old release and store the tarball on the data volume or in dCache for later use.

Runtime errors

Segmentation faults: These do not throw errors that art can catch. They terminate the program immediately. Use the debugger to find out where they happened and why.

Exceptions that are caught. The ddt debugger has in its menu a set of standard breakpoints. You can instruct the debugger to stop any time an exception is thrown. A common exception is a vector accessed past its size using at(), but often these are hard to track down because they could be anywhere. Start your program with the debugger, but it is often a good idea to turn off the break-on-exception feature until after the geometry has been read in. Some of the XML parsing code throws a lot of exceptions that are later caught as part of its normal mode of operation, and if you hit a breakpoint on each of these and push the “go” button with your mouse each time, you could be there all day. Wait until the initialization is over, press “pause” and then turn on the breakpoints by exception.
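
A tiny standalone example of the at() exception mentioned above (the names are invented):

```cpp
#include <iostream>
#include <stdexcept>
#include <vector>

int main() {
  std::vector<int> v = {1, 2, 3};

  // operator[] does no bounds check: v[10] is undefined behavior (often a crash much later).
  // at() checks the bounds and throws std::out_of_range, which a debugger can break on.
  try {
    int x = v.at(10);
    std::cout << x << "\n";
  }
  catch (std::out_of_range const& e) {
    std::cout << "caught: " << e.what() << "\n";
  }
  return 0;
}
```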

If you miss, start the debugging session over again. Starting the session over is also a useful technique when you want to know what happened before a known error condition occurs. You may find yourself asking “How did it get into that condition?” Set a breakpoint that’s earlier in the execution and restart the session. Keep backing up – it’s kind of like running the program in reverse, but it’s very slow. Sometimes it’s the only way.

Print statements are also quite useful for rare error conditions. If a piece of code fails infrequently, based on the input data, sometimes a breakpoint is not very useful because most of the time it’s fine and you need to catch the program in the act of misbehaving. Putting in a low-tech print statement, sometimes with a uniquely-identifying string so you can grep the output, can let you put some logic in there to print only when things have gone bad, or even if you print on each iteration, you can just look at the last bit of printout before a crash.

No authentication/permission

You will almost always need to have a valid Kerberos ticket in your session. Accessing your home directory on the Fermilab machines requires it. Find your tickets with the command

  klist

By default, they last for 25 hours or so (a bit more than a day). You can refresh them for another 25 hours (up to one week’s worth of refreshes are allowed) with

  kinit -R

If you have a valid ticket on one machine and want to refresh tickets on another, you can

k5push <nodename>

The safest way to get a new ticket to a machine is to kinit on your local computer (like your laptop) and log in again, making sure to forward all tickets. In a pinch, you can run kinit on a dunegpvm and enter your Kerberos password, but this is discouraged as bad actors can (and have!) installed keyloggers on shared systems, and have stolen passwords. DO NOT KEEP PRIVATE, PERSONAL INFORMATION ON FERMILAB COMPUTERS! Things like bank account numbers, passwords, and social security numbers are definitely not to be stored on public, shared computers. Running kinit -R on a shared machine is fine.

You will need a grid proxy to submit jobs and access data in dCache via xrootd or ifdh.

  setup_fnal_security

will use your valid Kerberos ticket to generate the necessary certificates and proxies.

https://wiki.dunescience.org/wiki/Presentation_of_LArSoft_May_2021

Key Points


End of the larsoft basics lesson - Continue on your own to learn how to submit batch jobs

Overview

Teaching: 5 min
Exercises: 0 min
Questions
  • How do I learn more?

Objectives
  • Find out about more documentation

  • Find out how to ask for help from collaborators.

Batch jobs

Batch job submission has been split out into

Batch jobs

You can ask questions in the DUNE Slack channel #computing-training-basics.

You can continue on with these additional modules.


Key Points

  • There is more documentation!

  • People are here to help


Closing Remarks

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • Are you fortified with enough information to start your event analysis?

Objectives
  • Reflect on the days of learning.

Closing Remarks

One Day of Training

The instruction in this one-day version of the DUNE computing workshop was provided by several experienced physicists and is based on their years of experience.

Secure access to Fermilab computing systems and familiarity with data storage are key components.

Data management and event processing tools were described and modeled.

Art and LArSoft were introduced.

We are thankful for the instructors’ hard work, and for the numerous participants who joined.

Next Steps

Session recordings have been posted within each lesson once they have been processed.

We invite you to bookmark this training site and to revisit the content regularly.

Point a colleague to the material.

Long Term Support

You have made some excellent connections with computing experts, and we invite your continued dialog.

The DUNE Slack channel (#computing-training-basics) will remain available, and we encourage you to stay active in the discussion there.

See also the GitHub FAQ site for DUNE Computing.

Key Points

  • The DUNE Computing Consortium has presented this workshop so as to broaden the use of software tools used for analysis.