Workshop Welcome and Introduction
Overview
Teaching: 5 min
Exercises: 0 min
Questions
What should I expect in participating in this workshop?
Objectives
Introduce instructors and mentors.
Provide an overview of the schedule.
Highlight the support network provided by the Slack channel.
DUNE Computing Consortium
The goal of the DUNE Computing Consortium is to establish a global computing network that can handle the massive amounts of data DUNE will produce by distributing them across the grid. It coordinates all DUNE computing activities and provides new members with the documentation and training needed to become acquainted with DUNE-specific software and resources.
Coordinator: Heidi Schellman (Oregon State University)
Tutorial Instructors
Organizers:
- Claire David (York University / FNAL)
- David DeMuth (Valley City State University)
Lecturers (in order of appearance in the schedule):
- Michael Kirby (FNAL): storage spaces
- Steven Timm (FNAL): data management
- Tom Junk (FNAL): LArSoft
- Kenneth Herner (FNAL): grid and batch job submission
Mentors
- Amit Bashyal (ANL)
- Carlos Sarasty (University of Cincinnati)
Schedule
This workshop is a condensed half-day version of the training that normally spans three days.
Session Video
The introduction from the December 2021 training was captured on video, and is provided here.
Opening Slides
The slides for the introduction of this tutorial can be found here, or as a PDF on the Indico site.
Live Docs
Live (on-the-fly) documentation for the first and second portions of the tutorial is linked from the Indico event page.
Support
There will be live documents linked from Indico for each Zoom session. You can write questions there, anonymously or not, and experts will reply. The chat on Zoom can quickly saturate so this is a more convenient solution and proved very successful at the previous training. We will collect all questions and release a Q&A after the event.
You can join DUNE’s Slack workspace: dunescience.slack.com. We created a dedicated channel, computing_training_dec2021, for technical support; please join it there.
Key Points
This workshop is brought to you by the DUNE Computing Consortium.
The goal is to give you the computing basics needed to work on DUNE.
Storage Spaces
Overview
Teaching: 60 min
Exercises: 0 min
Questions
What are the types and roles of DUNE’s data volumes?
What are the commands and tools to handle data?
Objectives
Understanding the data volumes and their properties
Displaying volume information (total size, available size, mount point, device location)
Differentiating the commands to handle data between grid accessible and interactive volumes
Session Video
The storage spaces portion from the December 2021 training was captured on video, and is provided here.
Introduction
There are three types of storage volumes that you will encounter at Fermilab: local hard drives, network attached storage, and large-scale, distributed storage. Each has its own advantages and limitations, and knowing which one to use when isn’t always straightforward or obvious. But with some foresight, you can avoid the common pitfalls that have caught out other users.
Vocabulary
What is immutable? Describing a file as immutable means that once the file is written to the volume it cannot be modified; it can only be read, moved, or deleted. A volume that only supports immutable files is not a good choice for code or other files you want to change or edit often.
What is POSIX access? On interactive nodes, some volumes have POSIX access (Portable Operating System Interface Wikipedia) that allow users to directly read, write and modify using standard commands, e.g. vi, emacs, sed, or within the bash scripting language.
What is meant by ‘grid accessible’? Volumes that are only grid accessible require specific tool suites to enable access to files stored there or to copy files to the storage volume. This will be explained in the following sections.
Interactive POSIX storage volumes (General Purpose Virtual Machines)
The home area is similar to the user’s local hard drive, but network mounted
- access speed to the volume is very high, on top of full POSIX access
- example: /nashome/k/kirby
- NOT a safe place to store certificates and tickets
- not accessible from grid worker nodes
- not for code development (size of less than 2 GB)
- you need a valid Kerberos ticket in order to access files in your home area
- periodic snapshots are taken so you can recover deleted files: /nashome/.snapshot
Locally mounted volumes are local physical disks, mounted directly on the interactive node
- mounted on the machine with direct links to the /dev/ location
- used as temporary storage for infrastructure services (e.g. /var, /tmp)
- can be used to store certificates and tickets (saved there by default w/ owner-read permission and other permissions disabled)
- usually very small and should not be used to store data files or for code development
- Data on these volumes is not backed up.
Network Attached Storage (NAS) behaves similarly to a locally mounted volume
- functions similarly to services such as Dropbox or OneDrive
- fast and stable access rate
- volumes available only on a limited number of computers or servers
- not available on the larger grid computing sites
- /dune/app has periodic snapshots in /dune/app/.snapshot, but /dune/data and /dune/data2 do not.
Grid-accessible storage volumes
At Fermilab, an instance of dCache+Enstore is used for large-scale, distributed storage with capacity for more than 100 PB of storage and O(10000) connections. Whenever possible, these storage elements should be accessed over xrootd (see next section) as the mount points on interactive nodes are slow and unstable. Here are the different dCache volumes:
Persistent dCache: DO NOT USE THIS VOLUME TO DISTRIBUTE CODE TARBALLS! The data in these files are actively available for reads at any time and will not be removed until manually deleted by the user. Quotas will be established in the near future.
Scratch dCache: large volume shared across all experiments. When a new file is written to scratch space, older files are removed in order to make room for the newer file. Removal is based on Least Recently Used policy.
Resilient dCache: handles custom user code for grid jobs, often in the form of a tarball. It is inappropriate to store any other files here. This volume is deprecated; use RCDS via CVMFS instead.
Tape-backed dCache: disk-based storage areas that have their contents mirrored to permanent storage on Enstore tape.
Files are not always available for immediate read from disk and may need to be ‘staged’ from tape first, so checking file status before access is critical.
Summary on storage spaces
Full documentation: Understanding Storage Volumes
In the following table, <exp> stands for the experiment (uboone, nova, dune, etc…)
Volume | Quota/Space | Retention Policy | Tape Backed? | Retention Lifetime on disk | Use for | Path | Grid Accessible |
---|---|---|---|---|---|---|---|
Persistent dCache | No/~100 TB/exp | Managed by Experiment | No | Until manually deleted | immutable files w/ long lifetime; NO CODE TARBALLS! | /pnfs/<exp>/persistent | Yes |
Scratch dCache | No/no limit | LRU eviction - least recently used file deleted | No | Varies, ~30 days (NOT guaranteed) | immutable files w/ short lifetime | /pnfs/<exp>/scratch | Yes |
Resilient dCache | No/no limit | Periodic eviction if file not accessed | No | Approx 30 days (your experiment may have an active clean-up policy) | input tarballs with custom code for grid jobs (do NOT use for grid job outputs) | /pnfs/<exp>/resilient | Yes |
Tape-backed dCache | No/O(10) PB | LRU eviction (from disk) | Yes | Approx 30 days | Long-term archive | /pnfs/dune/… | Yes |
NAS Data | Yes (~1 TB)/ 32+30 TB total | Managed by Experiment | No | Until manually deleted | Storing final analysis samples | /dune/data | No |
NAS App | Yes (~100 GB)/ ~15 TB total | Managed by Experiment | No | Until manually deleted | Storing and compiling software | /dune/app | No |
Home Area (NFS mount) | Yes (~10 GB) | Centrally Managed by CCD | No | Until manually deleted | Storing global environment scripts (All FNAL Exp) | /nashome/<letter>/<uid> | No |
Commands and tools
This section will teach you the main tools and commands to display storage information and access data.
The df command
Finding out what types of volumes are available on a node can be done with the command df. The -h option gives human-readable output. It lists information about each volume (total size, available size, mount point, device location).
df -h
Exercise 1
From the output of the df -h command, identify:
- the home area
- the NAS storage spaces
- the different dCache volumes
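For example, a quick way to look at just the volumes discussed above (a sketch, assuming the standard DUNE interactive-node mount points):
df -h /nashome /dune/app /dune/data /dune/data2 /pnfs/dune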
ifdh
Another useful data handling command you will soon come across is ifdh. This stands for Intensity Frontier Data Handling. It is a tool suite that facilitates selecting the appropriate data transfer method from many possibilities while protecting shared resources from overload. You may see ifdhc, where c refers to client.
Here is an example of copying a file. Refer to the Mission Setup for setting up the DUNETPC_VERSION.
source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh
setup dunetpc $DUNETPC_VERSION -q e19:prof #use DUNETPC_VERSION v09_22_02
setup_fnal_security
cache_state.py /pnfs/dune/tape_backed/dunepro/physics/full-reconstructed/2019/mc/out1/PDSPProd2/22/60/37/10/PDSPProd2_protoDUNE_sp_reco_35ms_sce_off_23473772_0_452d9f89-a2a1-4680-ab72-853a3261da5d.root
ifdh cp root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/tape_backed/dunepro/physics/full-reconstructed/2019/mc/out1/PDSPProd2/22/60/37/10/PDSPProd2_protoDUNE_sp_reco_35ms_sce_off_23473772_0_452d9f89-a2a1-4680-ab72-853a3261da5d.root /dev/null
Resource: ifdh commands
Exercise 2
Using the ifdh command, complete the following tasks:
- create a directory in your dCache scratch area (/pnfs/dune/scratch/users/${USER}/) called “DUNE_tutorial_Dec2021”
- copy your ~/.bashrc file to that directory
- copy the .bashrc file from your DUNE_tutorial_Dec2021 dCache scratch directory to /dev/null
- remove the directory DUNE_tutorial_Dec2021 using “ifdh rmdir /pnfs/dune/scratch/users/${USER}/DUNE_tutorial_Dec2021”
Note: if the destination for an ifdh cp command is a directory instead of a filename with full path, you have to add the “-D” option to the command line. Also, for a directory to be deleted, it must be empty.
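A possible command sequence for this exercise (a sketch; note the -D option where the destination is a directory, and that the file must be removed before the directory can be):
ifdh mkdir /pnfs/dune/scratch/users/${USER}/DUNE_tutorial_Dec2021
ifdh cp -D ~/.bashrc /pnfs/dune/scratch/users/${USER}/DUNE_tutorial_Dec2021
ifdh cp /pnfs/dune/scratch/users/${USER}/DUNE_tutorial_Dec2021/.bashrc /dev/null
ifdh rm /pnfs/dune/scratch/users/${USER}/DUNE_tutorial_Dec2021/.bashrc
ifdh rmdir /pnfs/dune/scratch/users/${USER}/DUNE_tutorial_Dec2021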
xrootd
The eXtended ROOT daemon (XRootD) is a software framework designed for accessing data from various architectures in a completely scalable way (in size and performance).
XRootD is most suitable for read-only data access. XRootD Man pages
Issue the following commands and try to understand how the first command enables completing the parameters for the second command.
pnfs2xrootd /pnfs/dune/scratch/users/${USER}/
xrdfs root://fndca1.fnal.gov:1094/ ls /pnfs/fnal.gov/usr/dune/scratch/users/${USER}/
Let’s practice
Exercise 3
Using a combination of the ifdh and xrootd commands discussed previously:
- Use ifdh locateFile to find the directory for this file: PDSPProd4a_protoDUNE_sp_reco_stage1_p1GeV_35ms_sce_off_43352322_0_20210427T162252Z.root
- Use pnfs2xrootd to get the xrootd URI for that file.
- Use xrdcp to copy that file to /dev/null.
- Using xrdfs and the ls option, count the number of files in the same directory as PDSPProd4a_protoDUNE_sp_reco_stage1_p1GeV_35ms_sce_off_43352322_0_20210427T162252Z.root.
Note that redirecting the standard output of a command into wc -l will count the number of lines in the output text, e.g. ls -alrth ~/ | wc -l
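A possible sketch for the last three steps, assuming you have already found the file’s directory with ifdh locateFile (the /pnfs paths below are placeholders for the directory it reports):
F=PDSPProd4a_protoDUNE_sp_reco_stage1_p1GeV_35ms_sce_off_43352322_0_20210427T162252Z.root
DIR=/pnfs/dune/...    # substitute the directory reported by ifdh locateFile
xrdcp $(pnfs2xrootd ${DIR}/${F}) /dev/null
xrdfs root://fndca1.fnal.gov:1094/ ls /pnfs/fnal.gov/usr/dune/... | wc -l    # substitute the same directory in its /pnfs/fnal.gov/usr/dune form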
Useful links to bookmark
- ifdh commands (redmine)
- Understanding storage volumes (redmine)
- How DUNE storage works: pdf
Key Points
Home directories are centrally managed by Computing Division and meant to store setup scripts and text files.
Home directories are NOT for storage of certificates or tokens.
Network attached storage (NAS) /dune/app is primarily for code development.
The NAS /dune/data is for storing ntuples and small datasets.
dCache volumes (tape, resilient, scratch, persistent) offer large storage with various retention lifetime.
The tool suites ifdh and XRootD allow access to data with the appropriate transfer method and in a scalable way.
Data Management
Overview
Teaching: 45 min
Exercises: 0 min
Questions
What are the data management tools and software for DUNE?
How are different software versions handled?
What are the best data management practices?
Objectives
Learn how to access data from DUNE Data Catalog
Understand the roles of the tools UPS, mrb and CVMFS
Session Video
The data management portion from the December 2021 training was captured on video, and is provided here.
Introduction
DUNE data is stored around the world and the storage elements are not always organized in a way that they can be easily inspected. For this purpose we use the SAM web client.
What is SAM?
Sequential Access via Metadata (SAM) is a data catalog originally designed for the D0 and CDF high energy physics experiments at Fermilab. It is now used by most of the Intensity Frontier experiments at Fermilab. The most important objects cataloged in SAM are individual files and collections of files called datasets.
Data files themselves are not stored in SAM, their metadata and physical locations are, and via metadata, you can search for and locate collections of files. SAM also provides mechanisms for initiating and tracking file delivery through projects.
This lecture will show you how to access data files that have been defined to the DUNE Data Catalog. Execute the following commands after logging in to the DUNE interactive node, and sourcing the main dune setups.
What is Rucio?
Rucio is the next-generation Data Replica service and is part of DUNE’s new Distributed Data Management (DDM) system that is currently in deployment. Rucio has two functions:
- A rule-based system to get files to Rucio Storage Elements (RSEs) around the world and keep them there for the lifetime of the file.
- To return the “nearest” replica of any data file for use either interactively or in batch jobs. It is expected that most DUNE users will not use direct Rucio commands regularly, but rather wrapper scripts that call them indirectly.
As of the date of this Dec 2021 tutorial:
- The Rucio client is not installed as a part of the standard DUNE client software
- Most DUNE users are not yet enabled to use it. But when we do, some of the commands will look like this:
rucio list-file-replicas protodune-sp:np04_raw_run005801_0001_dl1.root
rucio download protodune-sp:np04_raw_run005801_0001_dl1.root
rucio list-rses
Back to SAM
General considerations
SAM was designed to ensure that large scale data-processing was done completely and accurately, which leads to some features not always present in a generic catalog but very desirable if one wishes to maintain high standards of reproducibility and documentation in data analysis.
For example, at the time of the original design, the main storage medium was 8mm tapes using consumer-grade drives. Drive and tape failure rates were > 1%. Several SAM design concepts, notably luminosity blocks and parentage tracking, were introduced to allow accurate tracking of files and their associated normalization in a high error-rate environment. The main design goals were:
- Description of the contents of data collections to allow later retrieval
- Tracking of object and collection parentage and description of processing transformations to document the full provenance of any data object and ensure accurate normalization
- Grouping of objects and collection into larger “datasets” based on their characteristics
- Storing physical location of objects
- Tracking of the processing of collections to allow reprocessing on failure and avoid double processing.
- Methods (“projects”) for delivering and tracking collections in multi-processing jobs
- Preservation of data about processing/storage for debugging/reporting
The first 3 goals relate to content and characteristics while the last 3 relate to data storage and processing tools.
Specifics
-
The current SAM implementation uses the file as the basic unit of information. Metadata is associated with the file name. Filenames must be unique in the system. This prevents duplication of data in a sample, as a second copy cannot be cataloged. This makes renaming a file very unwise. A very common practice is to include some of the metadata in the filename, both to make it easier to identify and to ensure uniqueness.
-
Metadata for a file can include file locations but does not have to. A file can have no location at all, or many. When you move or remove a file with an associated SAM location, you need to update the location information.
-
SAM does not move files. It provides location information for a process to use in streaming or copying a file using its own methods. Temporary locations (such as on a grid node) need not be reported to SAM. Permanently storing or removing files requires both moving/removing the file itself and updating its metadata to reflect that location; this is generally left up to special packages such as the Fermilab FTS (File Transfer Service) and SAM projects.
-
Files which are stored on disk or tape are expected to have appropriate file sizes and checksums. One can have duplicate instances of a file in different locations, but they must all be identical. If one reprocesses an input file, the output may well be subtly different (for example dates stored in the file itself can change the checksum). SAM should force you to choose which version, old or new, is acceptable. It will not let you catalog both with the same filename. As a result, if you get a named file out of SAM, you can be reasonably certain you got the right copy.
-
Files with duplicate content but different names can be problematic. The reprocessed file mentioned in part 4, if renamed, could cause significant problems if it were allowed into the data sample along with the originals, as a job processing all files might get both copies. This is one of the major reasons for the checksums and unique filenames. There is a temptation to put, for example, timestamps in filenames to generate unique names, but that removes a protection against duplication.
-
Files can have parents and children and, effectively, birth certificates that can tell you how they were made. An example would be a set of raw data files RAWnnn processed with code X to produce a single reconstructed file RECO. One can tell SAM that RAWnnn are the parents of RECO processed with version x of code X. If one later finds another RAWnnn file that was missed in processing, SAM can tell you it has not been processed yet with X (i.e., it has no children associated with version x of X) and you can then choose to process that file. This use case often occurs when a production job fails without reporting back or succeeds but the copy back or catalog action fails. Note: The D0 experiment required that all official processing going into SAM be done with tagged releases and fixed parameter sets to increase reproducibility and the tags for that information were included in the metadata. Calibration databases were harder to timestamp so some variability was still possible if calibrations were updated.
-
SAM supports several types of data description fields:
- Values are standard across all implementations, like run_type, file_size, …
- Parameters are defined by the experiment, for example MC.Genieversion
Values are common to almost all HEP experiments and are optimized for efficient queries. SAM also allows definition of “parameters” (by administrators) as they are needed. This allows the schema to be modified easily as needs arise.
-
Metadata can also contain “spill” or luminosity block information that allows a file to point to specific data taking periods with smaller granularity than a run or subrun. When files are merged, this spill information is also merged.
-
SAM currently does not contain a means of easily determining which file a given event is in. If a daq system is writing multiple streams, an event from a given subrun could be in any stream. Adding an event database would be a useful feature.
All of these features are intended to assure that your data are well described and can be found. As SAM stores full location information, this means any SAM-visible location. In addition, if parentage information is provided, you can determine and reproduce the full provenance of any file.
Datasets and projects
Datasets
In addition to the files themselves, SAM allows you to define datasets.
A SAM dataset is not a fixed list of files but a query against the SAM database. An example would be “data_tier reconstructed and run_number 2001 and version v10” which would be all files from run 2001 that are reconstructed data produced by version v10. This dataset is dynamic. If one finds a missing file from run 2001 and reconstructs it with v10, the dataset will grow. There are also dataset snapshots that are derived from datasets and capture the exact files in the dataset when the snapshot was made. Note: most other data catalogs assume a “dataset” is a fixed list of files. This is a “snapshot” in SAM.
Projects
SAM also supports access tracking mechanisms called projects and consumers. These are generally implemented for you by grid processing scripts. Your job is to choose a dataset and then ask the processing system to launch a project for that dataset.
A project is effectively a processing campaign across a dataset which is owned by the SAM system. At launch a snapshot is generated and then the files in the snapshot are delivered to a set of consumers. The project maintains an internal record of the status of the files and consumers. Each grid process can instantiate a consumer which is attached to the project. Those consumers then request “files” from the project and, when done processing, tell the project of their status.
The original SAM implementation actually delivered the files to local hard drives. Modern SAM delivers the location information and expects the consumer to find the optimal delivery method. This is a pull model, where the consuming process requests the next file rather than having the file assigned to it. This makes the system more robust on distributed systems.
See running projects here.
Accessing the database in read mode
Checking the database does not require special privileges but storing files and running projects modifies the database and requires authentication to the right experimental group. kx509
authentication and membership in the experiment VO are needed.
Administrative actions like adding new values are restricted to a small set of superusers for each experiment.
Suggestions for configuring SAM (for admins)
First of all, it really is nice to have filenames and dataset names that tell you what’s in the box, although not required. The D0 and MINERvA conventions have been to use “_” underscores between useful key strings. As a result, D0 and MINERvA tried not to use “_” in metadata entries to allow cleaner parsing. “-“ is used if needed in the metadata.
D0 also appended processing information to filenames as they moved through the system to assure that files run through different sequences had unique identifiers.
Example: A Monte Carlo simulation file generated with version v3 and then reconstructed with v5 might look like
SIM_MC_020000_0000_simv3.root would be a parent of RECO_MC_020000_0000_simv3_recov5.root
Data files are all children of the raw data while simulation files sometimes have more complicated ancestry, with both unique generated events and overlay events from data as parents.
Setting up SAM metadata (For admins)
This needs to be done once, and very carefully, early in the experiment. It can grow but thinking hard at the beginning saves a lot of pain later.
You need to define data_tiers. These represent the different types of data that you produced through your processing chain. Examples would be raw, pedsup, calibrated, reconstructed, thumbnail, mc-generated, mc-geant, mc-overlaid.
run_type can be used to support multiple DAQ instances.
data_stream is often used for trigger subsamples that you may wish to split data into (for example pedestal vs data runs).
Generally, you want to store data from a given data_tier with other data from that tier to facilitate fast sequential access.
Applications
It is useful, but not required to also define applications which are triads of “appfamily”, “appname” and “version”. Those are used to figure out what changed X to Y. There are also places to store the machine the application ran on and the start and end time for the job.
The query:
samweb list-files "data_tier raw and not isparentof: (data_tier reconstructed and appname reco and version 7)"
Should, in principle, list raw data files not yet processed by version 7 of reco to produce files of tier reconstructed. You would use this to find lost files in your reconstruction after a power outage.
It is good practice to also store the name of the head application configuration file for processing but this does not have a standard “value.”
samweb client
samweb is the command-line tool and Python API that allows queries of the SAM metadata, creation of datasets, and tools to track and deliver information to batch jobs.
samweb can be acquired from ups via:
source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh
setup dunetpc $DUNETPC_VERSION -q e19:prof #use DUNETPC_VERSION v09_22_02
setup_fnal_security
samweb allows you to select on a lot of parameters, which are documented here:
- dune-data.fnal.gov lists some official dataset definitions
This exercise will start you accessing data files that have been defined to the DUNE Data Catalog.
Example metadata from DUNE
Here are some examples of querying sam to get file information
$ samweb get-metadata np04_raw_run005141_0015_dl10_reco_12736632_0_20181028T182951.root --json
{
"file_name": "np04_raw_run005141_0015_dl10_reco_12736632_0_20181028T182951.root",
"file_id": 7352771,
"create_date": "2018-10-29T14:59:42+00:00",
"user": "dunepro",
"update_date": "2018-11-28T17:07:30+00:00",
"update_user": "schellma",
"file_size": 14264091111,
"checksum": [
"enstore:1390300706",
"adler32:e8bf4e23"
],
"content_status": "good",
"file_type": "detector",
"file_format": "artroot",
"data_tier": "full-reconstructed",
"application": {
"family": "art",
"name": "reco",
"version": "v07_08_00_03"
},
"event_count": 108,
"first_event": 21391,
"last_event": 22802,
"start_time": "2018-10-28T17:34:58+00:00",
"end_time": "2018-10-29T14:55:42+00:00",
"data_stream": "physics",
"beam.momentum": 7.0,
"data_quality.online_good_run_list": 1,
"detector.hv_value": 180,
"DUNE_data.acCouple": 0,
"DUNE_data.calibpulsemode": 0,
"DUNE_data.DAQConfigName": "np04_WibsReal_Ssps_BeamTrig_00021",
"DUNE_data.detector_config": "cob2_rce01:cob2_rce02:cob2.. 4 more lines of text",
"DUNE_data.febaselineHigh": 2,
"DUNE_data.fegain": 2,
"DUNE_data.feleak10x": 0,
"DUNE_data.feleakHigh": 1,
"DUNE_data.feshapingtime": 2,
"DUNE_data.inconsistent_hw_config": 0,
"DUNE_data.is_fake_data": 0,
"runs": [
[
5141,
1,
"protodune-sp"
]
],
"parents": [
{
"file_name": "np04_raw_run005141_0015_dl10.root",
"file_id": 6607417
}
]
}
To find the files produced from this file (children):
$ samweb file-lineage children np04_raw_run005141_0015_dl10.root
np04_raw_run005141_0015_dl10_reco_12736632_0_20181028T182951.root
To find files used to produce this file (parents):
$ samweb file-lineage parents np04_raw_run005141_0015_dl10_reco_12736632_0_20181028T182951.root
np04_raw_run005141_0015_dl10.root
Locating data
If you know the full filename and want to locate it, e.g.:
samweb locate-file np04_raw_run005758_0001_dl3.root
This will give you output that looks like:
rucio:protodune-sp
cern-eos:/eos/experiment/neutplatform/protodune/rawdata/np04/detector/None/raw/07/42/28/49
castor:/neutplatform/protodune/rawdata/np04/detector/None/raw/07/42/28/49
enstore:/pnfs/dune/tape_backed/dunepro/protodune/np04/beam/detector/None/raw/07/42/28/49(597@vr0337m8)
which are the locations of the file on disk and tape. But if we want to copy the file from tape to our local disk, then we need the file access URIs:
$ samweb get-file-access-url np04_raw_run005141_0015_dl10_reco_12736632_0_20181028T182951.root
gsiftp://eospublicftp.cern.ch/eos/experiment/neutplatform/protodune/rawdata/np04/output/detector/full-reconstructed/07/35/27/71/np04_raw_run005141_0015_dl10_reco_12736632_0_20181028T182951.root
gsiftp://fndca1.fnal.gov:2811/pnfs/fnal.gov/usr/dune/tape_backed/dunepro/protodune/np04/beam/output/detector/full-reconstructed/07/35/27/71/np04_raw_run005141_0015_dl10_reco_12736632_0_20181028T182951.root
Here we have shown the gridftp transfer URIs, but in general it is better to stream data with xrootd, so you should add “--schema=root” to the samweb command. This is shown in the next section.
Accessing data for use in your analysis
To access data without copying it, XRootD is the tool to use. However, it will only work if the file is staged to disk. You can stream files worldwide if you have a DUNE VO certificate, as described in the preparation part of this tutorial. Find the xrootd URI via:
samweb get-file-access-url np04_raw_run005141_0001_dl7.root --schema=root
root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/tape_backed/dunepro/protodune/np04/beam/detector/None/raw/06/60/59/05/np04_raw_run005141_0001_dl7.root
root://castorpublic.cern.ch//castor/cern.ch/neutplatform/protodune/rawdata/np04/detector/None/raw/06/60/59/05/np04_raw_run005141_0001_dl7.root
root://eospublic.cern.ch//eos/experiment/neutplatform/protodune/rawdata/np04/detector/None/raw/06/60/59/05/np04_raw_run005141_0001_dl7.root
You can restrict the returned location with the --location argument (enstore, castor, cern-eos):
samweb get-file-access-url np04_raw_run005141_0001_dl7.root --schema=root --location=enstore
root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/tape_backed/dunepro/protodune/np04/beam/detector/None/raw/06/60/59/05/np04_raw_run005141_0001_dl7.root
Querying SAM Metadata catalog for files
To list raw data files for a given run:
samweb list-files "run_number 5758 and run_type protodune-sp and data_tier raw"
np04_raw_run005758_0001_dl3.root
np04_raw_run005758_0002_dl2.root
...
np04_raw_run005758_0065_dl10.root
np04_raw_run005758_0065_dl4.root
What about a reconstructed version?
samweb list-files "run_number 5758 and run_type protodune-sp and data_tier full-reconstructed and version (v07_08_00_03,v07_08_00_04)"
np04_raw_run005758_0053_dl7_reco_12891068_0_20181101T222620.root
np04_raw_run005758_0025_dl11_reco_12769309_0_20181101T213029.root
np04_raw_run005758_0053_dl2_reco_12891066_0_20181101T222620.root
...
np04_raw_run005758_0061_dl8_reco_14670148_0_20190105T175536.root
np04_raw_run005758_0044_dl6_reco_14669100_0_20190105T172046.root
The above is truncated output to show us the one reconstructed file that is the child of the raw data file above.
We also group reconstruction versions into campaigns, such as PDSPProd4:
samweb list-files "run_number 5141 and run_type protodune-sp and data_tier full-reconstructed and DUNE.campaign PDSPProd4"
This gives more recent files, like:
np04_raw_run005141_0009_dl1_reco1_18126423_0_20210318T102429Z.root
To see the total number of files that match a certain query expression, add the --summary option to samweb list-files.
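For example, reusing the raw-data query from above:
samweb list-files --summary "run_number 5758 and run_type protodune-sp and data_tier raw"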
Creating a dataset
You can make your own samweb dataset definitions: First, make certain a definition does not already exist that satisfies your needs by checking the official pages above.
Then check to see what you will get:
samweb list-files --summary "data_tier full-reconstructed and DUNE.campaign PDSPProd4 and data_stream physics and run_type protodune-sp and detector.hv_value 180"
samweb create-definition $USER-PDSPProd4_good_physics "data_tier full-reconstructed and DUNE.campaign PDSPProd4 and data_stream physics and run_type protodune-sp and detector.hv_value 180"
Note that the username appears in the definition name: to prevent users from getting confused with official samples, your user name is required in the definition name.
Check to see if a file is on tape or disk
sam_validate_dataset --locality --file np04_raw_run005141_0015_dl10.root --location=/pnfs/ --stage_status
Staging status for: file np04_raw_run005141_0015_dl10.root
Total Files: 1
Tapes spanned: 1
Percent files on disk: 0%
Percent bytes online DCache: 0%
locality counts:
ONLINE: 0
NEARLINE: 1
NEARLINE_size: 8276312581
Oops, this one is not on disk: ONLINE is 0 and NEARLINE is 1, so the file is only on tape. A file is available on disk when its locality is ONLINE or ONLINE_AND_NEARLINE.
sam_validate_dataset --locality --name=schellma-1GeVMC-test --stage_status --location=/pnfs/
Staging status for: defname:schellma-1GeVMC-test
Total Files: 140
Tapes spanned: 10
Percent files on disk: 100%
Percent bytes online DCache: 100%
locality counts:
ONLINE: 0
ONLINE_AND_NEARLINE: 140
ONLINE_AND_NEARLINE_size: 270720752891
No ONLINE_AND_NEARLINE means you need to prestage that file. Unfortunately, prestaging requires a definition.
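A minimal prestaging sketch, assuming your samweb client provides the prestage-dataset subcommand (substitute your own definition name):
samweb prestage-dataset --defname=schellma-1GeVMC-test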
The official Protodune dataset definitions are here.
Resource: Using the SAM Data Catalog.
Exercise 1
- Use the --location argument to show the path of the file above on either enstore, castor, or cern-eos.
- Use get-metadata to get SAM metadata for this file. Note that --json gives the output in json format.
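A possible solution, reusing commands shown earlier in this section (use whichever file you were inspecting):
samweb get-file-access-url np04_raw_run005141_0001_dl7.root --schema=root --location=cern-eos
samweb get-metadata np04_raw_run005141_0001_dl7.root --json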
When we are analyzing large numbers of files in a group of batch jobs, we use a SAM snapshot to describe the full set of files that we are going to analyze and create a SAM Project based on that. Each job will then come up and ask SAM to give it the next file in the list. SAM has some capability to grab the nearest copy of the file. For instance if you are running at CERN and analyzing this file it will automatically take it from the CERN storage space EOS.
Exercise 2
- use the samweb describe-definition command to see the dimensions of the dataset definition PDSPProd4_MC_1GeV_reco1_sce_datadriven_v1
- use the samweb list-definition-files command with the --summary option to see the total size of PDSPProd4_MC_1GeV_reco1_sce_datadriven_v1
- use the samweb take-snapshot command to make a snapshot of PDSPProd4_MC_1GeV_reco1_sce_datadriven_v1
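A possible command sequence, using the samweb subcommands named in the exercise:
samweb describe-definition PDSPProd4_MC_1GeV_reco1_sce_datadriven_v1
samweb list-definition-files --summary PDSPProd4_MC_1GeV_reco1_sce_datadriven_v1
samweb take-snapshot PDSPProd4_MC_1GeV_reco1_sce_datadriven_v1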
What is UPS and why do we need it?
An important requirement for making valid physics results is computational reproducibility. You need to be able to repeat the same calculations on the data and MC and get the same answers every time. You may be asked to produce a slightly different version of a plot for example, and the data that goes into it has to be the same every time you run the program.
This requirement is in tension with a rapidly-developing software environment, where many collaborators are constantly improving software and adding new features. We therefore require strict version control; the workflows must be stable and not constantly changing due to updates.
DUNE must provide installed binaries and associated files for every version of the software that anyone could be using. Users must then specify which version they want to run before they run it. All software dependencies must be set up with consistent versions in order for the whole stack to run and run reproducibly.
The Unix Product Setup (UPS) is a tool to handle the software product setup operation.
UPS is set up when you setup DUNE:
source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh
This sourcing defines the UPS setup command. Getting DUNE’s LArSoft-based software is then done through:
setup dunetpc $DUNETPC_VERSION -q e19:prof
- dunetpc: product name
- $DUNETPC_VERSION: version tag
- e19:prof: “qualifiers”
Qualifiers are separated with colons and may be specified in any order. The “e19” qualifier refers to a specific version of the gcc compiler suite, and “prof” means select the installed product that has been compiled with optimizations turned on. An alternative to “prof” is the “debug” qualifier. All builds of LArSoft and dunetpc are compiled with debug symbols turned on, but the “debug” builds are made with optimizations turned off. Both kinds of software can be debugged, but it is easier to debug the debug builds (code executes in the proper order and variables aren’t optimized away, so they can be inspected).
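For example, to pick up a debug build instead and confirm what is set up (a sketch, assuming a debug build exists for your chosen version):
setup dunetpc $DUNETPC_VERSION -q e19:debug
ups active | grep dunetpc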
Another specifier of a product install is the “flavor”. This refers to the operating system the program was compiled for. These days we only support SL7, but in the past we used to also support SL6 and various versions of macOS. The flavor is automatically selected when you set up a product using setup (unless you override it, which is usually a bad idea). Some products are “unflavored” because they do not contain anything that depends on the operating system. Examples are products that only contain data files or text files.
Setting up a UPS product defines many environment variables. Most products have an environment variable of the form <productname>_DIR, where <productname> is the name of the UPS product in all capital letters. This is the top-level directory and can be used when searching for installed source code or fcl files, for example. <productname>_FQ_DIR is the one that specifies a particular qualifier and flavor.
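For example, after setting up dunetpc you can inspect these variables:
echo $DUNETPC_DIR
echo $DUNETPC_FQ_DIR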
Exercise 3
- show all the versions of dunetpc that are currently available by using the “ups list -aK+ dunetpc” command
- pick one version and substitute that for DUNETPC_VERSION above and set up dunetpc
Many products modify the following search path variables, prepending their pieces when set up. These search paths are needed by art jobs.
PATH: colon-separated list of directories the shell uses when searching for programs to execute when you type their names at the command line. The command “which” tells you which version of a program is found first in the PATH search list. Example:
which lar
will tell you where the lar command you would execute is if you were to type “lar” at the command prompt.
The other paths are needed by art for finding plug-in libraries, fcl files, and other components, like gdml files.
CET_PLUGIN_PATH
LD_LIBRARY_PATH
FHICL_FILE_PATH
FW_SEARCH_PATH
Also the PYTHONPATH describes where Python modules will be loaded from.
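A quick way to inspect any of these search paths one entry per line, here using FHICL_FILE_PATH as an example:
echo $FHICL_FILE_PATH | tr ':' '\n'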
UPS basic commands
Command | Action |
---|---|
ups list -aK+ dunetpc | List the versions and flavors of dunetpc that exist on this node |
ups active | Displays what has been setup |
ups depend dunetpc v08_57_00 -q e19:prof:py2 | Displays the dependencies for this version of dunetpc |
Exercise 4
- show all the dependencies of dunetpc by using “ups depend dunetpc $DUNETPC_VERSION -q e19:prof”
UPS Documentation Links
mrb
What is mrb and why do we need it?
Early on, the LArSoft team chose git and cmake as the software version manager and the build language, respectively, to keep up with industry standards and to take advantage of their new features. When we clone a git repository to a local copy and check out the code, we end up building it all. We would like LArSoft and DUNE code to be more modular, or at least the builds should reflect some of the inherent modularity of the code.
Ideally, we would like to only have to recompile a fraction of the software stack when we make a change. The granularity of the build in LArSoft and other art-based projects is the repository. So LArSoft and DUNE have divided code up into multiple repositories (DUNE ought to divide more than it has, but there are a few repositories already with different purposes). Sometimes one needs to modify code in multiple repositories at the same time for a particular project. This is where mrb comes in.
mrb stands for “multi-repository build”. mrb has features for cloning git repositories, setting up build and local products environments, building code, and checking for consistency (i.e. there are not two modules with the same name or two fcl files with the same name). mrb builds UPS products – when it installs the built code into the localProducts directory, it also makes the necessary UPS table files and .version directories. mrb also has a tool for making a tarball of a build product for distribution to the grid. The software build example later in this tutorial exercises some of the features of mrb.
Command | Action |
---|---|
mrb --help | prints list of all commands with brief descriptions |
mrb \<command\> --help | displays help for that command |
mrb gitCheckout | clone a repository into working area |
mrbsetenv | set up build environment |
mrb build -jN | builds local code with N cores |
mrb b -jN | same as above |
mrb install -jN | installs local code with N cores |
mrb i -jN | same as above (this will do a build also) |
mrbslp | set up all products in localProducts… |
mrb z | get rid of everything in build area |
Link to the mrb reference guide
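For orientation, here is a minimal sketch of a typical mrb workflow (the working-directory name and core count are arbitrary choices, and the mrb newDev step, not listed in the table above, creates the development area); Friday’s session works through this properly.
mkdir ~/dune_dev && cd ~/dune_dev
mrb newDev -v $DUNETPC_VERSION -q e19:prof   # creates srcs/, a build directory, and localProducts...
source localProducts*/setup                  # defines the MRB_* environment variables
mrb gitCheckout dunetpc                      # clone a repository into srcs/
cd $MRB_BUILDDIR && mrbsetenv                # set up the build environment
mrb install -j4                              # build and install into localProducts (same as mrb i -j4)
mrbslp                                       # set up the locally built products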
Exercise 5
There is no exercise 5. mrb example exercises will be covered in Friday morning’s session as any useful exercise with mrb takes more than 30 minutes on its own. Everyone gets 100% credit for this exercise!
CVMFS
What is CVMFS and why do we need it?
DUNE has a need to distribute precompiled code to many different computers that collaborators may use. Installed products are needed for four things:
- Running programs interactively
- Running programs on grid nodes
- Linking programs to installed libraries
- Inspection of source code and data files
Results must be reproducible, so identical code and associated files must be distributed everywhere. DUNE does not own any batch resources – we use CPU time on computers that participating institutions donate to the Open Science Grid. We are not allowed to install our software on these computers and must return them to their original state when our programs finish running so they are ready for the next job from another collaboration.
CVMFS is a perfect tool for distributing software and related files. It stands for CernVM File System (VM is Virtual Machine). Local caches are provided on each target computer, and files are accessed via the /cvmfs mount point. DUNE software is in the directory /cvmfs/dune.opensciencegrid.org, and LArSoft code is in /cvmfs/larsoft.opensciencegrid.org. These directories are auto-mounted and become visible the first time one executes ls /cvmfs. Some software is also in /cvmfs/fermilab.opensciencegrid.org.
CVMFS also provides a de-duplication feature. If a given file is the same in all 100 releases of dunetpc, it is only cached and transmitted once, not independently for every release. So it considerably decreases the size of code that has to be transferred.
When a file is accessed in /cvmfs
, a daemon on the target computer wakes up and determines if the file is in the local cache, and delivers it if it is. If not, the daemon contacts the CVMFS repository server responsible for the directory, and fetches the file into local cache. In this sense, it works a lot like AFS. But it is a read-only filesystem on the target computers, and files must be published on special CVMFS publishing servers. Files may also be cached in a layer between the CVMFS host and the target node in a squid server, which helps facilities with many batch workers reduce the network load in fetching many copies of the same file, possibly over an international connection.
CVMFS also has a feature known as “Stashcache” or “xCache”. Files that are in /cvmfs/dune.osgstorage.org are not actually transmitted in their entirety, only pointers to them are, and then they are fetched from one of several regional cache servers or in the case of DUNE from Fermilab dCache directly. DUNE uses this to distribute photon library files, for instance.
CVMFS is by its nature world-readable, so code is readable by anyone in the world with a CVMFS client. CVMFS clients are available for download to desktops or laptops. Sensitive code cannot be stored in CVMFS.
More information on CVMFS is available here
Exercise 6
- cd /cvmfs and do an ls at top level
- What do you see? Do you see the four subdirectories (dune.opensciencegrid.org, larsoft.opensciencegrid.org, fermilab.opensciencegrid.org, and dune.osgstorage.org)?
- cd dune.osgstorage.org/pnfs/fnal.gov/usr/dune/persistent/stash/PhotonPropagation/LibraryData
Useful links to bookmark
- Official dataset definitions: dune-data.fnal.gov
- UPS reference manual
- UPS documentation (redmine)
- UPS qualifiers: About Qualifiers (redmine)
- mrb reference guide (redmine)
- CVMFS on DUNE wiki: Access files in CVMFS
Key Points
SAM and Rucio are data handling systems used by the DUNE collaboration to retrieve data.
Staging is a necessary step to make sure files are on disk in dCache (as opposed to only on tape).
XRootD allows users to stream data files.
The Unix Product Setup (UPS) is a tool to ensure consistency between different software versions and reproducibility.
The multi-repository build (mrb) tool allows code modification in multiple repositories, which is relevant for a large project like LArSoft, where different use cases (end users and developers) demand consistency between the builds.
CVMFS distributes software and related files without installing them on the target computer (CVMFS stands for CernVM File System, where VM is Virtual Machine).
SAM by Schellman
Overview
Teaching: 5 min
Exercises: 0 min
Questions
What event information can be queried for a given data file?
Objectives
Learn about the utility of SAM
Practice selected SAM commands
Notes on the SAM data catalog system
These notes are provided as an ancillary resource on the topic of DUNE data management by Dr. Heidi Schellman, 1-28-2020, updated Dec. 2021.
Introduction
SAM is a data catalog originally designed for the D0 and CDF high energy physics experiments at Fermilab. It is now used by most of the Intensity Frontier experiments at Fermilab.
The most important objects cataloged in SAM are individual files and collections of files called datasets.
Data files themselves are not stored in SAM, their metadata is, and that metadata allows you to search for and find the actual physical files. SAM also provides mechanisms for initiating and tracking file delivery through projects.
General considerations
SAM was designed to ensure that large scale data-processing was done completely and accurately, which leads to some features not always present in a generic catalog but very desirable if one wishes to maintain high standards of reproducibility and documentation in data analysis.
For example, at the time of the original design, the main storage medium was 8mm tapes using consumer-grade drives. Drive and tape failure rates were > 1%. Several SAM design concepts, notably luminosity blocks and parentage tracking, were introduced to allow accurate tracking of files and their associated normalization in a high error-rate environment. The main design goals were:
- Description of the contents of data collections to allow later retrieval
- Tracking of object and collection parentage and description of processing transformations to document the full provenance of any data object and ensure accurate normalization
- Grouping of objects and collection into larger “datasets” based on their characteristics
- Storing physical location of objects
- Tracking of the processing of collections to allow reprocessing on failure and avoid double processing.
- Methods (“projects”) for delivering and tracking collections in multi-processing jobs
- Preservation of data about processing/storage for debugging/reporting
The first 3 goals relate to content and characteristics while the last 3 relate to data storage and processing tools.
Specifics
-
The current SAM implementation uses the file as the basic unit of information. Metadata is associated with the file name. Filenames must be unique in the system. This prevents duplication of data in a sample, as a second copy cannot be cataloged. This makes renaming a file very unwise. A very common practice is to include some of the metadata in the filename, both to make it easier to identify and to ensure uniqueness.
-
Metadata for a file can include file locations but does not have to. A file can have no location at all, or many. When you move or remove a file with an associated SAM location, you need to update the location information.
-
SAM does not move files. It provides location information for a process to use in streaming or copying a file using its own methods. Temporary locations (such as on a grid node) need not be reported to SAM. Permanently storing or removing files requires both moving/removing the file itself and updating its metadata to reflect that location; this is generally left up to special packages such as the Fermilab FTS (File Transfer Service) and SAM projects.
-
Files which are stored on disk or tape are expected to have appropriate file sizes and checksums. One can have duplicate instances of a file in different locations, but they must all be identical. If one reprocesses an input file, the output may well be subtly different (for example dates stored in the file itself can change the checksum). SAM should force you to choose which version, old or new, is acceptable. It will not let you catalog both with the same filename. As a result, if you get a named file out of SAM, you can be reasonably certain you got the right copy.
-
Files with duplicate content but different names can be problematic. The reprocessed file mentioned in part 4, if renamed, could cause significant problems if it were allowed into the data sample along with the originals, as a job processing all files might get both copies. This is one of the major reasons for the checksums and unique filenames. There is a temptation to put, for example, timestamps in filenames to generate unique names, but that removes a protection against duplication.
-
Files can have parents and children and, effectively, birth certificates that can tell you how they were made. An example would be a set of raw data files RAWnnn processed with code X to produce a single reconstructed file RECO. One can tell SAM that RAWnnn are the parents of RECO processed with version x of code X. If one later finds another RAWnnn file that was missed in processing, SAM can tell you it has not been processed yet with X (i.e., it has no children associated with version x of X) and you can then choose to process that file. This use case often occurs when a production job fails without reporting back or succeeds but the copy back or catalog action fails. Note: The D0 experiment required that all official processing going into SAM be done with tagged releases and fixed parameter sets to increase reproducibility and the tags for that information were included in the metadata. Calibration databases were harder to timestamp so some variability was still possible if calibrations were updated.
-
SAM supports several types of data description fields:
- Values are standard across all implementations, like run_type, file_size, …
- Parameters are defined by the experiment, for example MC.Genieversion
Values are common to almost all HEP experiments and are optimized for efficient queries. SAM also allows definition of “parameters” (by administrators) as they are needed. This allows the schema to be modified easily as needs arise.
-
Metadata can also contain “spill” or luminosity block information that allows a file to point to specific data taking periods with smaller granularity than a run or subrun. When files are merged, this spill information is also merged.
-
SAM currently does not contain a means of easily determining which file a given event is in. If a daq system is writing multiple streams, an event from a given subrun could be in any stream. Adding an event database would be a useful feature.
All of these features are intended to assure that your data are well described and can be found. As SAM stores full location information, this means any SAM-visible location. In addition, if parentage information is provided, you can determine and reproduce the full provenance of any file.
Datasets and projects
Datasets
In addition to the files themselves, SAM allows you to define datasets.
A SAM dataset is not a fixed list of files but a query against the SAM database. An example would be “data_tier reconstructed and run_number 2001 and version v10” which would be all files from run 2001 that are reconstructed data produced by version v10. This dataset is dynamic. If one finds a missing file from run 2001 and reconstructs it with v10, the dataset will grow. There are also dataset snapshots that are derived from datasets and capture the exact files in the dataset when the snapshot was made. Note: most other data catalogs assume a “dataset” is a fixed list of files. This is a “snapshot” in SAM.
samweb is the command-line tool and Python API that allows queries of the SAM metadata, creation of datasets, and tools to track and deliver information to batch jobs.
samweb can be acquired from ups via
setup samweb_client
Or installed locally via
git clone http://cdcvs.fnal.gov/projects/sam-web-client
You then need to do something like:
export PATH=$HOME/sam-web-client/bin:${PATH}
export PYTHONPATH=$HOME/sam-web-client/python:${PYTHONPATH}
export SAM_EXPERIMENT=dune
Projects
SAM also supports access tracking mechanisms called projects and consumers. These are generally implemented for you by grid processing scripts. Your job is to choose a dataset and then ask the processing system to launch a project for that dataset.
A project is effectively a processing campaign across a dataset which is owned by the SAM system. At launch a snapshot is generated and then the files in the snapshot are delivered to a set of consumers. The project maintains an internal record of the status of the files and consumers. Each grid process can instantiate a consumer which is attached to the project. Those consumers then request “files” from the project and, when done processing, tell the project of their status.
The original SAM implementation actually delivered the files to local hard drives. Modern SAM delivers the location information and expects the consumer to find the optimal delivery method. This is a pull model, where the consuming process requests the next file rather than having the file assigned to it. This makes the system more robust on distributed systems.
See running projects here.
Accessing the database in read mode
Checking the database does not require special privileges but storing files and running projects modifies the database and requires authentication to the right experimental group. kx509
authentication and membership in the experiment VO are needed.
Administrative actions like adding new values are restricted to a small set of superusers for each experiment.
Suggestions for configuring SAM (for admins)
First of all, it really is nice to have filenames and dataset names that tell you what’s in the box, although not required. The D0 and MINERvA conventions have been to use “_” underscores between useful key strings. As a result, D0 and MINERvA tried not to use “_” in metadata entries to allow cleaner parsing. “-“ is used if needed in the metadata.
D0 also appended processing information to filenames as they moved through the system to assure that files run through different sequences had unique identifiers.
Example: A Monte Carlo simulation file generated with version v3 and then reconstructed with v5 might look like
SIM_MC_020000_0000_simv3.root would be a parent of RECO_MC_020000_0000_simv3_recov5.root
Data files are all children of the raw data while simulation files sometimes have more complicated ancestry, with both unique generated events and overlay events from data as parents.
Setting up SAM metadata (For admins)
This needs to be done once, and very carefully, early in the experiment. It can grow but thinking hard at the beginning saves a lot of pain later.
You need to define data_tiers. These represent the different types of data that you produced through your processing chain. Examples would be raw, pedsup, calibrated, reconstructed, thumbnail, mc-generated, mc-geant, mc-overlaid.
run_type can be used to support multiple DAQ instances.
data_stream is often used for trigger subsamples that you may wish to split data into (for example pedestal vs data runs).
Generally, you want to store data from a given data_tier with other data from that tier to facilitate fast sequential access.
Applications
It is useful, but not required to also define applications which are triads of “appfamily”, “appname” and “version”. Those are used to figure out what changed X to Y. There are also places to store the machine the application ran on and the start and end time for the job.
The query:
samweb list-files "data_tier raw and not isparentof: (data_tier reconstructed and appname reco and version 7)"
Should, in principle, list raw data files not yet processed by version 7 of reco to produce files of tier reconstructed. You would use this to find lost files in your reconstruction after a power outage.
It is good practice to also store the name of the head application configuration file for processing but this does not have a standard “value.”
Example metadata from DUNE
Here are some examples of querying sam to get file information
$ samweb get-metadata np04_raw_run005141_0015_dl10_reco_12736632_0_20181028T182951.root --json
{
"file_name": "np04_raw_run005141_0015_dl10_reco_12736632_0_20181028T182951.root",
"file_id": 7352771,
"create_date": "2018-10-29T14:59:42+00:00",
"user": "dunepro",
"update_date": "2018-11-28T17:07:30+00:00",
"update_user": "schellma",
"file_size": 14264091111,
"checksum": [
"enstore:1390300706",
"adler32:e8bf4e23"
],
"content_status": "good",
"file_type": "detector",
"file_format": "artroot",
"data_tier": "full-reconstructed",
"application": {
"family": "art",
"name": "reco",
"version": "v07_08_00_03"
},
"event_count": 108,
"first_event": 21391,
"last_event": 22802,
"start_time": "2018-10-28T17:34:58+00:00",
"end_time": "2018-10-29T14:55:42+00:00",
"data_stream": "physics",
"beam.momentum": 7.0,
"data_quality.online_good_run_list": 1,
"detector.hv_value": 180,
"DUNE_data.acCouple": 0,
"DUNE_data.calibpulsemode": 0,
"DUNE_data.DAQConfigName": "np04_WibsReal_Ssps_BeamTrig_00021",
"DUNE_data.detector_config": "cob2_rce01:cob2_rce02:cob2.. 4 more lines of text",
"DUNE_data.febaselineHigh": 2,
"DUNE_data.fegain": 2,
"DUNE_data.feleak10x": 0,
"DUNE_data.feleakHigh": 1,
"DUNE_data.feshapingtime": 2,
"DUNE_data.inconsistent_hw_config": 0,
"DUNE_data.is_fake_data": 0,
"runs": [
[
5141,
1,
"protodune-sp"
]
],
"parents": [
{
"file_name": "np04_raw_run005141_0015_dl10.root",
"file_id": 6607417
}
]
}
$ samweb get-file-access-url np04_raw_run005141_0015_dl10_reco_12736632_0_20181028T182951.root
gsiftp://eospublicftp.cern.ch/eos/experiment/neutplatform/protodune/rawdata/np04/output/detector/full-reconstructed/07/35/27/71/np04_raw_run005141_0015_dl10_reco_12736632_0_20181028T182951.root
gsiftp://fndca1.fnal.gov:2811/pnfs/fnal.gov/usr/dune/tape_backed/dunepro/protodune/np04/beam/output/detector/full-reconstructed/07/35/27/71/np04_raw_run005141_0015_dl10_reco_12736632_0_20181028T182951.root
$ samweb file-lineage children np04_raw_run005141_0015_dl10.root
np04_raw_run005141_0015_dl10_reco_12736632_0_20181028T182951.root
$ samweb file-lineage parents np04_raw_run005141_0015_dl10_reco_12736632_0_20181028T182951.root
np04_raw_run005141_0015_dl10.root
Merging and splitting (for experts)
Parentage works well when merging files, but splitting files can become problematic because it makes the parentage structure quite complex. SAM will let you merge files with different attributes if you don't check carefully. Generally, it is a good idea not to merge files from different data tiers, and certainly not from different data_types. Merging across major processing versions should also be avoided.
Example: Execute samweb Commands
There is documentation at here and here.
This exercise will get you started accessing data files that have been defined in the DUNE Data Catalog. After logging in to a DUNE interactive node and creating the directories above, execute the following commands (once per session):
setup sam_web_client #(or set up your standalone version)
export SAM_EXPERIMENT=dune
Then if curious about a file:
samweb locate-file np04_raw_run005141_0001_dl7.root
this will give you output that looks like
rucio:protodune-sp
enstore:/pnfs/dune/tape_backed/dunepro/protodune/np04/beam/detector/None/raw/06/60/59/05(596@vr0072m8)
castor:/neutplatform/protodune/rawdata/np04/detector/None/raw/06/60/59/05
cern-eos:/eos/experiment/neutplatform/protodune/rawdata/np04/detector/None/raw/06/60/59/05
which are the locations of the file on disk and tape. We can use this to copy the file from tape to our local disk. Better yet, if the file is staged to disk you can use xrootd to access it without copying it at all. Find the xrootd URI via:
samweb get-file-access-url np04_raw_run005141_0001_dl7.root --schema=root
root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/tape_backed/dunepro/protodune/np04/beam/detector/None/raw/06/60/59/05/np04_raw_run005141_0001_dl7.root
root://castorpublic.cern.ch//castor/cern.ch/neutplatform/protodune/rawdata/np04/detector/None/raw/06/60/59/05/np04_raw_run005141_0001_dl7.root
root://eospublic.cern.ch//eos/experiment/neutplatform/protodune/rawdata/np04/detector/None/raw/06/60/59/05/np04_raw_run005141_0001_dl7.root
You can localize your file with the --location argument (enstore, castor, cern-eos):
samweb get-file-access-url np04_raw_run005141_0001_dl7.root --schema=root --location=enstore
root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/tape_backed/dunepro/protodune/np04/beam/detector/None/raw/06/60/59/05/np04_raw_run005141_0001_dl7.root
samweb get-file-access-url np04_raw_run005141_0001_dl7.root --schema=root --location=cern-eos
root://eospublic.cern.ch//eos/experiment/neutplatform/protodune/rawdata/np04/detector/None/raw/06/60/59/05/np04_raw_run005141_0001_dl7.root
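Once you have an xrootd URI you can either copy the file with xrdcp or open it directly in ROOT without making a local copy. The cern-eos URI from above is used here; this assumes xrdcp and ROOT are available in your environment and that the file is staged to disk.
# copy the file over xrootd
xrdcp root://eospublic.cern.ch//eos/experiment/neutplatform/protodune/rawdata/np04/detector/None/raw/06/60/59/05/np04_raw_run005141_0001_dl7.root .
# or stream it directly into ROOT
root -l 'root://eospublic.cern.ch//eos/experiment/neutplatform/protodune/rawdata/np04/detector/None/raw/06/60/59/05/np04_raw_run005141_0001_dl7.root'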
To get SAM metadata for a file for which you know the name:
samweb get-metadata np04_raw_run005141_0001_dl7.root
Add the --json option to get the output in JSON format.
To list raw data files for a given run:
samweb list-files "run_number 5141 and run_type protodune-sp and data_tier raw"
What about a reconstructed version?
samweb list-files "run_number 5141 and run_type protodune-sp and data_tier full-reconstructed and version (v07_08_00_03,v07_08_00_04)"
Gives a list of files from the first production like
np04_raw_run005141_0001_dl7_reco_12736115_0_20181028T165152.root
We also group reconstruction versions into Campaigns, like PDSPProd4:
samweb list-files "run_number 5141 and run_type protodune-sp and data_tier full-reconstructed and DUNE.campaign PDSPProd4"
Gives more recent files like:
np04_raw_run005141_0009_dl1_reco1_18126423_0_20210318T102429Z.root
samweb allows you to select on a lot of parameters
Useful ProtoDUNE samweb parameters can be found here and here; these list some official dataset definitions.
You can make your own samweb dataset definitions: First, make certain a definition does not already exist that satisfies your needs by checking the official pages above.
Then check to see what you will get:
samweb list-files "data_tier full-reconstructed and DUNE.campaign PDSPProd4 and data_stream cosmics and run_type protodune-sp and detector.hv_value 180" –summary
samweb create-definition $USER-PDSPProd4_good_cosmics "data_tier full-reconstructed and DUNE.campaign PDSPProd4 and data_stream cosmics and run_type protodune-sp and detector.hv_value 180"
Note that your username appears in the definition name: to prevent confusion with official samples, your user name is required in the definition names you create.
Prestaging
At CERN, files are either on eos or castor. At FNAL they can be in tape-backed dCache, which may mean they are only on tape and need to be prestaged to disk before access.
setup fife_utils # a new version we requested
Check whether a file is on tape or on disk:
sam_validate_dataset --locality --file np04_raw_run005141_0015_dl10.root --location=/pnfs/ --stage_status
Staging status for: file np04_raw_run005141_0015_dl10.root
Total Files: 1
Tapes spanned: 1
Percent files on disk: 0%
Percent bytes online DCache: 0%
locality counts:
ONLINE: 0
NEARLINE: 1
NEARLINE_size: 8276312581
Oops - this one is not on disk. The command would report ONLINE_AND_NEARLINE: 1 if the file were available on disk.
sam_validate_dataset --locality --name=schellma-1GeVMC-test --stage_status --location=/pnfs/
Staging status for: defname:schellma-1GeVMC-test
Total Files: 140
Tapes spanned: 10
Percent files on disk: 100%
Percent bytes online DCache: 100%
locality counts:
ONLINE: 0
ONLINE_AND_NEARLINE: 140
ONLINE_AND_NEARLINE_size: 270720752891
No ONLINE_AND_NEARLINE count means you need to prestage that file. Unfortunately, prestaging requires a dataset definition. Let's find some for run 5141; your physics group should already have some defined.
The official Protodune dataset definitions are here.
The definition PDSPProd4a_MC_1GeV_reco1_sce_datadriven_v1_00 is simulation for 10% of the total sample. Describing it with
samweb describe-definition PDSPProd4a_MC_1GeV_reco1_sce_datadriven_v1_00
gives:
Definition Name: PDSPProd4a_MC_1GeV_reco1_sce_datadriven_v1_00
Definition Id: 635109
Creation Date: 2021-08-02T16:57:20+00:00
Username: dunepro
Group: dune
Dimensions: run_type 'protodune-sp' and file_type mc and data_tier 'full-reconstructed' and dune.campaign PDSPProd4a and dune_mc.beam_energy 1 and
dune_mc.space_charge yes and dune_mc.generators beam_cosmics and version v09_17_01 and run_number in 18800650,.....
samweb list-files "defname:PDSPProd4a_MC_1GeV_reco1_sce_datadriven_v1_00
> " --summary
File count: 5025
Total size: 9683195368818
Event count: 50250
samweb prestage-dataset --def=PDSPProd4a_MC_1GeV_reco1_sce_datadriven_v1_00 --parallel=10
would prestage all of the files in that dataset definition; you can check on the status by going here and scrolling down to find your prestage link.
At CERN
You can find local copies of files at CERN for interactive use.
samweb list-file-locations --defname=runset-5141-raw-180kV-7GeV-v0 --schema=root --filter_path=castor
gives you:
root://castorpublic.cern.ch//castor/cern.ch/neutplatform/protodune/rawdata/np04/detector/None/raw/06/60/74/16/np04_raw_run005141_0015_dl3.root castor:/neutplatform/protodune/rawdata/np04/detector/None/raw/06/60/74/16 np04_raw_run005141_0015_dl3.root 8289321123
root://castorpublic.cern.ch//castor/cern.ch/neutplatform/protodune/rawdata/np04/detector/None/raw/06/60/74/17/np04_raw_run005141_0015_dl10.root castor:/neutplatform/protodune/rawdata/np04/detector/None/raw/06/60/74/17 np04_raw_run005141_0015_dl10.root 8276312581
Key Points
SAM is a data catalog originally designed for the D0 and CDF experiments at FNAL and is now used widely by HEP experiments.
Quiz on Storage Spaces and Data Management
Overview
Teaching: 5 min
Exercises: 0 minQuestions
Do you understand the storage spaces and data management principles?
Objectives
Validate your understanding by working through and discussing answers to several questions.
Quiz Time
For this training, the quiz for storage spaces and data management can be found here.
Participants were (are) encouraged to work through these as homework.
A Live document can be used for questions, or our Slack channel for this event.
The solutions to this quiz can be found as a PDF here.
Key Points
Practice makes perfect.
Grid Job Submission and Common Errors
Overview
Teaching: 75 min
Exercises: 0 minQuestions
How to submit grid jobs?
Objectives
Submit a job and understand what’s happening behind the scenes
Monitor the job and look at its outputs
Review best practices for submitting jobs (including what NOT to do)
Extension; submit a small job with POMS
Video Session
The grid job submission portion from the December 2021 training was captured on video, and is provided here.
Submit a job
Note that job submission requires an FNAL account, but it can be done from a CERN machine, or any other machine with CVMFS access.
First, log in to a dunegpvm machine (this should also work from lxplus, with the minor extra step of getting a Fermilab Kerberos ticket on lxplus via kinit). Then you will need to set up the job submission tools (jobsub). If you set up dunetpc it will be included; if not, you need to do:
source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh
setup jobsub_client
Having done that, let us submit a prepared script:
jobsub_submit -G dune -M -N 1 --memory=1000MB --disk=1GB --cpu=1 --expected-lifetime=1h --resource-provides=usage_model=DEDICATED,OPPORTUNISTIC,OFFSITE -l '+SingularityImage=\"/cvmfs/singularity.opensciencegrid.org/fermilab/fnal-wn-sl7:latest\"' --append_condor_requirements='(TARGET.HAS_CVMFS_dune_opensciencegrid_org==true&&TARGET.HAS_CVMFS_larsoft_opensciencegrid_org==true&&TARGET.CVMFS_dune_opensciencegrid_org_REVISION>=1105)' file:///dune/app/users/kherner/submission_test_singularity.sh
If all goes well you should see something like this:
/fife/local/scratch/uploads/dune/kherner/2021-01-20_120444.002077_4240
/fife/local/scratch/uploads/dune/kherner/2021-01-20_120444.002077_4240/submission_test_singularity.sh_20210120_120445_308543_0_1_.cmd
submitting....
Submitting job(s).
1 job(s) submitted to cluster 40351757.
JobsubJobId of first job: 40351757.0@jobsub01.fnal.gov
Use job id 40351757.0@jobsub01.fnal.gov to retrieve output
Quiz
- What is your job ID?
Now, let’s look at some of these options in more detail.
- -M sends mail after the job completes, whether it was successful or not. The default is email only on error. To disable all emails, use --mail_never.
- -N controls the number of identical jobs submitted with each cluster. Also called the process ID, the number ranges from 0 to N-1 and forms the part of the job ID number after the period, e.g. 12345678.N.
- --memory, --disk, --cpu, --expected-lifetime request this much memory, disk, number of CPUs, and maximum run time. Jobs that exceed the requested amounts will go into a held state. Defaults are 2000 MB, 10 GB, 1, and 8h, respectively. Note that jobs are charged against the DUNE FermiGrid quota according to the greater of memory/2000 MB and number of CPUs, with fractional values possible. For example, a 3000 MB request is charged 1.5 "slots", and 4000 MB would be charged 2. You are charged for the amount requested, not what is actually used, so you should not request any more than you actually need (your jobs will also take longer to start the more resources you request). Note also that jobs that run offsite do NOT count against the FermiGrid quota. In general, aim for memory and run time requests that will cover 90-95% of your jobs and use the autorelease feature to deal with the remainder.
- --resource-provides=usage_model controls where jobs are allowed to run. DEDICATED means use the DUNE FermiGrid quota, OPPORTUNISTIC means use idle FermiGrid resources beyond the DUNE quota if they are available, and OFFSITE means use non-Fermilab resources. You can combine them in a comma-separated list. In nearly all cases you should set this to DEDICATED,OPPORTUNISTIC,OFFSITE. This ensures maximum resource availability and will get your jobs started the fastest. Note that because of Singularity, there is absolutely no difference between the environment on Fermilab worker nodes and any other place. Depending on where your input data are (if any), you might see slight differences in network latency, but that's it.
- -l (or --lines=) allows you to pass additional arbitrary HTCondor-style classad variables into the job. In this case, we're specifying exactly what Singularity image we want to use in the job. It will be automatically set up for us when the job starts. Any other valid HTCondor classad is possible. In practice you don't have to do much beyond the Singularity image. Here, pay particular attention to the quotes and backslashes.
- --append_condor_requirements allows you to pass additional HTCondor-style requirements to your job. This helps ensure that your jobs don't start on a worker node that might be missing something you need (a corrupt or out-of-date CVMFS repository, for example). Some checks run at startup for a variety of CVMFS repositories. Here, we check that Singularity invocation is working and that the CVMFS repos we need (dune.opensciencegrid.org and larsoft.opensciencegrid.org) are in working order. Optionally you can also place version requirements on CVMFS repos (as we did here as an example), useful in case you want to use software that was published very recently and may not have rolled out everywhere yet.
Job Output
This particular test writes a file to /pnfs/dune/scratch/users/<username>/job_output_<id number>.log.
Verify that the file exists and is non-zero size after the job completes.
You can delete it after that; it just prints out some information about the environment.
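For reference, a job script for a test like this might look roughly as follows. This is only a sketch, not the actual submission_test_singularity.sh; the output filename and the use of ifdh are illustrative assumptions.
#!/bin/bash
# minimal grid job sketch: record some environment information and copy it back
echo "Running on $(hostname)"
OUTFILE=job_output_${CLUSTER:-0}_${PROCESS:-0}.log
env | sort > ${OUTFILE}
# GRID_USER is set by jobsub; ifdh cp -D copies the file into the target directory
ifdh cp -D ${OUTFILE} /pnfs/dune/scratch/users/${GRID_USER}/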
More information about jobsub is available here and here.
Submit a job using the tarball containing custom code (left as an exercise)
First off, a very important point: for running analysis jobs, you may not actually need to pass an input tarball, especially if you are just using code from the base release and you don't actually modify any of it. All you need to do is set up any required software from CVMFS (e.g. dunetpc and/or protoduneana), and you are ready to go. If you're just modifying a fcl file, for example, but no code, it's actually more efficient to copy just the fcl(s) you're changing to the scratch directory within the job, and edit them as part of your job script (copies of a fcl file in the current working directory have priority over others by default). A short sketch of this fcl-only approach follows below.
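The fcl name, the appended parameter, and the input file in this sketch are hypothetical; the point is simply that a local copy plus an appended override is picked up ahead of the release version.
# copy only the fcl we want to change into the job's working directory
# (here assuming it was shipped with the job via jobsub's -f option, so it sits in $CONDOR_DIR_INPUT)
cp ${CONDOR_DIR_INPUT}/myreco.fcl .
# append an override as part of the job script (parameter name is made up)
echo 'physics.analyzers.myana.Verbosity: 1' >> myreco.fcl
# run as usual; the local myreco.fcl takes precedence
lar -c myreco.fcl -n 10 -s myinput.root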
Sometimes, though, we need to run some custom code that isn’t in a release. We need a way to efficiently get code into jobs without overwhelming our data transfer systems. We have to make a few minor changes to the scripts you made in the previous tutorial section, generate a tarball, and invoke the proper jobsub options to get that into your job. There are many ways of doing this but by far the best is to use the Rapid Code Distribution Service (RCDS), as shown in our example.
If you have finished up the LArSoft follow-up and want to use your own code for this next attempt, feel free to tar it up (you don't need anything besides the localProducts* and work directories) and use your own tarball in lieu of the one in this example. You will then have to substitute your own submit file for the pre-made one in the last line.
First, we should make a tarball. Here is what we can do (assuming you are starting from /dune/app/users/username/):
cp /dune/app/users/kherner/setupMay2021Tutorial-grid.sh /dune/app/users/username/
cp /dune/app/users/kherner/dec2021tutorial/localProducts_larsoft__e19_prof/setup-grid /dune/app/users/username/dec2021tutorial/localProducts_larsoft__e19_prof/setup-grid
Before we continue, let’s examine these files a bit. We will source the first one in our job script, and it will set up the environment for us.
#!/bin/bash
DIRECTORY=may2021tutorial
# we cannot rely on "whoami" in a grid job. We have no idea what the local username will be.
# Use the GRID_USER environment variable instead (set automatically by jobsub).
USERNAME=${GRID_USER}
source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh
export WORKDIR=${_CONDOR_JOB_IWD} # if we use the RCDS then our tarball will be placed in $INPUT_TAR_DIR_LOCAL.
if [ ! -d "$WORKDIR" ]; then
export WORKDIR=`echo .`
fi
source ${INPUT_TAR_DIR_LOCAL}/${DIRECTORY}/localProducts*/setup-grid
mrbslp
Now let’s look at the difference between the setup-grid script and the plain setup script. Assuming you are currently in the /dune/app/users/username directory:
diff may2021tutorial/localProducts_larsoft__e19_prof/setup may2021tutorial/localProducts_larsoft__e19_prof/setup-grid
< setenv MRB_TOP "/dune/app/users/<username>/may2021tutorial"
< setenv MRB_TOP_BUILD "/dune/app/users/<username>/may2021tutorial"
< setenv MRB_SOURCE "/dune/app/users/<username>/may2021tutorial/srcs"
< setenv MRB_INSTALL "/dune/app/users/<username>/may2021tutorial/localProducts_larsoft__e19_prof"
---
> setenv MRB_TOP "${INPUT_TAR_DIR_LOCAL}/may2021tutorial"
> setenv MRB_TOP_BUILD "${INPUT_TAR_DIR_LOCAL}/may2021tutorial"
> setenv MRB_SOURCE "${INPUT_TAR_DIR_LOCAL}/may2021tutorial/srcs"
> setenv MRB_INSTALL "${INPUT_TAR_DIR_LOCAL}/may2021tutorial/localProducts_larsoft__e19_prof"
As you can see, we have switched from the hard-coded directories to directories defined by environment variables; the INPUT_TAR_DIR_LOCAL variable will be set for us (see below).
Now, let's actually create our tar file. Again assuming you are in /dune/app/users/username/ (the directory that contains may2021tutorial and setupMay2021Tutorial-grid.sh):
tar --exclude '.git' -czf may2021tutorial.tar.gz may2021tutorial/localProducts_larsoft__e19_prof may2021tutorial/work setupMay2021Tutorial-grid.sh
Then submit another job (in the following we keep the same submit file as above):
jobsub_submit -G dune -M -N 1 --memory=1800MB --disk=2GB --expected-lifetime=3h --cpu=1 --resource-provides=usage_model=DEDICATED,OPPORTUNISTIC,OFFSITE --tar_file_name=dropbox:///dune/app/users/<username>/dec2021tutorial.tar.gz --use-cvmfs-dropbox -l '+SingularityImage=\"/cvmfs/singularity.opensciencegrid.org/fermilab/fnal-wn-sl7:latest\"' --append_condor_requirements='(TARGET.HAS_Singularity==true&&
TARGET.HAS_CVMFS_dune_opensciencegrid_org==true&&
TARGET.HAS_CVMFS_larsoft_opensciencegrid_org==true&&
TARGET.CVMFS_dune_opensciencegrid_org_REVISION>=1105&&
TARGET.HAS_CVMFS_fifeuser1_opensciencegrid_org==true&&
TARGET.HAS_CVMFS_fifeuser2_opensciencegrid_org==true&&
TARGET.HAS_CVMFS_fifeuser3_opensciencegrid_org==true&&
TARGET.HAS_CVMFS_fifeuser4_opensciencegrid_org==true)' file:///dune/app/users/kherner/run_dec2021tutorial.sh
You’ll see this is very similar to the previous case, but there are some new options:
- --tar_file_name=dropbox:// automatically copies and untars the given tarball into a directory on the worker node, accessed via the INPUT_TAR_DIR_LOCAL environment variable in the job. As of now, only one such tarball can be specified. If you need to copy additional files into your job that are not in the main tarball you can use the -f option (see the jobsub manual for details). The value of INPUT_TAR_DIR_LOCAL is by default $CONDOR_DIR_INPUT/name_of_tar_file, so if you have a tar file named e.g. may2021tutorial.tar.gz, it would be $CONDOR_DIR_INPUT/may2021tutorial.
- --use-cvmfs-dropbox stages that tarball through the RCDS (really a collection of special CVMFS repositories). As of April 2021 this is the default method for tarball transfer, so the --use-cvmfs-dropbox option is no longer strictly needed (though it does not hurt to keep it in your submission for now).
- Notice that the --append_condor_requirements line is longer now, because we also check for the fifeuser[1-4].opensciencegrid.org CVMFS repositories.
Now, there’s a very small gotcha when using the RCDS, and that is when your job runs, the files in the unzipped tarball are actually placed in your work area as symlinks from the CVMFS version of the file (which is what you want since the whole point is not to have N different copies of everything). The catch is that if your job script expected to be able to edit one or more of those files within the job, it won’t work because the link is to a read-only area. Fortunately there’s a very simple trick you can do in your script before trying to edit any such files:
cp ${INPUT_TAR_DIR_LOCAL}/file_I_want_to_edit mytmpfile # do a cp, not mv
rm ${INPUT_TAR_DIR_LOCAL}/file_I_want_to_edit # this really just removes the symlink
mv mytmpfile file_I_want_to_edit # now it's available as an editable regular file
You certainly don’t want to do this for every file, but for a handful of small text files this is perfectly acceptable and the overall benefits of copying in code via the RCDS far outweigh this small cost. This can get a little complicated when trying to do it for things several directories down, so it’s easiest to have such files in the top level of your tar file.
Monitor your jobs
For all links below, log in with your FNAL Services credentials (FNAL email, not Kerberos password).
- What DUNE is doing overall: https://fifemon.fnal.gov/monitor/d/000000053/experiment-batch-details?orgId=1&var-experiment=dune
- What's going on with only your jobs: https://fifemon.fnal.gov/monitor/d/000000116/user-batch-details?orgId=1&var-cluster=fifebatch&var-user=kherner (remember to change the URL to your own username and adjust the time range to cover the region of interest)
- Why your jobs are held: https://fifemon.fnal.gov/monitor/d/000000146/why-are-my-jobs-held?orgId=1 (remember to choose your username in the upper left)
View the stdout/stderr of our jobs
Here’s the link for the history page of the example job: link.
Feel free to sub in the link for your own jobs.
Once there, click “View Sandbox files (job logs)”. In general you want the .out and .err files for stdout and stderr. The .cmd file can sometimes be useful to see exactly what got passed in to your job.
Kibana can also provide a lot of information.
You can also download the job logs from the command line with jobsub_fetchlog:
jobsub_fetchlog --jobid=12345678.0@jobsub0N.fnal.gov --unzipdir=some_appropriately_named_directory
That will download the logs as a tarball and unzip it into the directory specified by the --unzipdir option. Of course, replace 12345678.0@jobsub0N.fnal.gov with your own job ID.
Quiz
Download the log of your last submission via jobsub_fetchlog or look it up on the monitoring pages. Then answer the following questions (all should be available in the .out or .err files):
- On what site did your job run?
- How much memory did it use?
- Did it exit abnormally? If so, what was the exit code?
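If you fetched the logs with jobsub_fetchlog, a quick way to hunt for these answers is to grep the .out and .err files. The exact strings the job wrapper prints can vary, so the patterns below are just starting points, not guaranteed matches.
cd some_appropriately_named_directory
# site, memory usage, and exit status usually appear somewhere in the wrapper output
grep -i -E "site|GLIDEIN_Site" *.out *.err
grep -i -E "memory|rss" *.out *.err
grep -i -E "exit (code|status)" *.out *.err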
Brief review of best practices in grid jobs (and a bit on the interactive machines)
- When creating a new workflow or making changes to an existing one, ALWAYS test with a single job first. Then go up to 10, etc. Don’t submit thousands of jobs immediately and expect things to work.
- ALWAYS be sure to prestage your input datasets before launching large sets of jobs.
- Use RCDS; do not copy tarballs from places like scratch dCache. There’s a finite amount of transfer bandwidth available from each dCache pool. If you absolutely cannot use RCDS for a given file, it’s better to put it in resilient (but be sure to remove it when you’re done!). The same goes for copying files from within your own job script: if you have a large number of jobs looking for a same file, get it from resilient. Remove the copy when no longer needed. Files in resilient dCache that go unaccessed for 45 days are automatically removed.
- Be careful about placing your output files. NEVER place more than a few thousand files into any one directory inside dCache. That goes for all types of dCache (scratch, persistent, resilient, etc.).
- Avoid commands like ifdh ls /path/with/wildcards/*/ inside grid jobs. That is a VERY expensive operation and can cause a lot of pain for many users.
- Use xrootd when opening files interactively; this is much more stable than simply doing root /pnfs/dune/...
- NEVER copy job outputs to a directory in resilient dCache. Remember that they are replicated by a factor of 20! Any such files are subject to deletion without warning.
- NEVER do hadd on files in /pnfs areas unless you're using xrootd, i.e. do NOT do hadd out.root /pnfs/dune/file1 /pnfs/dune/file2 ... This can cause severe performance degradations (see the sketch after this list).
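As a concrete illustration of the hadd point above, the acceptable xrootd form looks like this. The paths are hypothetical; get real URIs for your files with samweb get-file-access-url --schema=root as shown earlier, or build them from the dCache xrootd door.
# bad: reads straight through the /pnfs mount and can hammer dCache
# hadd out.root /pnfs/dune/persistent/users/<username>/file1.root /pnfs/dune/persistent/users/<username>/file2.root
# better: stream the inputs over xrootd
hadd out.root \
  root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/persistent/users/<username>/file1.root \
  root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/persistent/users/<username>/file2.root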
(Time permitting) submit with POMS
POMS is the recommended way of submitting large workflows. It offers several advantages over other systems, such as
- Fully configurable. Any executables can be run, not necessarily only lar or art
- Automatic monitoring and campaign management options
- Multi-stage workflow dependencies, automatic dataset creation between stages
- Automated recovery options
At its core, in POMS one makes a “campaign”, which has one or more “stages”. In our example there is only a single stage.
For analysis use: main POMS page
An example campaign.
Typical POMS use centers around a configuration file (often more like a template which can be reused for many campaigns) and various campaign-specific settings for overriding the defaults in the config file.
An example config file designed to do more or less what we did in the previous submission is here: /dune/app/users/kherner/may2021tutorial/work/pomsdemo.cfg
You can find more about POMS here: POMS User Documentation
Helpful ideas for structuring your config files are here: Fife launch Reference
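To give a feel for the structure, here is a rough, illustrative sketch of such a config. The section and key names below are assumptions based on typical fife_launch configurations, not a verified DUNE recipe; use the pomsdemo.cfg example and the Fife launch Reference above for the authoritative format.
[global]
experiment = dune
[submit]
G = dune
memory = 2000MB
expected-lifetime = 3h
resource-provides = usage_model=DEDICATED,OPPORTUNISTIC,OFFSITE
[job_setup]
source_1 = /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh
[executable]
name = lar
arg_1 = -c
arg_2 = myconfig.fcl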
When you start using POMS you must upload an x509 proxy to the server before submitting. The best way to do that is to set up the poms_client UPS product and then use the upload_file command after you have generated your proxy:
kx509
voms-proxy-init -rfc -noregen -voms dune:/dune/Role=Analysis -valid 120:00
upload_file --experiment dune --proxy
Finally, here is an example of a campaign that does the same thing as the previous one, using our usual MC reco file from Prod2, but does it via making a SAM dataset using that as the input: POMS campaign stage information. Of course, before running any SAM project, we should prestage our input definition(s):
samweb prestage-dataset kherner-may2021tutorial-mc
replacing the above definition with your own definition as appropriate.
If you are used to using other programs for your work such as project.py, there is a helpful tool called Project-py that you can use to convert existing xml into POMS configs, so you don’t need to start from scratch! Then you can just switch to using POMS from that point forward.
Further Reading
Some more background material on these topics (including some examples of why certain things are bad) are on this PDF:
DUNE Computing Tutorial: Advanced topics and best practices
The Glidein-based Workflow Management System
Key Points
When in doubt, ask! Understand that policies and procedures that seem annoying, overly complicated, or unnecessary (especially when compared to running an interactive test) are there to ensure efficient operation and scalability. They are also often the result of someone breaking something in the past, or of simpler approaches not scaling well.
Send test jobs after creating new workflows or making changes to existing ones. If things don’t work, don’t blindly resubmit and expect things to magically work the next time.
Only copy what you need in input tar files. In particular, avoid copying log files, .git directories, temporary files, etc. from interactive areas.
Take care to follow best practices when setting up input and output file locations.
Always, always, always prestage input datasets. No exceptions.
Quiz on Grid Job Submission
Overview
Teaching: 10 min
Exercises: 0 minQuestions
Do you understand grid job submission protocols?
Objectives
Validate your understanding by working through and discussing answers to several questions.
Quiz Time
Dr. David invited participants to take the quiz on grid job submission; the video is provided here.
For this training, the quiz for job submissions can be found here. Participants were (are) encouraged to work through these as homework.
The quiz questions and solutions were discussed for this training, captured on video, and provided here.
A Live document can be used for questions, or our Slack channel for this event.
The solutions to this quiz can be found as a PDF here.
Key Points
Practice makes perfect.
Code-makeover - Submit with POMS
Overview
Teaching: 15 min
Exercises: 0 minQuestions
How to submit grid jobs with POMS?
Objectives
Demonstrate use of POMS for job submission.
Submit with POMS
This lesson extends from earlier work: Grid Job Submission and Common Errors
POMS is the recommended way of submitting large workflows. It offers several advantages over other systems, such as
- Fully configurable. Any executables can be run, not necessarily only lar or art
- Automatic monitoring and campaign management options
- Multi-stage workflow dependencies, automatic dataset creation between stages
- Automated recovery options
At its core, in POMS one makes a “campaign”, which has one or more “stages”. In our example there is only a single stage.
For analysis use: main POMS page
An example campaign.
Typical POMS use centers around a configuration file (often more like a template which can be reused for many campaigns) and various campaign-specific settings for overriding the defaults in the config file.
An example config file designed to do more or less what we did in the previous submission is here: /dune/app/users/kherner/may2021tutorial/work/pomsdemo.cfg
You can find more about POMS here: POMS User Documentation
Helpful ideas for structuring your config files are here: Fife launch Reference
When you start using POMS you must upload an x509 proxy to the server before submitting (you can just scp your proxy file from a dunegpvm machine), and it must be named x509up_voms_dune_Analysis_yourusername when you upload it. To upload, look for the User Data item in the left-hand menu on the POMS site, choose Uploaded Files, and follow the instructions.
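For example (a sketch run from your own machine; the gpvm hostname, proxy path, and uid are placeholders you will need to adjust):
# the proxy on the dunegpvm typically lives in /tmp as x509up_u<uid>
scp yourusername@dunegpvm01.fnal.gov:/tmp/x509up_u12345 .
# rename it as POMS expects before uploading via User Data -> Uploaded Files
mv x509up_u12345 x509up_voms_dune_Analysis_yourusername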
Finally, here is an example of a campaign that does the same thing as the previous one, using our usual MC reco file from Prod2, but does it via making a SAM dataset using that as the input: POMS campaign stage information. Of course, before running any SAM project, we should prestage our input definition(s):
samweb prestage-dataset kherner-may2021tutorial-mc
replacing the above definition with your own definition as appropriate.
If you are used to using other programs for your work such as project.py, there is a helpful tool called Project-py that you can use to convert existing xml into POMS configs, so you don’t need to start from scratch! Then you can just switch to using POMS from that point forward.
Key Points
Always, always, always prestage input datasets. No exceptions.
Closing Remarks
Overview
Teaching: 5 min
Exercises: 0 minQuestions
Are you fortified with enough information to start your event analysis?
Objectives
Reflect on the days of learning.
Closing Remarks
Video Session
The closing remarks for this training were captured on video, and provided here.
Three Days of Training Collapsed into One Half Day
The instruction in this half-day workshop was provided by several experienced physicists and is based on years of collaborative experience.
Secure access to Fermilab computing systems and familiarity with data storage are key components.
Data management and event processing tools were described and modeled.
Protocols for efficient job submission and monitoring have been demonstrated.
We are thankful for the instructors' hard work, and for the numerous participants who joined.
Survey time!
Please give us feedback on this training by filling out our survey:
It is anonymous. Send us your impressions, praise, critiques and suggestions! Thank you very much.
Next Steps
Video recordings of the sessions have been posted within each lesson.
We invite you to bookmark this training site and to revisit the content regularly.
Point a colleague to the material.
Long Term Support
You have made some excellent connections with computing experts, and we invite your continued dialog.
The DUNE Slack channel (#computing-training-dec2021) will remain available, and we encourage you to keep the conversation going there.
Key Points
The DUNE Computing Consortium has presented this workshop so as to broaden the use of software tools used for analysis.