Data Management
Overview
Teaching: 25 min
Exercises: 0 min
Questions
What are the data management tools and software for DUNE?
How are different software versions handled?
What are the best data management practices?
Objectives
Learn how to access data from DUNE Data Catalog.
Understand the roles of the tools UPS, mrb and CVMFS.
Session Video
The session video from the training in January 2023 is provided here as a reference.
Live Notes
An archive of the Live Notes is also provided.
Introduction
DUNE data is stored around the world, and the storage elements are not always organized in a way that can be easily inspected. To catalog and locate the data we use the SAM web client.
What is SAM?
Sequential Access with Metadata (SAM) is a data handling system developed at Fermilab. It is designed to track locations of files and other file metadata.
This lecture will show you how to access data files that have been defined to the DUNE Data Catalog. Execute the following commands after logging in to the DUNE interactive node, and sourcing the main dune setups.
Once per session:
setup sam_web_client
export SAM_EXPERIMENT=dune
What is Rucio?
Rucio is the next-generation Data Replica service and is part of DUNE’s new Distributed Data Management (DDM) system that is currently in deployment. Rucio has two functions:
- A rule-based system to get files to Rucio Storage Elements around the world and keep them there.
- To return the “nearest” replica of any data file for use either interactively or in batch jobs. It is expected that most DUNE users will not regularly use direct Rucio commands, but rather wrapper scripts that call them indirectly.
As of the date of this January 2023 tutorial:
- The Rucio client is installed in CVMFS.
- “setup rucio” to get it.
- Most DUNE users are not yet enabled to use it, but when they are, some of the commands will look like this:
rucio list-file-replicas protodune-sp:np04_raw_run005801_0001_dl1.root
rucio download protodune-sp:np04_raw_run005801_0001_dl1.root
rucio list-rses
In the early days of Rucio, the most common use case will be locating the nearest replica of a file. That is what the rucio list-file-replicas command given above is for: it will return a URI suitable for streaming the file in question. Rucio will be read-only for most users initially. Current plans are that a wrapper will be supplied for writing files, both in jobs and interactively, so users should not have to learn the commands needed to upload to Rucio directly. On the client side, Rucio provides both command line tools and a Python API.
Rucio Concepts
A DID is a Data Identifier. Data Identifiers can describe a File, a Dataset (which can contain many files), or a Container (which can contain many datasets). A DID has the form scope:name.
A Replica is a copy of a File in a specific location. Most files have more than one Replica.
A Scope (in our implementation) is a collection of related data. Most of our Rucio Scopes map to detector type, i.e. protodune-sp, protodune-dp, vd-coldbox-top, hd-coldbox, hd-protodune, etc.
A Rule is an instruction to move a file, dataset, or Container to a specific place and keep it there.
An RSE is a Rucio Storage Element where replicas can be stored.
MetaCat Introduction
Everything that is in DUNE-managed storage must have metadata. Any file in Rucio must have metadata before it can be added. The MetaCat client is available in DUNE CVMFS. Extensive documentation is available at https://metacat.readthedocs.io. The MetaCat web GUI is available at https://metacat.fnal.gov:9443/dune_meta_demo/app/gui. Any DUNE user should be able to log in using their user name and password. MetaCat also has the ability to add plugins to jointly search the file database and other DUNE databases, such as the conditions database.
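The client can be set up from CVMFS like any other UPS product; a minimal sketch, assuming the UPS product is named metacat (check this, and the query syntax, against the documentation above):
ups list -aK+ metacat     # check which versions of the client are available
setup metacat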
Finding data
If you know a given file and want to locate it, e.g.:
samweb locate-file np04_raw_run005758_0001_dl3.root
This will give you output that looks like:
rucio:protodune-sp
cern-eos:/eos/experiment/neutplatform/protodune/rawdata/np04/detector/None/raw/07/42/28/49
castor:/neutplatform/protodune/rawdata/np04/detector/None/raw/07/42/28/49
enstore:/pnfs/dune/tape_backed/dunepro/protodune/np04/beam/detector/None/raw/07/42/28/49(597@vr0337m8)
which lists the locations of the file on disk and tape. We can use this to copy the file from tape to our local disk.
To list raw data files for a given run:
samweb list-files "run_number 5758 and run_type protodune-sp and data_tier raw"
np04_raw_run005758_0001_dl3.root
np04_raw_run005758_0002_dl2.root
...
np04_raw_run005758_0065_dl10.root
np04_raw_run005758_0065_dl4.root
What about a reconstructed version?
samweb list-files "run_number 5758 and run_type protodune-sp and data_tier full-reconstructed and version (v07_08_00_03,v07_08_00_04)"
np04_raw_run005758_0053_dl7_reco_12891068_0_20181101T222620.root
np04_raw_run005758_0025_dl11_reco_12769309_0_20181101T213029.root
np04_raw_run005758_0053_dl2_reco_12891066_0_20181101T222620.root
...
np04_raw_run005758_0061_dl8_reco_14670148_0_20190105T175536.root
np04_raw_run005758_0044_dl6_reco_14669100_0_20190105T172046.root
The output above is truncated; the full list includes the reconstructed file that is the child of the raw data file shown above.
To see the total number of files that match a certain query expression, add the --summary option to samweb list-files.
samweb allows you to select on a lot of parameters, which are documented here.
- dune-data.fnal.gov lists some official dataset definitions
Accessing data for use in your analysis
To access data without copying it, XRootD is the tool to use. However, it will work only if the file is staged to disk. You can stream files worldwide if you have a DUNE VO certificate, as described in the preparation part of this tutorial.
Where is the file?
An example to find a given file:
samweb get-file-access-url np04_raw_run005758_0001_dl3_reco_13600804_0_20181127T081955.root --schema=root
root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/tape_backed/dunepro/protodune/np04/beam/output/detector/full-reconstructed/08/61/68/00/np04_raw_run005758_0001_dl3_reco_13600804_0_20181127T081955.root
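With that URL you can copy or stream the file directly; a minimal sketch, assuming you have a valid DUNE VO proxy and the file is staged on disk:
# copy the file to the current directory over XRootD
xrdcp root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/tape_backed/dunepro/protodune/np04/beam/output/detector/full-reconstructed/08/61/68/00/np04_raw_run005758_0001_dl3_reco_13600804_0_20181127T081955.root .
The same URL can also be opened directly over the network, for example with TFile::Open in ROOT or with lar -s, without making a local copy.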
Resource: Using the SAM Data Catalog.
Exercise 1
- Use the --location argument to show the path of the file above on either enstore, castor or cern-eos.
- Use get-metadata to get SAM metadata for this file. Note that --json gives the output in JSON format.
When we are analyzing large numbers of files in a group of batch jobs, we use a SAM snapshot to describe the full set of files that we are going to analyze and create a SAM Project based on that. Each job will then come up and ask SAM to give it the next file in the list. SAM has some capability to grab the nearest copy of the file. For instance if you are running at CERN and analyzing this file it will automatically take it from the CERN storage space EOS.
Exercise 2
- use the samweb describe-definition command to see the dimensions of data set PDSPProd4_MC_1GeV_reco1_sce_datadriven_v1
- use the samweb list-definition-files command with the --summary option to see the total size of PDSPProd4_MC_1GeV_reco1_sce_datadriven_v1
- use the samweb take-snapshot command to make a snapshot of PDSPProd4_MC_1GeV_reco1_sce_datadriven_v1
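For reference, the syntax for these commands is (a sketch of what you would type; run them yourself to see the output):
samweb describe-definition PDSPProd4_MC_1GeV_reco1_sce_datadriven_v1
samweb list-definition-files --summary PDSPProd4_MC_1GeV_reco1_sce_datadriven_v1
samweb take-snapshot PDSPProd4_MC_1GeV_reco1_sce_datadriven_v1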
What is UPS and why do we need it?
An important requirement for making valid physics results is computational reproducibility. You need to be able to repeat the same calculations on the data and MC and get the same answers every time. You may be asked to produce a slightly different version of a plot for example, and the data that goes into it has to be the same every time you run the program.
This requirement is in tension with a rapidly-developing software environment, where many collaborators are constantly improving software and adding new features. We therefore require strict version control; the workflows must be stable and not constantly changing due to updates.
DUNE must provide installed binaries and associated files for every version of the software that anyone could be using. Users must then specify which version they want to run before they run it. All software dependencies must be set up with consistent versions in order for the whole stack to run and run reproducibly.
The Unix Product Setup (UPS) is a tool to handle the software product setup operation.
UPS is set up when you set up DUNE:
source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh
This sourcing defines the UPS setup command. DUNE's LArSoft-based software is then set up with:
setup dunesw $DUNESW_VERSION -q e20:prof
- dunesw: the product name
- $DUNESW_VERSION: the version tag
- e20:prof: the “qualifiers”

Qualifiers are separated with colons and may be specified in any order. The “e20” qualifier refers to a specific version of the gcc compiler suite, and “prof” means select the installed product that has been compiled with optimizations turned on. An alternative to “prof” is the “debug” qualifier. All builds of LArSoft and dunesw are compiled with debug symbols turned on, but the “debug” builds are made with optimizations turned off. Both kinds of software can be debugged, but it is easier to debug the debug builds (code executes in the proper order and variables aren’t optimized away so they can be inspected).
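For example, to pick up the debug build of the same release instead of the optimized one (a sketch; substitute a concrete version for $DUNESW_VERSION as in Exercise 3 below):
setup dunesw $DUNESW_VERSION -q e20:debug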
Another specifier of a product install is the “flavor”. This refers to the operating system the program was compiled for. These days we only support SL7, but in the past we used to also support SL6 and various versions of macOS. The flavor is automatically selected when you set up a product using setup (unless you override it, which is usually a bad idea). Some products are “unflavored” because they do not contain anything that depends on the operating system. Examples are products that only contain data files or text files.
Setting up a UPS product defines many environment variables. Most products have an environment variable of the form <productname>_DIR, where <productname> is the name of the UPS product in all capital letters. This is the top-level directory and can be used when searching for installed source code or fcl files, for example. <productname>_FQ_DIR is the one that specifies a particular qualifier and flavor.
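For example, after setting up dunesw you can inspect these variables (a minimal sketch):
echo $DUNESW_DIR      # top-level directory of the dunesw installation
echo $DUNESW_FQ_DIR   # flavor- and qualifier-specific directory
ls $DUNESW_DIR        # browse the installed product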
Exercise 3
- show all the versions of dunesw that are currently available by using the “ups list -aK+ dunesw” command
- pick one version and substitute that for DUNESW_VERSION above and set up dunesw
Many products modify the following search path variables, prepending their pieces when set up. These search paths are needed by art jobs.
PATH: a colon-separated list of directories the shell uses when searching for programs to execute when you type their names at the command line. The command “which” tells you which version of a program is found first in the PATH search list. Example:
which lar
will tell you where the lar command you would execute is if you were to type “lar” at the command prompt.
The other paths are needed by art for finding plug-in libraries, fcl files, and other components, like gdml files:
- CET_PLUGIN_PATH
- LD_LIBRARY_PATH
- FHICL_FILE_PATH
- FW_SEARCH_PATH
Also, PYTHONPATH describes where Python modules will be loaded from.
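To see what products have prepended to one of these paths, print it one entry per line; a minimal sketch using FHICL_FILE_PATH:
echo $FHICL_FILE_PATH | tr ':' '\n'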
UPS basic commands
Command | Action
---|---
ups list -aK+ dunesw | List the versions and flavors of dunesw that exist on this node
ups active | Displays what has been set up
ups depend dunesw v09_65_01d00 -q e20:prof | Displays the dependencies for this version of dunesw
Exercise 4
- show all the dependencies of dunesw by using “ups depend dunesw $DUNESW_VERSION -q e20:prof”
UPS Documentation Links
mrb
What is mrb and why do we need it?
Early on, the LArSoft team chose git and cmake as the software version manager and the build language, respectively, to keep up with industry standards and to take advantage of their new features. When we clone a git repository to a local copy and check out the code, we end up building it all. We would like LArSoft and DUNE code to be more modular, or at least the builds should reflect some of the inherent modularity of the code.
Ideally, we would like to only have to recompile a fraction of the software stack when we make a change. The granularity of the build in LArSoft and other art-based projects is the repository. So LArSoft and DUNE have divided code up into multiple repositories (DUNE ought to divide more than it has, but there are a few repositories already with different purposes). Sometimes one needs to modify code in multiple repositories at the same time for a particular project. This is where mrb comes in.
mrb stands for “multi-repository build”. mrb has features for cloning git repositories, setting up build and local products environments, building code, and checking for consistency (i.e. that there are not two modules with the same name or two fcl files with the same name). mrb builds UPS products: when it installs the built code into the localProducts directory, it also makes the necessary UPS table files and .version directories. mrb also has a tool for making a tarball of a build product for distribution to the grid. The software build example later in this tutorial exercises some of the features of mrb.
Command | Action
---|---
mrb --help | prints list of all commands with brief descriptions
mrb <command> --help | displays help for that command
mrb gitCheckout | clone a repository into working area
mrbsetenv | set up build environment
mrb build -jN | builds local code with N cores
mrb b -jN | same as above
mrb install -jN | installs local code with N cores
mrb i -jN | same as above (this will do a build also)
mrbslp | set up all products in localProducts…
mrb z | get rid of everything in build area
Link to the mrb reference guide
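To make these commands concrete, here is a minimal sketch of a typical development cycle; the working directory name, core count, and checked-out repository are placeholders, and Friday's session walks through a real example:
mkdir my_dev_area && cd my_dev_area          # assumes the DUNE environment is already sourced
mrb newDev -v $DUNESW_VERSION -q e20:prof    # creates srcs/, a build area, and localProducts.../
source localProducts*/setup
cd srcs
mrb gitCheckout dunesw                       # clone a repository into the working area
cd $MRB_BUILDDIR
mrbsetenv                                    # set up the build environment
mrb install -j4                              # build and install into localProducts
mrbslp                                       # set up all products in localProducts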
Exercise 5
There is no exercise 5. mrb example exercises will be covered in Friday morning’s session as any useful exercise with mrb takes more than 30 minutes on its own. Everyone gets 100% credit for this exercise!
CVMFS
What is CVMFS and why do we need it?
DUNE has a need to distribute precompiled code to many different computers that collaborators may use. Installed products are needed for four things:
- Running programs interactively
- Running programs on grid nodes
- Linking programs to installed libraries
- Inspection of source code and data files
Results must be reproducible, so identical code and associated files must be distributed everywhere. DUNE does not own any batch resources – we use CPU time on computers that participating institutions donate to the Open Science Grid. We are not allowed to install our software on these computers and must return them to their original state when our programs finish running so they are ready for the next job from another collaboration.
CVMFS is a perfect tool for distributing software and related files. It stands for CernVM File System (VM is Virtual Machine). Local caches are provided on each target computer, and files are accessed via the /cvmfs mount point. DUNE software is in the directory /cvmfs/dune.opensciencegrid.org, and LArSoft code is in /cvmfs/larsoft.opensciencegrid.org. These directories are auto-mounted and become visible when one executes ls /cvmfs for the first time. Some software is also in /cvmfs/fermilab.opensciencegrid.org.
CVMFS also provides a de-duplication feature. If a given file is the same in all 100 releases of dunesw, it is only cached and transmitted once, not independently for every release. So it considerably decreases the size of code that has to be transferred.
When a file is accessed in /cvmfs, a daemon on the target computer wakes up and determines if the file is in the local cache, and delivers it if it is. If not, the daemon contacts the CVMFS repository server responsible for the directory, and fetches the file into local cache. In this sense, it works a lot like AFS. But it is a read-only filesystem on the target computers, and files must be published on special CVMFS publishing servers. Files may also be cached in a layer between the CVMFS host and the target node in a squid server, which helps facilities with many batch workers reduce the network load in fetching many copies of the same file, possibly over an international connection.
CVMFS also has a feature known as “Stashcache” or “xCache”. Files that are in /cvmfs/dune.osgstorage.org are not actually transmitted in their entirety, only pointers to them are, and then they are fetched from one of several regional cache servers or in the case of DUNE from Fermilab dCache directly. DUNE uses this to distribute photon library files, for instance.
CVMFS is by its nature read-all so code is readable by anyone in the world with a CVMFS client. CVMFS clients are available for download to desktops or laptops. Sensitive code can not be stored in CVMFS.
More information on CVMFS is available here
Exercise 6
- cd /cvmfs and do an ls at top level
- What do you see? Do you see the four subdirectories (dune.opensciencegrid.org, larsoft.opensciencegrid.org, fermilab.opensciencegrid.org, and dune.osgstorage.org)?
- cd dune.osgstorage.org/pnfs/fnal.gov/usr/dune/persistent/stash/PhotonPropagation/LibraryData
Useful links to bookmark
- Official dataset definitions: dune-data.fnal.gov
- UPS reference manual
- UPS documentation (redmine)
- UPS qualifiers: About Qualifiers (redmine)
- mrb reference guide (redmine)
- CVMFS on DUNE wiki: Access files in CVMFS
Key Points
SAM and Rucio are data handling systems used by the DUNE collaboration to retrieve data.
Staging is a necessary step to make sure files are on disk in dCache (as opposed to only on tape).
XRootD allows users to stream data files.
The Unix Product Setup (UPS) is a tool to ensure consistency between different software versions and reproducibility.
The multi-repository build (mrb) tool allows code modification in multiple repositories, which is relevant for a large project like LArSoft, where different use cases (end users and developers) demand consistency between the builds.
CVMFS (CernVM File System) distributes software and related files without installing them on the target computer.