Data Management (2024 updated for metacat/justin/rucio)
Overview
Teaching: 30 min
Exercises: 15 min
Questions
What are the data management tools and software for DUNE?
Objectives
Learn how to access data from DUNE Data Catalog.
Learn a bit about the JustIN workflow system for submitting batch jobs.
Session Video
The session video on December 10, 2025 was captured for your asynchronous review.
Introduction
What we need to do to produce accurate physics results
DUNE has a lot of data, which is processed through a complicated chain of steps. We try to abide by FAIR (Findable, Accessible, Interoperable, and Reusable) principles in our use of data.
Our DUNE Physics Analysis Review Procedures state that:
- Software must be documented, and committed to a repository accessible to the collaboration.
The preferred location is any repository managed within the official DUNE GitHub page: https://github.com/DUNE.
There should be sufficient instructions on how to reproduce the results included with the software. In particular, a good goal is that the working group conveners are able to remake plots, in case cosmetic changes need to be made. Software repositories should adhere to licensing and copyright guidelines detailed in DocDB-27141.
- Data and simulation samples must come from well-documented, reproducible production campaigns. For most analyses, input samples should be official, catalogued DUNE productions.
How we do it
DUNE official data samples are produced using released code, cataloged with metadata that describes the processing chain, and stored so that they are accessible to collaborators.
DUNE data is stored around the world and the storage elements are not always organized in a way that they can be easily inspected. For this purpose we use the metacat data catalog to describe the data and collections and the rucio file storage system to determine where replicas of files are. There is also a legacy SAM data access system that can be used for older files.
How can I help?
If you want to access data, this module will help you find and examine it.
If you want to process data using the full power of DUNE computing, you should talk to the data management group about methods for cataloging any data files you plan to produce. This will allow you to use DUNE’s collaborative storage capabilities to preserve and share your work with others and will be required for publication of results.
How to find and access official data
What is metacat?
Metacat is a file catalog - it allows you to search for files that have particular attributes and understand their provenance, including details on all of their processing steps. It also allows for querying jointly the file catalog and the DUNE conditions database.
You can find extensive documentation on metacat at: https://dune.github.io/DataCatalogDocs/
Find a file in metacat
DUNE runs multiple experiments (far detectors, protodune-sp, protodune-dp, hd-protodune, vd-protodune, iceberg, coldboxes… ), produces various kinds of data (mc/detector), and processes them through different phases.
To find your data you need to specify, at a minimum:
- core.run_type (the experiment)
- core.file_type (mc or detector)
- core.data_tier (the level of processing: raw, full-reconstructed, root-tuple)
and, when searching for specific types of data:
- core.data_stream (physics, calibration, cosmics)
- core.runs[any]=<runnumber>
Here is an example of a metacat query that gets you raw files from a recent ‘hd-protodune’ cosmics run.
Note: there are example setups that do a full setup in the extras folder:
First get metacat if you have not already done so
SL7
# If you have not already done a general SL7 software setup:
source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh
export DUNELAR_VERSION=v10_00_04d00
export DUNELAR_QUALIFIER=e26:prof
setup dunesw $DUNELAR_VERSION -q $DUNELAR_QUALIFIER
export METACAT_AUTH_SERVER_URL=https://metacat.fnal.gov:8143/auth/dune
export METACAT_SERVER_URL=https://metacat.fnal.gov:9443/dune_meta_prod/app
# then you can set up metacat and rucio
setup metacat
setup rucio
AL9
source /cvmfs/larsoft.opensciencegrid.org/spack-packages/setup-env.sh
spack load r-m-dd-config experiment=dune
For both
metacat auth login -m password $USER # use your services password to authenticate
Note: other means of authentication
Check out the metacat documentation for kx509 and token authentication.
Then run queries to find particular sets of files.
metacat query "files from dune:all where core.file_type=detector \
and core.run_type=hd-protodune and core.data_tier=raw \
and core.data_stream=cosmics and core.runs[any]=27296 limit 2"
This should give you 2 files:
hd-protodune:np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5
hd-protodune:np04hd_raw_run027296_0000_dataflow0_datawriter_0_20240619T110330.hdf5
The string before the ‘:’ is the namespace and the string after it is the filename.
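As a minimal sketch of that namespace:filename convention, the two parts can be separated with POSIX parameter expansion. The variable names `did`, `namespace`, and `filename` are illustrative only and are not part of metacat.

```shell
# Split an identifier of the form namespace:filename.
# The value is one of the two files returned by the query above.
did='hd-protodune:np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5'

namespace=${did%%:*}   # everything before the first ':'
filename=${did#*:}     # everything after the first ':'

echo "namespace: $namespace"
echo "filename:  $filename"
```

This can be handy when a script needs the bare filename, for example to name a local output file.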
You can find out more about your file by doing:
metacat file show -m -l hd-protodune:np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5
which gives you a lot of information:
checksums:
adler32 : 6a191436
created_timestamp : 2024-06-19 11:08:24.398197+00:00
creator : dunepro
fid : 83302138
name : np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5
namespace : hd-protodune
retired : False
retired_by : None
retired_timestamp : None
size : 4232017188
updated_by : None
updated_timestamp : 1718795304.398197
metadata:
core.data_stream : cosmics
core.data_tier : raw
core.end_time : 1718795024.0
core.event_count : 35
core.events : [3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47, 51, 55, 59, 63, 67, 71, 75, 79, 83, 87, 91, 95, 99, 103, 107, 111, 115, 119, 123, 127, 131, 135, 139]
core.file_content_status: good
core.file_format : hdf5
core.file_type : detector
core.first_event_number: 3
core.last_event_number: 139
core.run_type : hd-protodune
core.runs : [27296]
core.runs_subruns : [2729600001]
core.start_time : 1718795010.0
dune.daq_test : False
retention.class : physics
retention.status : active
children:
hd-protodune-det-reco:np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330_reco_stage1_20240621T175057_keepup_hists.root (eywzUgkZRZ6llTsU)
hd-protodune-det-reco:np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330_reco_stage1_reco_stage2_20240621T175057_keepup.root (GHSm3owITS20vn69)
Look in the glossary to see what those fields mean.
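If you want to pull individual values out of that listing in a script, a rough sketch is shown here. It assumes the plain "key : value" layout shown above and works on a few lines copied from it; the helper name `get_field` is hypothetical. For anything beyond a quick check, the `--json` output format is likely a more robust choice than parsing text.

```shell
# A few metadata lines copied from the listing above.
metadata='core.data_stream : cosmics
core.end_time : 1718795024.0
core.event_count : 35
core.start_time : 1718795010.0'

# Print the value for one key, splitting each line on " : ".
get_field() {
    printf '%s\n' "$metadata" | awk -F' *: *' -v key="$1" '$1 == key { print $2 }'
}

start=$(get_field core.start_time)
end=$(get_field core.end_time)
events=$(get_field core.event_count)

# The times are Unix timestamps in seconds, so the difference is the span covered by the file.
duration=$(awk -v s="$start" -v e="$end" 'BEGIN { print e - s }')
echo "$events events in $duration seconds"
```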
Find out how much raw data there is in a run using the summary option:
metacat query -s "files from dune:all where core.file_type=detector \
and core.run_type=hd-protodune and core.data_tier=raw \
and core.data_stream=cosmics and core.runs[any]=27296"
Files: 963
Total size: 4092539942264 (4.093 TB)
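To put those numbers in perspective, the short sketch below computes the average file size from the summary. It works on the two summary lines copied from above rather than on live output, and the variable names are illustrative.

```shell
# Summary text copied from the output above; in practice this would come
# from the `metacat query -s "..."` command.
summary='Files: 963
Total size: 4092539942264 (4.093 TB)'

nfiles=$(printf '%s\n' "$summary" | awk '/^Files:/ { print $2 }')
nbytes=$(printf '%s\n' "$summary" | awk '/^Total size:/ { print $3 }')

# Integer division is good enough for a rough per-file size.
avg_bytes=$((nbytes / nfiles))
echo "$nfiles files, about $avg_bytes bytes each"
```

The result is a little over 4 GB per file, which is consistent with the size field shown for the single file above.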
To look at all the files in that run you need to use XRootD. DO NOT TRY TO COPY 4 TB to your local area!
What is(was) SAM?
Sequential Access with Metadata (SAM) is/was a data handling system developed at Fermilab. It was designed to track the locations of files and other file metadata. It has been replaced by the combination of MetaCat and Rucio. New files are no longer declared to SAM, and any SAM locations after June of 2024 should be presumed to be wrong. SAM is still used in some legacy ProtoDUNE analyses.
What is Rucio?
Rucio is the next-generation Data Replica service and is part of DUNE’s new Distributed Data Management (DDM) system that is currently in deployment. Rucio has two functions:
- A rule-based system to get files to Rucio Storage Elements around the world and keep them there.
- To return the “nearest” replica of any data file for use in either interactive or batch jobs.
It is expected that most DUNE users will not regularly use Rucio commands directly, but will instead use wrapper scripts that call them indirectly.
As of the date of the December 2024 tutorial:
- The Rucio client is available in CVMFS and Spack
- Most DUNE users are now enabled to use it. New users may not automatically be added.
Let’s find a file
If you haven’t already done this earlier in setup
- On sl7 type
setup rucio
- On al9 type
spack load rucio-clients@33.3.0
# see above for r-m-dd-config which will always get the current version
# first get a kx509 proxy, then
export RUCIO_ACCOUNT=$USER
rucio list-file-replicas hd-protodune:np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5 --pfns --protocols=root
returns 3 locations:
root://dune.dcache.nikhef.nl:1094/pnfs/nikhef.nl/data/dune/generic/rucio/hd-protodune/e5/57/np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5
root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/tape_backed/dunepro//hd-protodune/raw/2024/detector/cosmics/None/00/02/72/96/np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5
root://eosctapublic.cern.ch:1094//eos/ctapublic/archive/neutplatform/protodune/rawdata/np04//hd-protodune/raw/2024/detector/cosmics/None/00/02/72/96/np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5
These are the locations of the file on disk and tape. We can use them to copy the file to our local disk or to access the file via XRootD.
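As a small illustration of how those replica URLs are put together, the sketch below extracts the storage host from each one using POSIX parameter expansion. The helper name `pfn_host` is hypothetical, and the three URLs are simply the ones listed above.

```shell
# Print the host part of a root://host:port/path URL.
pfn_host() {
    rest=${1#root://}        # drop the scheme
    hostport=${rest%%/*}     # keep everything up to the first '/'
    printf '%s\n' "${hostport%%:*}"   # drop the port
}

# Loop over the three replicas returned above.
while IFS= read -r pfn; do
    echo "$(pfn_host "$pfn")"
done <<'EOF'
root://dune.dcache.nikhef.nl:1094/pnfs/nikhef.nl/data/dune/generic/rucio/hd-protodune/e5/57/np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5
root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/tape_backed/dunepro//hd-protodune/raw/2024/detector/cosmics/None/00/02/72/96/np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5
root://eosctapublic.cern.ch:1094//eos/ctapublic/archive/neutplatform/protodune/rawdata/np04//hd-protodune/raw/2024/detector/cosmics/None/00/02/72/96/np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5
EOF
```

Seeing the host names side by side makes it easier to tell which site each replica lives at.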
Finding files by characteristics using metacat
To list raw data files for a given run:
metacat query "files where core.file_type=detector \
and core.run_type='protodune-sp' and core.data_tier=raw \
and core.data_stream=physics and core.runs[any] in (5141)"
- core.run_type tells you which of the many DAQs this came from.
- core.file_type distinguishes detector from mc.
- core.data_tier could be raw, full-reconstructed, or root-tuple: the same data in different formats.
protodune-sp:np04_raw_run005141_0013_dl7.root
protodune-sp:np04_raw_run005141_0005_dl3.root
protodune-sp:np04_raw_run005141_0003_dl1.root
protodune-sp:np04_raw_run005141_0004_dl7.root
...
protodune-sp:np04_raw_run005141_0009_dl7.root
protodune-sp:np04_raw_run005141_0014_dl11.root
protodune-sp:np04_raw_run005141_0007_dl6.root
protodune-sp:np04_raw_run005141_0011_dl8.root
Note the presence of both a namespace and a filename
What about some files from a reconstructed version?
metacat query "files from dune:all where core.file_type=detector \
and core.run_type='protodune-sp' and core.data_tier=full-reconstructed \
and core.data_stream=physics and core.runs[any] in (5141) and dune.campaign=PDSPProd4 limit 10"
pdsp_det_reco:np04_raw_run005141_0013_dl10_reco1_18127013_0_20210318T104043Z.root
pdsp_det_reco:np04_raw_run005141_0015_dl4_reco1_18126145_0_20210318T101646Z.root
pdsp_det_reco:np04_raw_run005141_0008_dl12_reco1_18127279_0_20210318T104635Z.root
pdsp_det_reco:np04_raw_run005141_0002_dl2_reco1_18126921_0_20210318T103516Z.root
pdsp_det_reco:np04_raw_run005141_0002_dl14_reco1_18126686_0_20210318T102955Z.root
pdsp_det_reco:np04_raw_run005141_0015_dl5_reco1_18126081_0_20210318T122619Z.root
pdsp_det_reco:np04_raw_run005141_0017_dl10_reco1_18126384_0_20210318T102231Z.root
pdsp_det_reco:np04_raw_run005141_0006_dl4_reco1_18127317_0_20210318T104702Z.root
pdsp_det_reco:np04_raw_run005141_0007_dl9_reco1_18126730_0_20210318T102939Z.root
pdsp_det_reco:np04_raw_run005141_0011_dl7_reco1_18127369_0_20210318T104844Z.root
To see the total number (and size) of files that match a certain query expression, add the -s option to metacat query.
See the metacat documentation (DataCatalogDocs) for more information about queries, and check out the glossary of common fields at MetaCatGlossary.
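When you run the same kind of query for several runs or data tiers, it can help to assemble the query string in one place. The sketch below is one way to do that; `build_query` is a hypothetical helper, not part of metacat, and it only covers the detector-data fields used in the examples above. The metacat call is skipped if the client is not set up.

```shell
# Assemble a query string from run type, data tier, data stream, and run number.
build_query() {
    printf 'files from dune:all where core.file_type=detector and core.run_type=%s and core.data_tier=%s and core.data_stream=%s and core.runs[any]=%s' \
        "$1" "$2" "$3" "$4"
}

query=$(build_query hd-protodune raw cosmics 27296)
echo "$query"

# Only run the query if the metacat client is available (and you are logged in).
if command -v metacat >/dev/null 2>&1; then
    metacat query -s "$query" || echo "metacat query failed (are you authenticated?)"
fi
```

Using -s here returns only the file count and total size, which is a cheap way to check a query before listing every file.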
Accessing data for use in your analysis
To access data without copying it, XRootD is the tool to use. However, it will work only if the file is staged to disk.
You can stream files worldwide if you have a DUNE VO certificate as described in the preparation part of this tutorial.
To learn more about using Rucio and Metacat to run over large data samples go here:
Full Justin/Rucio/Metacat Tutorial
Exercise 1
- Use metacat query .... to find a file from a particular experiment/run/processing stage. Look in DataCatalogDocs for hints on constructing queries.
- Use metacat file show -m -l namespace:filename to get metadata for this file. Note that --json gives the output in json format.
When we are analyzing large numbers of files in a group of batch jobs, we use a metacat dataset to describe the full set of files that we are going to analyze and use the JustIn system to run over that dataset. Each job will then come up and ask metacat and rucio to give it the next file in the list. It will try to find the nearest copy. For instance if you are running at CERN and analyzing this file it will automatically take it from the CERN storage space EOS.
Exercise 2 - explore in the gui
The Metacat Gui is a nice place to explore the data we have.
You need to log in with your services (not kerberos) password.
Do a datasets search of all namespaces for the word "official" in a dataset name.
You can then click on the datasets to see what they contain.
Exercise 3 - explore a dataset
Use metacat to find information about the dataset justin-tutorial:justin-tutorial-2024. How many files are in it, and what is the total size? (Use the metacat dataset show and metacat dataset files commands.) Use rucio to find one of the files in it.
Resources:
Quiz
Question 01
What is file metadata?
- Information about how and when a file was made
- Information about what type of data the file contains
- Conditions such as liquid argon temperature while the file was being written
- Both A and B
- All of the above
Answer
The correct answer is D - Both A and B.
Question 02
How do we determine a DUNE data file location?
- Do `ls -R` on /pnfs/dune and grep
- Use `rucio list-file-replicas` (namespace:filename) --pfns --protocols=root
- Ask the data management group
- None of the Above
Answer
The correct answer is B - use rucio list-file-replicas (namespace:filename).
Useful links to bookmark
- DataCatalog: https://dune.github.io/DataCatalogDocs
- metacat: https://dune.github.io/DataCatalogDocs/
- rucio: https://rucio.github.io/documentation/
- Pre-2024 Official dataset definitions: dune-data.fnal.gov
- UPS reference manual
- UPS documentation (redmine)
- UPS qualifiers: About Qualifiers (redmine)
- mrb reference guide (redmine)
- CVMFS on DUNE wiki: Access files in CVMFS
Key Points
MetaCat and Rucio are the data catalog and replica systems the DUNE collaboration uses to find and retrieve data; SAM is a legacy system still used in some older analyses.
Staging is a necessary step to make sure files are on disk in dCache (as opposed to only on tape).
XRootD allows users to stream data files.