Data Management (2024 updated for metacat/justin/rucio)
Overview
Teaching: 40 min
Exercises: 0 min
Questions
What are the data management tools and software for DUNE?
Objectives
Learn how to access data from DUNE Data Catalog.
Learn a bit about the JustIN workflow system for submitting batch jobs.
Introduction
DUNE data is stored around the world and the storage elements are not always organized in a way that they can be easily inspected. For this purpose we use the metacat data catalog to describe the data and collections and rucio to determine where replicas of files are. There is also a legacy SAM data access system that can be used for older files.
What is metacat
Metacat is a file catalog - it allows you to search for files that have particular attributes and understand their provenance, including details on all of their processing steps.
You can find extensive documentation on metacat at: https://dune.github.io/DataCatalogDocs
Find a file in metacat
DUNE runs multiple experiments (far detectors, protodune-sp, protodune-dp, hd-protodune, vd-protodune, iceberg, coldboxes…), produces various kinds of data (mc/detector) and processes them through different phases.
To find your data you need to specify at the minimum
- core.run_type (the experiment)
- core.file_type (mc or detector)
- core.data_tier (the level of processing: raw, full-reconstructed, root-tuple)
and when searching for specific types of data
- core.data_stream (physics, calibration, cosmics)
- core.runs[any]=<runnumber>
Here is an example of a metacat query that gets you raw files from a recent ‘hd-protodune’ cosmics run.
First get metacat if you have not already done so, set the server addresses, and then authenticate:
setup metacat # if using SL7
spack load metacat # if using Alma9
export METACAT_AUTH_SERVER_URL=https://metacat.fnal.gov:8143/auth/dune
export METACAT_SERVER_URL=https://metacat.fnal.gov:9443/dune_meta_prod/app
metacat auth login -m password $USER # use your services password to authenticate
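Before running queries it can help to confirm that both server variables from the setup commands are actually set in your session. The following is a minimal sketch using only the Python standard library; the helper name `missing_metacat_vars` is hypothetical and is not part of the metacat client.

```python
import os

# The two variables exported in the setup commands above.
REQUIRED_VARS = ("METACAT_AUTH_SERVER_URL", "METACAT_SERVER_URL")


def missing_metacat_vars(env=None):
    """Return the required metacat variables that are unset or empty.

    `env` defaults to the current process environment; passing a plain
    dict makes the check easy to try out without touching your shell.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_metacat_vars()
if missing:
    print("Not set:", ", ".join(missing))
else:
    print("metacat server variables are set")
```

This only checks that the variables exist. It does not contact the server or verify that your login succeeded.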
Note: other means of authentication
Check out the metacat documentation for kx509 and token authentication.
Then run queries to find particular sets of files.
metacat query "files from dune:all where core.file_type=detector \
and core.run_type=hd-protodune and core.data_tier=raw \
and core.data_stream=cosmics and core.runs[any]=27296 limit 2"
This should give you 2 files:
hd-protodune:np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5
hd-protodune:np04hd_raw_run027296_0000_dataflow0_datawriter_0_20240619T110330.hdf5
The string before the ‘:’ is the namespace and the string after it is the filename.
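If you handle these identifiers in a script, the namespace and filename can be separated at the first colon. Here is a small sketch; the function name `split_did` is a hypothetical helper for illustration, not a metacat or Rucio call.

```python
def split_did(did):
    """Split 'namespace:filename' at the first colon.

    Raises ValueError when either part is missing, so that a bare
    filename is not silently accepted.
    """
    namespace, sep, name = did.partition(":")
    if not sep or not namespace or not name:
        raise ValueError(f"expected 'namespace:filename', got {did!r}")
    return namespace, name


# One of the two files returned by the query above.
did = "hd-protodune:np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5"
namespace, name = split_did(did)
print("namespace:", namespace)
print("filename: ", name)
```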
You can find out more about your file by doing:
metacat file show -m -l hd-protodune:np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5
which gives you a lot of information:
checksums:
adler32 : 6a191436
created_timestamp : 2024-06-19 11:08:24.398197+00:00
creator : dunepro
fid : 83302138
name : np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5
namespace : hd-protodune
retired : False
retired_by : None
retired_timestamp : None
size : 4232017188
updated_by : None
updated_timestamp : 1718795304.398197
metadata:
core.data_stream : cosmics
core.data_tier : raw
core.end_time : 1718795024.0
core.event_count : 35
core.events : [3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47, 51, 55, 59, 63, 67, 71, 75, 79, 83, 87, 91, 95, 99, 103, 107, 111, 115, 119, 123, 127, 131, 135, 139]
core.file_content_status: good
core.file_format : hdf5
core.file_type : detector
core.first_event_number: 3
core.last_event_number: 139
core.run_type : hd-protodune
core.runs : [27296]
core.runs_subruns : [2729600001]
core.start_time : 1718795010.0
dune.daq_test : False
retention.class : physics
retention.status : active
children:
hd-protodune-det-reco:np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330_reco_stage1_20240621T175057_keepup_hists.root (eywzUgkZRZ6llTsU)
hd-protodune-det-reco:np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330_reco_stage1_reco_stage2_20240621T175057_keepup.root (GHSm3owITS20vn69)
Look in the glossary to see what those fields mean.
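A few of these fields can be cross-checked against each other. The sketch below copies values by hand from the listing above into a dictionary. It assumes that core.start_time and core.end_time are Unix epoch seconds in UTC, which is consistent with how the listing pairs updated_timestamp with created_timestamp. It does not call metacat.

```python
from datetime import datetime, timezone

# Values copied by hand from the `metacat file show` listing above.
metadata = {
    "core.start_time": 1718795010.0,
    "core.end_time": 1718795024.0,
    "core.event_count": 35,
    "core.first_event_number": 3,
    "core.last_event_number": 139,
    "core.events": list(range(3, 140, 4)),  # 3, 7, 11, ..., 139 as listed
}


def to_utc(epoch_seconds):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


start = to_utc(metadata["core.start_time"])
duration_s = metadata["core.end_time"] - metadata["core.start_time"]

events = metadata["core.events"]
events_consistent = (
    len(events) == metadata["core.event_count"]
    and events[0] == metadata["core.first_event_number"]
    and events[-1] == metadata["core.last_event_number"]
)

print("start (UTC):", start.isoformat())
print("duration (s):", duration_s)
print("event fields consistent:", events_consistent)
```

The start time decodes to 11:03:30 UTC on 2024-06-19, which matches the 20240619T110330 stamp in the filename.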
You can find out how much raw data there is in a run by using the summary option (-s):
metacat query -s "files from dune:all where core.file_type=detector \
and core.run_type=hd-protodune and core.data_tier=raw \
and core.data_stream=cosmics and core.runs[any]=27296"
Files: 963
Total size: 4092539942264 (4.093 TB)
To look at all the files in that run you need to use XRootD - DO NOT TRY TO COPY 4 TB to your local area!!!
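Some quick arithmetic on the summary numbers shows why copying the whole run is impractical. The transfer rate used below is a made-up round number for illustration only, not a measured DUNE figure.

```python
# Numbers taken from the `metacat query -s` summary above.
n_files = 963
total_bytes = 4092539942264

# Hypothetical sustained transfer rate, chosen only for illustration.
assumed_rate_bytes_per_s = 100e6  # 100 MB/s

avg_file_bytes = total_bytes / n_files
copy_hours = total_bytes / assumed_rate_bytes_per_s / 3600

print(f"average file size: {avg_file_bytes / 1e9:.2f} GB")
print(f"time to copy the run at the assumed rate: {copy_hours:.1f} hours")
```

Each raw file is a bit over 4 GB, so even a handful of them can fill a typical interactive work area.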
What is(was) SAM?
Sequential Access with Metadata (SAM) is a legacy data handling system developed at Fermilab. It was designed to track the locations of files and other file metadata. It has been replaced by metacat.
What is Rucio?
Rucio is the next-generation Data Replica service and is part of DUNE’s new Distributed Data Management (DDM) system that is currently in deployment. Rucio has two functions:
- A rule-based system to get files to Rucio Storage Elements around the world and keep them there.
- To return the “nearest” replica of any data file for use in either interactive or batch jobs.
It is expected that most DUNE users will not regularly use Rucio commands directly, but will instead use wrapper scripts that call them indirectly.
As of the date of the June 2024 tutorial:
- The Rucio client is available in CVMFS
- Most DUNE users are now enabled to use it. New users may not automatically be added.
Let’s find a file
If you haven’t already done this earlier in setup
- On sl7 type
setup rucio
- On al9 type
spack load rucio-clients@33.3.0
# may need to update that version #
# first get a kx509 proxy, then
export RUCIO_ACCOUNT=$USER
rucio list-file-replicas hd-protodune:np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5 --pfns --protocols=root
returns 3 locations:
root://eospublic.cern.ch:1094//eos/experiment/neutplatform/protodune/dune/hd-protodune/e5/57/np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5
root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/tape_backed/dunepro//hd-protodune/raw/2024/detector/cosmics/None/00/02/72/96/np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5
root://eosctapublic.cern.ch:1094//eos/ctapublic/archive/neutplatform/protodune/rawdata/np04//hd-protodune/raw/2024/detector/cosmics/None/00/02/72/96/np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5
These are the locations of the file on disk and tape. We can use them to copy the file to our local disk or to access the file via XRootD.
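When several replicas come back, a script can choose one by host name. The sketch below is a toy selection and not how Rucio itself ranks replicas. The helper names are hypothetical, and the `looks_like_tape` test is a guess based only on the three paths shown above. The final line prints an `xrdcp` command for copying that single file. It does not run it.

```python
import shlex
from urllib.parse import urlsplit

# The three PFNs returned by the rucio command above.
pfns = [
    "root://eospublic.cern.ch:1094//eos/experiment/neutplatform/protodune/dune/hd-protodune/e5/57/np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5",
    "root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/tape_backed/dunepro//hd-protodune/raw/2024/detector/cosmics/None/00/02/72/96/np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5",
    "root://eosctapublic.cern.ch:1094//eos/ctapublic/archive/neutplatform/protodune/rawdata/np04//hd-protodune/raw/2024/detector/cosmics/None/00/02/72/96/np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5",
]


def replica_host(pfn):
    """Return the host name part of a root:// URL."""
    return urlsplit(pfn).hostname


def looks_like_tape(pfn):
    """Heuristic only: an assumption drawn from the example paths."""
    return "tape_backed" in pfn or "ctapublic" in pfn


def pick_replica(pfns, preferred_domain):
    """Prefer likely-disk replicas, then the first host in the domain."""
    if not pfns:
        raise ValueError("no replicas to choose from")
    # sorted() is stable, so likely-disk entries keep their order and come first.
    disk_first = sorted(pfns, key=looks_like_tape)
    for pfn in disk_first:
        if replica_host(pfn).endswith(preferred_domain):
            return pfn
    return disk_first[0]


chosen = pick_replica(pfns, "cern.ch")
local_name = chosen.rsplit("/", 1)[-1]
print(shlex.join(["xrdcp", chosen, local_name]))
```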
Finding files by characteristics using metacat
To list raw data files for a given run:
metacat query "files where core.file_type=detector \
and core.run_type='protodune-sp' and core.data_tier=raw \
and core.data_stream=physics and core.runs[any] in (5141)"
- core.run_type tells you which of the many DAQs this came from.
- core.file_type tells detector from mc.
- core.data_tier could be raw, full-reconstructed, root-tuple. Same data, different formats.
protodune-sp:np04_raw_run005141_0013_dl7.root
protodune-sp:np04_raw_run005141_0005_dl3.root
protodune-sp:np04_raw_run005141_0003_dl1.root
protodune-sp:np04_raw_run005141_0004_dl7.root
...
protodune-sp:np04_raw_run005141_0009_dl7.root
protodune-sp:np04_raw_run005141_0014_dl11.root
protodune-sp:np04_raw_run005141_0007_dl6.root
protodune-sp:np04_raw_run005141_0011_dl8.root
Note the presence of both a namespace and a filename
What about some files from a reconstructed version?
metacat query "files from dune:all where core.file_type=detector \
and core.run_type='protodune-sp' and core.data_tier=full-reconstructed \
and core.data_stream=physics and core.runs[any] in (5141) and dune.campaign=PDSPProd4 limit 10"
pdsp_det_reco:np04_raw_run005141_0013_dl10_reco1_18127013_0_20210318T104043Z.root
pdsp_det_reco:np04_raw_run005141_0015_dl4_reco1_18126145_0_20210318T101646Z.root
pdsp_det_reco:np04_raw_run005141_0008_dl12_reco1_18127279_0_20210318T104635Z.root
pdsp_det_reco:np04_raw_run005141_0002_dl2_reco1_18126921_0_20210318T103516Z.root
pdsp_det_reco:np04_raw_run005141_0002_dl14_reco1_18126686_0_20210318T102955Z.root
pdsp_det_reco:np04_raw_run005141_0015_dl5_reco1_18126081_0_20210318T122619Z.root
pdsp_det_reco:np04_raw_run005141_0017_dl10_reco1_18126384_0_20210318T102231Z.root
pdsp_det_reco:np04_raw_run005141_0006_dl4_reco1_18127317_0_20210318T104702Z.root
pdsp_det_reco:np04_raw_run005141_0007_dl9_reco1_18126730_0_20210318T102939Z.root
pdsp_det_reco:np04_raw_run005141_0011_dl7_reco1_18127369_0_20210318T104844Z.root
To see the total number of files that match a certain query expression, add the -s option to metacat query.
See the metacat documentation (DataCatalogDocs) for more information about queries, and check out the glossary of common fields (MetaCatGlossary).
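The queries in this lesson all share one shape: a fixed set of core.* conditions joined with "and", plus an optional limit. A small sketch of assembling such a string in a script follows. The function `build_query` is a hypothetical helper, not part of the metacat client, and it leaves values unquoted as in the first example of this lesson.

```python
def build_query(run_type, file_type, data_tier,
                data_stream=None, run=None, limit=None):
    """Assemble a metacat query string from the fields used in this lesson."""
    conditions = [
        f"core.file_type={file_type}",
        f"core.run_type={run_type}",
        f"core.data_tier={data_tier}",
    ]
    if data_stream is not None:
        conditions.append(f"core.data_stream={data_stream}")
    if run is not None:
        # int() rejects anything that is not a plain run number.
        conditions.append(f"core.runs[any]={int(run)}")
    query = "files from dune:all where " + " and ".join(conditions)
    if limit is not None:
        query += f" limit {int(limit)}"
    return query


query = build_query("hd-protodune", "detector", "raw",
                    data_stream="cosmics", run=27296, limit=2)
print(query)
```

The printed string can be passed in double quotes to metacat query, or to metacat query -s for a summary.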
Accessing data for use in your analysis
To access data without copying it, XRootD is the tool to use. However, it will work only if the file is staged to disk.
You can stream files worldwide if you have a DUNE VO certificate as described in the preparation part of this tutorial.
To learn more about using Rucio and Metacat to run over large data samples go here:
Full Justin/Rucio/Metacat Tutorial
Exercise 1
- Use metacat to find a file from a particular experiment/run/processing stage.
- Use metacat file show -m namespace:filename to get metadata for this file. Note that --json gives the output in json format.
When we are analyzing large numbers of files in a group of batch jobs, we use a metacat dataset to describe the full set of files that we are going to analyze and use the JustIN system to run over that dataset. Each job will then come up and ask metacat and rucio to give it the next file in the list. It will try to find the nearest copy. For instance, if you are running at CERN and analyzing this file, it will automatically take it from the CERN storage space, EOS.
Exercise 2
FIXME Need to make an example of looking at a dataset
Resources:
Quiz
Question 01
What is file metadata?
- A. Information about how and when a file was made
- B. Information about what type of data the file contains
- C. Conditions such as liquid argon temperature while the file was being written
- D. Both A and B
- E. All of the above
Answer
The correct answer is D - Both A and B.
Question 02
How do we determine a DUNE data file location?
- A. Do `ls -R` on /pnfs/dune and grep
- B. Use `rucio list-file-replicas` (namespace:filename) --pfns --protocols=root
- C. Ask the data management group
- D. None of the above
Answer
The correct answer is B - use rucio list-file-replicas (namespace:filename).
Useful links to bookmark
- Metacat: https://dune.github.io/DataCatalogDocs
- Pre-2024 Official dataset definitions: dune-data.fnal.gov
- UPS reference manual
- UPS documentation (redmine)
- UPS qualifiers: About Qualifiers (redmine)
- mrb reference guide (redmine)
- CVMFS on DUNE wiki: Access files in CVMFS
Key Points
Metacat and Rucio are the data management tools used by the DUNE collaboration to catalog and locate data; SAM is the legacy system for older files.
Staging is a necessary step to make sure files are on disk in dCache (as opposed to only on tape).
XRootD allows users to stream data files.