
Data Management (updated in 2024 for metacat/justin/rucio)

Overview

Teaching: 40 min
Exercises: 0 min
Questions
  • What are the data management tools and software for DUNE?

Objectives
  • Learn how to access data from the DUNE Data Catalog.

  • Learn a bit about the JustIN workflow system for submitting batch jobs.

Introduction

DUNE data is stored around the world, and the storage elements are not always organized in a way that can be easily inspected. For this purpose we use the metacat data catalog to describe the data and collections, and rucio to determine where replicas of files are located. There is also a legacy SAM data access system that can be used for older files.

What is metacat?

Metacat is a file catalog - it allows you to search for files that have particular attributes and understand their provenance, including details on all of their processing steps.

You can find extensive documentation on metacat at:

General metacat documentation

DUNE metacat examples

Find a file in metacat

DUNE runs multiple experiments (far detectors, protodune-sp, protodune-dp, hd-protodune, vd-protodune, iceberg, coldboxes…), produces various kinds of data (mc/detector), and processes them through different phases.

To find your data you need to specify, at a minimum:

  • core.file_type (detector or mc)
  • core.run_type (which experiment, e.g. hd-protodune)

and, when searching for specific types of data:

  • core.data_tier (e.g. raw or full-reconstructed)
  • core.data_stream (e.g. cosmics or physics)
  • core.runs (the run number or numbers)

Here is an example of a metacat query that gets you raw files from a recent ‘hd-protodune’ cosmics run.

First, set up metacat if you have not already done so:

setup metacat                         # if using SL7
spack load metacat                    # if using Alma9
export METACAT_AUTH_SERVER_URL=https://metacat.fnal.gov:8143/auth/dune
export METACAT_SERVER_URL=https://metacat.fnal.gov:9443/dune_meta_prod/app
metacat auth login -m password $USER  # use your Services password to authenticate
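
You can check that authentication worked; this should print your username and the token expiration:

metacat auth whoami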

Note: other means of authentication

Check out the metacat documentation for kx509 and token authentication.

Then do queries to find particular sets of files:

metacat query "files from dune:all where core.file_type=detector \
 and core.run_type=hd-protodune and core.data_tier=raw \
 and core.data_stream=cosmics and core.runs[any]=27296 limit 2"

This should give you two files:

hd-protodune:np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5
hd-protodune:np04hd_raw_run027296_0000_dataflow0_datawriter_0_20240619T110330.hdf5

The string before the ‘:’ is the namespace and the string after it is the filename.

You can find out more about your file by doing:

metacat file show -m -l hd-protodune:np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5

which gives you a lot of information:

checksums:
    adler32   : 6a191436
created_timestamp   :	2024-06-19 11:08:24.398197+00:00
creator             :	dunepro
fid                 :	83302138
name                :	np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5
namespace           :	hd-protodune
retired             :	False
retired_by          :	None
retired_timestamp   :	None
size                :	4232017188
updated_by          :	None
updated_timestamp   :	1718795304.398197
metadata:
    core.data_stream    : cosmics
    core.data_tier      : raw
    core.end_time       : 1718795024.0
    core.event_count    : 35
    core.events         : [3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47, 51, 55, 59, 63, 67, 71, 75, 79, 83, 87, 91, 95, 99, 103, 107, 111, 115, 119, 123, 127, 131, 135, 139]
    core.file_content_status: good
    core.file_format    : hdf5
    core.file_type      : detector
    core.first_event_number: 3
    core.last_event_number: 139
    core.run_type       : hd-protodune
    core.runs           : [27296]
    core.runs_subruns   : [2729600001]
    core.start_time     : 1718795010.0
    dune.daq_test       : False
    retention.class     : physics
    retention.status    : active
children:
   hd-protodune-det-reco:np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330_reco_stage1_20240621T175057_keepup_hists.root (eywzUgkZRZ6llTsU)
   hd-protodune-det-reco:np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330_reco_stage1_reco_stage2_20240621T175057_keepup.root (GHSm3owITS20vn69)

Look in the glossary to see what those fields mean.

Find out how much raw data there is in a run by using the summary option (-s):

metacat query -s "files from dune:all where core.file_type=detector \
 and core.run_type=hd-protodune and core.data_tier=raw \
 and core.data_stream=cosmics and core.runs[any]=27296"
Files:        963
Total size:   4092539942264 (4.093 TB)

To look at all the files in that run you need to use XRootD. DO NOT TRY TO COPY 4 TB to your local area!

What is (was) SAM?

Sequential Access with Metadata (SAM) is a legacy data handling system developed at Fermilab, designed to track locations of files and other file metadata. It has been replaced by metacat and rucio.

What is Rucio?

Rucio is the next-generation data replica service and is part of DUNE’s Distributed Data Management (DDM) system, which is currently in deployment. Rucio has two functions:

  1. A rule-based system that gets files to Rucio Storage Elements around the world and keeps them there.
  2. Returning the “nearest” replica of any data file for interactive or batch use. Most DUNE users will not use direct Rucio commands regularly, but rather wrapper scripts that call them.


Let’s find a file

If you haven’t already done this earlier in setup:

# first get a kx509 proxy, then
export RUCIO_ACCOUNT=$USER

rucio list-file-replicas hd-protodune:np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5 --pfns --protocols=root

returns 3 locations:

root://eospublic.cern.ch:1094//eos/experiment/neutplatform/protodune/dune/hd-protodune/e5/57/np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5
root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/tape_backed/dunepro//hd-protodune/raw/2024/detector/cosmics/None/00/02/72/96/np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5
root://eosctapublic.cern.ch:1094//eos/ctapublic/archive/neutplatform/protodune/rawdata/np04//hd-protodune/raw/2024/detector/cosmics/None/00/02/72/96/np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5

These are the locations of the file on disk and tape. We can use them to copy the file to our local disk or to access the file via XRootD.
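
For example, with the standard XRootD client tools you can inspect or fetch a single replica directly. This is a minimal sketch using the CERN EOS replica listed above:

# check that the replica is reachable (prints its size and flags)
xrdfs eospublic.cern.ch:1094 stat /eos/experiment/neutplatform/protodune/dune/hd-protodune/e5/57/np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5

# or copy just this one file to a scratch area
xrdcp root://eospublic.cern.ch:1094//eos/experiment/neutplatform/protodune/dune/hd-protodune/e5/57/np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5 /tmp/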

Finding files by characteristics using metacat

To list raw data files for a given run:

metacat query "files where core.file_type=detector \
 and core.run_type='protodune-sp' and core.data_tier=raw \
 and core.data_stream=physics and core.runs[any] in (5141)"
protodune-sp:np04_raw_run005141_0013_dl7.root
protodune-sp:np04_raw_run005141_0005_dl3.root
protodune-sp:np04_raw_run005141_0003_dl1.root
protodune-sp:np04_raw_run005141_0004_dl7.root
...
protodune-sp:np04_raw_run005141_0009_dl7.root
protodune-sp:np04_raw_run005141_0014_dl11.root
protodune-sp:np04_raw_run005141_0007_dl6.root
protodune-sp:np04_raw_run005141_0011_dl8.root

Note the presence of both a namespace and a filename.

What about some files from a reconstructed version?

metacat query "files from dune:all where core.file_type=detector \
 and core.run_type='protodune-sp' and core.data_tier=full-reconstructed  \
 and core.data_stream=physics and core.runs[any] in (5141) and dune.campaign=PDSPProd4 limit 10" 
pdsp_det_reco:np04_raw_run005141_0013_dl10_reco1_18127013_0_20210318T104043Z.root
pdsp_det_reco:np04_raw_run005141_0015_dl4_reco1_18126145_0_20210318T101646Z.root
pdsp_det_reco:np04_raw_run005141_0008_dl12_reco1_18127279_0_20210318T104635Z.root
pdsp_det_reco:np04_raw_run005141_0002_dl2_reco1_18126921_0_20210318T103516Z.root
pdsp_det_reco:np04_raw_run005141_0002_dl14_reco1_18126686_0_20210318T102955Z.root
pdsp_det_reco:np04_raw_run005141_0015_dl5_reco1_18126081_0_20210318T122619Z.root
pdsp_det_reco:np04_raw_run005141_0017_dl10_reco1_18126384_0_20210318T102231Z.root
pdsp_det_reco:np04_raw_run005141_0006_dl4_reco1_18127317_0_20210318T104702Z.root
pdsp_det_reco:np04_raw_run005141_0007_dl9_reco1_18126730_0_20210318T102939Z.root
pdsp_det_reco:np04_raw_run005141_0011_dl7_reco1_18127369_0_20210318T104844Z.root

To see the total number of files (and the total size) that match a query expression, add the -s option to metacat query.

See the metacat documentation (DataCatalogDocs) for more information about queries, and check out the glossary of common fields (MetaCatGlossary).

Accessing data for use in your analysis

To access data without copying it, XRootD is the tool to use. However, it will work only if the file is staged to disk (not just on tape).

You can stream files worldwide if you have a DUNE VO certificate as described in the preparation part of this tutorial.
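
Getting a proxy looks roughly like the sketch below; the authoritative steps are in the setup instructions, so treat this as a reminder rather than a recipe:

# obtain a kx509 certificate, then a DUNE VOMS proxy
kx509
voms-proxy-init -rfc -noregen -voms dune:/dune/Role=Analysis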

To learn more about using Rucio and Metacat to run over large data samples go here:

Full Justin/Rucio/Metacat Tutorial

justin tutorial

Exercise 1

  • Use metacat to find a file from a particular experiment/run/processing stage
  • Use metacat file show -m namespace:filename to get metadata for this file. Note that --json gives the output in json format.

When we are analyzing large numbers of files in a group of batch jobs, we use a metacat dataset to describe the full set of files we are going to analyze, and use the JustIN system to run over that dataset (as sketched below). Each job then comes up and asks metacat and rucio for the next file in the list, trying to find the nearest copy. For instance, if you are running at CERN and analyzing this file, it will automatically be taken from the CERN storage space EOS.
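
A JustIN submission looks roughly like the sketch below (modeled on the JustIN tutorial linked above). The --mql option carries a metacat query; the dataset name and jobscript name here are hypothetical:

justin simple-workflow \
  --mql "files from hd-protodune:raw_cosmics_2024 limit 10" \
  --jobscript my-analysis.jobscript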

Exercise 2

FIXME Need to make an example of looking at a dataset
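
Until that example is ready, here is a rough sketch of dataset-level commands; the namespace and dataset name below are illustrative, not real:

# list datasets matching a pattern in a namespace
metacat dataset list "hd-protodune:*"

# file count and total size for one dataset
metacat query -s "files from hd-protodune:raw_cosmics_2024"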


Quiz

Question 01

What is file metadata?

  A. Information about how and when a file was made
  B. Information about what type of data the file contains
  C. Conditions such as liquid argon temperature while the file was being written
  D. Both A and B
  E. All of the above

Answer

The correct answer is D - Both A and B.


Question 02

How do we determine a DUNE data file location?

  A. Do `ls -R` on /pnfs/dune and grep
  B. Use `rucio list-file-replicas namespace:filename --pfns --protocols=root`
  C. Ask the data management group
  D. None of the above

Answer

The correct answer is B - use `rucio list-file-replicas namespace:filename --pfns --protocols=root`.


Key Points

  • Metacat (the file catalog) and Rucio (the replica manager) are the data handling systems used by the DUNE collaboration to find and retrieve data; SAM is the legacy system.

  • Staging is a necessary step to make sure files are on disk in dCache (as opposed to only on tape).

  • XRootD allows users to stream data files.