Output format
=============

``larnd-sim`` generates a realistic datastream and saves it to an HDF5 file.
HDF5 is a widely used, high-performance file format; see `the HDF5 website <https://www.hdfgroup.org/solutions/hdf5>`_ for more details.
Internally, ``larnd-sim`` uses `h5py <https://www.h5py.org>`_, an open-source
Python package, to manage access to the output file. We highly
recommend it if you are getting started with HDF5.

Charge data
-----------

For the charge simulation output, ``larnd-sim`` uses the same datasets
as generated by ``larpix-control``, so that downstream analysis code needs
only minimal modification when switching between data and simulation.

You can read more about this format `here <https://larpix-control.readthedocs.io/en/stable/api/format/hdf5format.html>`_.

In addition to the datasets defined in the above link, ``larnd-sim`` adds
true particle association data in the ``mc_packets_assn`` dataset. This
dataset is a 2-dimensional array with the same first dimension as the
``packets`` dataset and a second dimension that runs over the edep-sim track
segments that contributed to the ADC value. The dataset has two fields:
``track_ids``, the index into the ``tracks`` dataset for each
entry, and ``fraction``, the fraction of the ADC value that can be attributed
to that edep-sim segment. Because an arbitrary number of track segments can
contribute to each trigger, this dataset is a "ragged" array: unused entries
are null and are marked by ``track_ids == -1``.
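
As a minimal sketch of how the null entries can be handled, the example below
builds a small synthetic stand-in for ``mc_packets_assn`` and selects, for each
packet, the track segment with the largest ``fraction`` while ignoring the
``track_ids == -1`` entries. The values, the field dtypes, and the helper name
``dominant_track`` are assumptions made for illustration and are not part of
``larnd-sim``; with a real file the array would come from
``f['mc_packets_assn'][:]``.

```python
import numpy as np

# Synthetic stand-in for f['mc_packets_assn'][:] (dtypes are assumed):
# 3 packets, up to 4 contributing segments each.
assn_dtype = np.dtype([('track_ids', 'i8'), ('fraction', 'f8')])
assn = np.zeros((3, 4), dtype=assn_dtype)
assn['track_ids'] = -1                   # null entries fill the ragged array
assn['track_ids'][0, :2] = [5, 9]        # packet 0: two contributing segments
assn['fraction'][0, :2] = [0.75, 0.25]
assn['track_ids'][1, :1] = [2]           # packet 1: one contributing segment
assn['fraction'][1, :1] = [1.0]
# packet 2 is left without any associated segment


def dominant_track(assn):
    """Hypothetical helper: track id with the largest fraction per packet."""
    track_ids = assn['track_ids']
    valid = track_ids != -1
    # Replace the fraction of null entries with -inf so they never win argmax
    fraction = np.where(valid, assn['fraction'], -np.inf)
    best = np.argmax(fraction, axis=1)
    result = track_ids[np.arange(len(assn)), best]
    # Packets without any valid entry are reported as -1
    result[~valid.any(axis=1)] = -1
    return result


# Number of contributing segments per packet, not counting null entries
n_contributors = (assn['track_ids'] != -1).sum(axis=1)
dominant = dominant_track(assn)
```

The resulting ``dominant`` values are indices into the ``tracks`` dataset, so
they can be used directly to look up the corresponding edep-sim segments.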

Light data
----------

For the light simulation output, ``larnd-sim`` uses a data structure analogous
to the ADC64 format generated by the DUNE ND-LAr light system readout
electronics. The datasets are summarized below:

 - ``light_trig``, shape ``(n_triggers,)``: metadata associated with each light system self-trigger

   - ``op_channel``, shape ``(n_optical_channels_per_trig,)``: 32-bit integers indicating the optical channel IDs included in this trigger

   - ``ts_s``: 64-bit double indicating the global timestamp of the trigger in seconds

   - ``ts_sync``: 64-bit unsigned integer indicating the larpix timestamp of the trigger

 - ``light_wvfm``, shape ``(n_triggers, n_optical_channels_per_trig, n_adc_samples)``: the ADC samples of each light system self-trigger

 - ``light_dat``, shape ``(n_edepsim_segments, n_optical_channels)``: the true number of photoelectrons and first photon arrival time on each optical channel generated by each edep-sim track segment

   - ``n_photons_det``: number of photoelectrons generated by the track segment

   - ``t0_det``: arrival time of first photon from the track segment

 - ``light_wvfm_mc_assn``, shape ``(n_triggers, n_optical_channels_per_trig, n_adc_samples)`` (only generated if full MC truth propagation is enabled): the true contribution of each edep-sim track segment to each ADC sample in the light waveform data

   - ``track_ids``, shape ``(n_max_tracks,)``: index of true edep-sim segment contributing to ADC value

   - ``pe_current``, shape ``(n_max_tracks,)``: true equivalent photocurrent generated by the edep-sim segment
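
To illustrate how the truth information in ``light_dat`` might be used, the
sketch below builds a small synthetic array with the two fields listed above
and computes the total number of photoelectrons and the earliest photon
arrival time on each optical channel. The values and field dtypes are
assumptions chosen for illustration; with a real file the array would come
from ``f['light_dat'][:]``.

```python
import numpy as np

# Synthetic stand-in for f['light_dat'][:] (dtypes are assumed):
# 3 edep-sim segments x 2 optical channels.
light_dat_dtype = np.dtype([('n_photons_det', 'f8'), ('t0_det', 'f8')])
light_dat = np.zeros((3, 2), dtype=light_dat_dtype)
light_dat['n_photons_det'] = [[10.0, 0.0], [5.0, 2.0], [0.0, 0.0]]
light_dat['t0_det'] = [[1.5, 0.0], [0.5, 3.0], [0.0, 0.0]]

# Total true photoelectrons on each optical channel, summed over segments
total_pe = light_dat['n_photons_det'].sum(axis=0)

# Earliest arrival time on each channel, using only the segments that
# actually produced photoelectrons on that channel
has_light = light_dat['n_photons_det'] > 0
t0 = np.where(has_light, light_dat['t0_det'], np.inf)
first_arrival = t0.min(axis=0)
```

A channel that saw no light from any segment ends up with an infinite
``first_arrival`` in this sketch, which makes such channels easy to filter out.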

Charge/light matching
---------------------

Because there is no common trigger for the charge and light digitizations,
association between the two datastreams must be done using the timestamp. A
short example using numpy is provided here::

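    import h5py
    import numpy as np

    filename = '<larnd-sim file>'  # replace with the path to your file
    f = h5py.File(filename, 'r')

    charge_packets = f['packets'][:]
    # Positions of the trigger packets (packet_type == 7) in the packets array
    charge_trigger = np.flatnonzero(charge_packets['packet_type'] == 7)
    charge_trigger_timestamp = charge_packets['timestamp'][charge_trigger]

    light_trigger_timestamp = f['light_trig'][:]['ts_sync']

    # return_indices=True is required to also get the positions of the matches
    timestamp, trigger_index, light_index = np.intersect1d(
        charge_trigger_timestamp, light_trigger_timestamp, return_indices=True)
    # Convert positions among the trigger packets into positions among all packets
    packet_index = charge_trigger[trigger_index]

    # Packets and waveforms between the first and second matched triggers
    event0_packets = charge_packets[packet_index[0]:packet_index[1]]
    event0_waveforms = f['light_wvfm'][light_index[0]:light_index[1]]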
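
The same timestamp matching can be extended to all events. The sketch below
uses small synthetic arrays in place of the ``packets`` and ``light_trig``
datasets so that it runs without an input file. It assumes that every packet
between two consecutive matched trigger packets belongs to the earlier
trigger; the variable names and values are illustrative only.

```python
import numpy as np

# Synthetic stand-ins for charge_packets['packet_type'],
# charge_packets['timestamp'] and f['light_trig'][:]['ts_sync']
packet_type = np.array([7, 0, 0, 7, 0, 7, 0])
packet_timestamp = np.array([100, 101, 105, 200, 203, 300, 301], dtype='u8')
light_ts_sync = np.array([100, 200, 250, 300], dtype='u8')

# Positions and timestamps of the trigger packets (packet_type == 7)
trigger_rows = np.flatnonzero(packet_type == 7)
trigger_timestamp = packet_timestamp[trigger_rows]

# Timestamps present in both streams, with their positions in each input
matched, trigger_index, light_index = np.intersect1d(
    trigger_timestamp, light_ts_sync, return_indices=True)

# Positions of the matched trigger packets within the full packet stream
packet_index = trigger_rows[trigger_index]

# Assumed grouping: each event runs from its trigger packet up to the next
# matched trigger packet (or the end of the stream for the last event)
bounds = np.append(packet_index, len(packet_type))
events = [(int(start), int(stop))
          for start, stop in zip(bounds[:-1], bounds[1:])]
```

Each ``(start, stop)`` pair can then be used to slice the packet array, and the
corresponding entry of ``light_index`` selects the matching light trigger.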