retriever

FileRetriever classes

class retriever.DidRetriever(dids: list, dupes: set | None = None)

Class for retrieving metadata from MetaCat using a list of DIDs.

Initialize the DidRetriever with a list of DIDs.

Parameters:
  • dids – list of file DIDs to find

  • dupes – set of indices of duplicate DIDs

check_duplicates() → set

Check DID list for duplicate entries.

Returns:

set of indices of duplicate DIDs
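The duplicate check can be pictured with a minimal sketch. The function name find_duplicate_indices is hypothetical, and the sketch assumes that the first occurrence of a DID is kept and only later repeats are flagged; the actual method may define duplicates differently.

```python
def find_duplicate_indices(dids: list) -> set:
    """Return the indices of repeated DIDs, keeping the first occurrence.

    Assumption: only the second and later copies count as duplicates.
    """
    seen = set()
    dupes = set()
    for index, did in enumerate(dids):
        if did in seen:
            dupes.add(index)  # a later copy of a DID already seen
        else:
            seen.add(did)
    return dupes


dids = ["ns:a.root", "ns:b.root", "ns:a.root", "ns:c.root", "ns:b.root"]
duplicate_indices = find_duplicate_indices(dids)
print(sorted(duplicate_indices))  # [2, 4]
```

A set of indices like this matches the shape of the dupes argument accepted by the constructor.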

check_namespaces() → None

Check DID list for namespace issues.

async get_metadata(batch: InputBatch, limit: int) → list

Asynchronously request a batch of DIDs from MetaCat

Parameters:
  • batch – InputBatch object with skip index set

  • limit – maximum number of files to retrieve

Returns:

list of file metadata dictionaries

class retriever.InputBatch(skip: int = -1, files: list | None = None)

Class representing a batch of input file data, starting at a specific skip index.

class retriever.LocalMetaRetriever(paths: list)

MetaRetriever for local files

Initialize the LocalMetaRetriever with a list of JSON metadata files.

Parameters:

paths – list of metadata file paths

async get_metadata(batch: InputBatch, limit: int) → list

Asynchronously retrieve metadata for a specific batch of files.

Parameters:
  • batch – InputBatch object containing skip index and list of file names

  • limit – maximum number of files to retrieve

Returns:

list of file metadata dictionaries
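As an illustration of batch retrieval from local files, the following sketch reads a slice of JSON metadata files selected by a skip index and a limit. The names load_local_metadata and _load_json are hypothetical, and the use of worker threads for the blocking reads is an assumption, not a description of the actual implementation.

```python
import asyncio
import json
import tempfile
from pathlib import Path


def _load_json(path: Path) -> dict:
    """Blocking helper that parses one JSON metadata file."""
    with path.open() as handle:
        return json.load(handle)


async def load_local_metadata(paths: list, skip: int, limit: int) -> list:
    """Read up to `limit` metadata files, starting at index `skip`."""
    selected = paths[skip:skip + limit]
    # File reads block, so each one runs in a worker thread (an assumption).
    loaded = await asyncio.gather(
        *(asyncio.to_thread(_load_json, Path(p)) for p in selected)
    )
    return list(loaded)


def demo() -> list:
    """Write five small metadata files and read back the batch at skip=2."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(5):
            path = Path(tmp) / f"file{i}.json"
            path.write_text(json.dumps({"name": f"file{i}.root", "size": i}))
            paths.append(path)
        return asyncio.run(load_local_metadata(paths, skip=2, limit=2))


result = demo()
print([meta["name"] for meta in result])  # ['file2.root', 'file3.root']
```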

class retriever.MetaRetriever

Base class for retrieving metadata from a source

async check_existence(files: list) → None

Check that MetaCat records exist for a batch of input files.

Parameters:

files – list of file metadata dictionaries to check

async check_parents(files: list) → None

Check that MetaCat records exist for the parents of a batch of input files. Also get the siblings if we need them for the already merged check.

Parameters:

files – list of file metadata dictionaries to check

async connect() → None

Connect to the MetaCat web API

async disconnect() → None

Disconnect from the MetaCat web API

property files: MergeSet

Return the set of files from the source

async get_batch(getter: Callable, batch: InputBatch, **kwargs) → InputBatch

Asynchronously retrieve a batch of input data, with caching.

Parameters:
  • getter – function to call to retrieve inputs

  • batch – InputBatch object to retrieve data for

  • kwargs – additional arguments to pass to getter

Returns:

InputBatch object holding the retrieved list of file dictionaries
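A cached batch fetch of this kind might look like the sketch below. Batch and BatchCache are hypothetical stand-ins, and keying an in-memory cache on the skip index is an assumption; the real method may cache differently (for example on disk).

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class Batch:
    """Hypothetical stand-in for InputBatch."""
    skip: int = -1
    files: Optional[list] = None


class BatchCache:
    """Calls the getter once per skip index and reuses the result afterwards."""

    def __init__(self) -> None:
        self._cache: dict = {}
        self.calls = 0  # number of times the getter was actually invoked

    async def get_batch(self, getter, batch: Batch, **kwargs) -> Batch:
        if batch.skip not in self._cache:
            self.calls += 1
            self._cache[batch.skip] = await getter(batch, **kwargs)
        batch.files = self._cache[batch.skip]
        return batch


async def fake_getter(batch: Batch, limit: int) -> list:
    """Hypothetical getter that fabricates DIDs instead of querying a server."""
    await asyncio.sleep(0)
    return [{"did": f"ns:file{i}"} for i in range(batch.skip, batch.skip + limit)]


async def main():
    cache = BatchCache()
    first = await cache.get_batch(fake_getter, Batch(skip=0), limit=2)
    second = await cache.get_batch(fake_getter, Batch(skip=0), limit=2)
    return cache.calls, first.files, second.files


calls, first_files, second_files = asyncio.run(main())
print(calls)  # 1
```

The keyword arguments are forwarded untouched to the getter, which mirrors the kwargs parameter described above.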

async get_done() → None

Asynchronously query MetaCat for merged files with the same tag as this job

async get_files(query: list) → list

Asynchronously retrieve file metadata for a specific list of DIDs. Also gets the siblings if we need them for the already merged check.

Parameters:

query – list of dictionaries with ‘did’ keys to retrieve

Returns:

list of file metadata dictionaries

abstract async get_metadata(batch: InputBatch, limit: int) → list

Asynchronously retrieve metadata for a specific batch of files.

Parameters:
  • batch – empty InputBatch object with the skip index set

  • limit – maximum number of files to retrieve

Returns:

list of file metadata dictionaries
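The contract of an abstract async method can be sketched with stand-in classes. Batch, BaseRetriever and InMemoryRetriever are hypothetical names used only for illustration; the sketch shows that a subclass must supply get_metadata before it can be instantiated.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class Batch:
    """Hypothetical stand-in for InputBatch."""
    skip: int = -1
    files: Optional[list] = None


class BaseRetriever(ABC):
    """Hypothetical stand-in for MetaRetriever."""

    @abstractmethod
    async def get_metadata(self, batch: Batch, limit: int) -> list:
        """Return up to `limit` metadata dictionaries starting at batch.skip."""


class InMemoryRetriever(BaseRetriever):
    """Toy subclass that serves metadata from a list held in memory."""

    def __init__(self, records: list) -> None:
        self.records = records

    async def get_metadata(self, batch: Batch, limit: int) -> list:
        return self.records[batch.skip:batch.skip + limit]


records = [{"name": f"file{i}.root"} for i in range(5)]
retriever = InMemoryRetriever(records)
page = asyncio.run(retriever.get_metadata(Batch(skip=3), limit=10))
print([meta["name"] for meta in page])  # ['file3.root', 'file4.root']
```

Instantiating BaseRetriever directly raises TypeError, because the abstract method has no implementation.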

async get_siblings(files: list) → None

Already-merged files are detected by looking at the children of the input files. In grandparents mode, the relevant files are instead the children of the input files’ parents, that is, the siblings. The original children are not needed in that case, so they are replaced with the siblings.

Parameters:

files – list of file metadata dictionaries to check

async input_batches() → AsyncGenerator[InputBatch, None]

Asynchronously retrieve input file metadata in batches.

Returns:

InputBatch object containing skip index and list of MergeFile objects
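Batched retrieval with an async generator can be sketched as follows. Batch and fetch are hypothetical, and stopping on an empty or short page is an assumption about the paging logic rather than a description of the actual method.

```python
import asyncio
from dataclasses import dataclass, field
from typing import AsyncGenerator


@dataclass
class Batch:
    """Hypothetical stand-in for InputBatch."""
    skip: int = -1
    files: list = field(default_factory=list)


async def fetch(skip: int, limit: int) -> list:
    """Hypothetical source that holds seven records."""
    records = [f"ns:file{i}" for i in range(7)]
    await asyncio.sleep(0)
    return records[skip:skip + limit]


async def input_batches(limit: int) -> AsyncGenerator[Batch, None]:
    """Yield batches of at most `limit` files until the source is exhausted."""
    skip = 0
    while True:
        files = await fetch(skip, limit)
        if not files:
            break
        yield Batch(skip=skip, files=files)
        if len(files) < limit:
            break  # a short page means nothing is left (an assumption)
        skip += limit


async def collect(limit: int) -> list:
    return [(batch.skip, len(batch.files)) async for batch in input_batches(limit)]


summary = asyncio.run(collect(3))
print(summary)  # [(0, 3), (3, 3), (6, 1)]
```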

property namespace: str

Return the default namespace for files without an explicit namespace. Checks the config for input.namespace, then output.namespace, then defaults to ‘usertests’.
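The lookup order described above can be sketched as a small function. The name default_namespace and the representation of the config as a nested dictionary are assumptions made for illustration.

```python
def default_namespace(config: dict) -> str:
    """Pick input.namespace, then output.namespace, then 'usertests'."""
    for section in ("input", "output"):
        # Assumption: the config is a nested dict keyed by section name.
        namespace = config.get(section, {}).get("namespace")
        if namespace:
            return namespace
    return "usertests"


print(default_namespace({"output": {"namespace": "prod"}}))  # prod
print(default_namespace({}))  # usertests
```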

run() → None

Retrieve metadata for all files.

class retriever.QueryRetriever(query: str)

Class for retrieving metadata from MetaCat using an MQL query.

Initialize the QueryRetriever with an MQL query.

Parameters:

query – MQL query to find files

async get_metadata(batch: InputBatch, limit: int) → list

Asynchronously query MetaCat for a specific batch of files

Parameters:
  • batch – InputBatch object with skip index set

  • limit – maximum number of files to retrieve

Returns:

list of file metadata dictionaries

retriever.file_serializer(obj)

Custom JSON serializer for MergeFileError objects
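A custom serializer of this kind is typically passed to json.dumps through its default argument. In the sketch below, FileError is a hypothetical stand-in for MergeFileError, and modelling it as an Enum serialized by name is an assumption.

```python
import json
from enum import Enum


class FileError(Enum):
    """Hypothetical stand-in for MergeFileError."""
    DUPLICATE = 1
    NO_METADATA = 2


def serialize(obj):
    """Convert FileError members to their names; reject everything else."""
    if isinstance(obj, FileError):
        return obj.name
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


record = {"did": "ns:a.root", "error": FileError.DUPLICATE}
text = json.dumps(record, default=serialize)
print(text)  # {"did": "ns:a.root", "error": "DUPLICATE"}
```

json.dumps calls the default function only for objects it cannot encode itself, so ordinary strings, numbers, lists and dictionaries pass through unchanged.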

retriever.get() → MetaRetriever

Create and return a metadata retriever based on the input mode:
  • files – LocalMetaRetriever if any metadata files were provided, otherwise DidRetriever

  • dids – DidRetriever

  • query – QueryRetriever

  • dataset – QueryRetriever with a query for files in the specified dataset

Returns:

MetaRetriever object for retrieving file metadata
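The mode-to-class mapping can be sketched as a simple dispatch. The function choose_retriever is hypothetical and returns class names as strings rather than instances; detecting metadata files by a .json suffix is an assumption, since the source does not say how they are recognized.

```python
def choose_retriever(mode: str, inputs: list) -> str:
    """Return the name of the retriever class that would handle `mode`."""
    if mode == "files":
        # Assumption: metadata files are recognized by a .json suffix.
        if any(str(item).endswith(".json") for item in inputs):
            return "LocalMetaRetriever"
        return "DidRetriever"
    if mode == "dids":
        return "DidRetriever"
    if mode in ("query", "dataset"):
        # Dataset mode uses a query for the files in that dataset.
        return "QueryRetriever"
    raise ValueError(f"unknown input mode: {mode!r}")


print(choose_retriever("files", ["a.root", "a.root.json"]))  # LocalMetaRetriever
print(choose_retriever("files", ["a.root"]))                 # DidRetriever
print(choose_retriever("dataset", ["ns:my_dataset"]))        # QueryRetriever
```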