retriever
FileRetriever classes
- class retriever.DidRetriever(dids: list, dupes: set | None = None)
Class for retrieving metadata from MetaCat using a list of DIDs.
Initialize the DidRetriever with a list of DIDs.
- Parameters:
dids – list of file DIDs to find
dupes – set of indices of duplicate DIDs
- check_duplicates() → set
Check DID list for duplicate entries.
- Returns:
set of indices of duplicate DIDs
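The duplicate check can be sketched in a few lines. This is only an illustration, not the module's implementation: the helper name `find_duplicate_indices` is hypothetical, and it assumes that the first occurrence of a DID is kept while every later occurrence is reported as a duplicate.

```python
def find_duplicate_indices(dids: list) -> set:
    """Return the indices of repeated DIDs (hypothetical helper).

    Assumption: the first occurrence of each DID is kept, and only the
    later occurrences are reported.
    """
    seen = set()
    dupes = set()
    for index, did in enumerate(dids):
        if did in seen:
            dupes.add(index)  # a later occurrence of a DID already seen
        else:
            seen.add(did)
    return dupes


dids = ["ns:a.root", "ns:b.root", "ns:a.root", "ns:c.root", "ns:b.root"]
duplicates = find_duplicate_indices(dids)  # {2, 4}
```

A set of indices in this form matches the type of the `dupes` argument accepted by the constructor.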
- async get_metadata(batch: InputBatch, limit: int) → list
Asynchronously request a batch of DIDs from MetaCat.
- Parameters:
batch – InputBatch object with skip index set
limit – maximum number of files to retrieve
- Returns:
list of file metadata dictionaries
- class retriever.InputBatch(skip: int = -1, files: list | None = None)
Class representing a batch of input file data, starting at a specific skip index.
- class retriever.LocalMetaRetriever(paths: list)
MetaRetriever for local files
Initialize the LocalMetaRetriever with a list of json files.
- Parameters:
paths – list of metadata file paths
- async get_metadata(batch: InputBatch, limit: int) → list
Asynchronously retrieve metadata for a specific batch of files.
- Parameters:
batch – InputBatch object containing skip index and list of file names
limit – maximum number of files to retrieve
- Returns:
list of file metadata dictionaries
- class retriever.MetaRetriever
Base class for retrieving metadata from a source
- async check_existence(files: list) → None
Check that MetaCat records exist for a batch of input files.
- Parameters:
files – list of file metadata dictionaries to check
- async check_parents(files: list) → None
Check that MetaCat records exist for the parents of a batch of input files. Also gets the siblings if they are needed for the already-merged check.
- Parameters:
files – list of file metadata dictionaries to check
- async get_batch(getter: Callable, batch: InputBatch, **kwargs) → InputBatch
Asynchronously retrieve a batch of input data, with caching.
- Parameters:
getter – function to call to retrieve inputs
batch – InputBatch object to retrieve data for
kwargs – additional arguments to pass to getter
- Returns:
InputBatch object populated with the retrieved file dictionaries
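The caching behaviour described for get_batch can be illustrated with a small self-contained sketch. Everything here is an assumption made for illustration: `Batch` is a stand-in for InputBatch, the in-memory dictionary `_cache` keyed by skip index is hypothetical (the real cache may work differently), and `fake_getter` replaces a real MetaCat request.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Batch:
    """Hypothetical stand-in for retriever.InputBatch."""
    skip: int = -1
    files: list = field(default_factory=list)


_cache: dict = {}  # hypothetical in-memory cache, keyed by skip index


async def get_batch(getter, batch: Batch, **kwargs) -> Batch:
    """Fill `batch` using `getter`, reusing cached results when available."""
    if batch.skip in _cache:
        batch.files = _cache[batch.skip]  # cache hit: getter is not called
        return batch
    batch.files = await getter(batch, **kwargs)
    _cache[batch.skip] = batch.files
    return batch


calls = []  # records which skip indices actually reached the getter


async def fake_getter(batch: Batch, limit: int) -> list:
    """Pretend to fetch `limit` records starting at the skip index."""
    calls.append(batch.skip)
    return [{"did": f"ns:file{i}.root"} for i in range(batch.skip, batch.skip + limit)]


async def demo():
    first = await get_batch(fake_getter, Batch(skip=0), limit=2)
    second = await get_batch(fake_getter, Batch(skip=0), limit=2)
    return first, second


first, second = asyncio.run(demo())
```

In this sketch the second request for skip index 0 is served from the cache, so `fake_getter` runs only once.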
- async get_done() → None
Asynchronously query MetaCat for merged files with the same tag as this job.
- async get_files(query: list) → list
Asynchronously retrieve file metadata for a specific list of DIDs. Also gets the siblings if they are needed for the already-merged check.
- Parameters:
query – list of dictionaries with ‘did’ keys to retrieve
- Returns:
list of file metadata dictionaries
- abstract async get_metadata(batch: InputBatch, limit: int) → list
Asynchronously retrieve metadata for a specific batch of files.
- Parameters:
batch – empty InputBatch object with the skip index set
limit – maximum number of files to retrieve
- Returns:
list of file metadata dictionaries
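Because get_metadata is abstract, each concrete retriever supplies its own implementation. The sketch below shows the general pattern with stand-in classes rather than the real ones: `Batch`, `BaseRetriever`, and `ListRetriever` are hypothetical names, and it assumes that the skip index is an offset into the full record list.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class Batch:
    """Hypothetical stand-in for retriever.InputBatch."""
    skip: int = -1
    files: list = field(default_factory=list)


class BaseRetriever(ABC):
    """Hypothetical stand-in for the abstract part of MetaRetriever."""

    @abstractmethod
    async def get_metadata(self, batch: Batch, limit: int) -> list:
        """Return up to `limit` metadata dictionaries for `batch`."""


class ListRetriever(BaseRetriever):
    """Toy retriever that serves records from an in-memory list."""

    def __init__(self, records: list):
        self.records = records

    async def get_metadata(self, batch: Batch, limit: int) -> list:
        # Assumption: skip is an offset into the full list of records.
        return self.records[batch.skip : batch.skip + limit]


records = [{"did": f"ns:file{i}.root"} for i in range(5)]
page = asyncio.run(ListRetriever(records).get_metadata(Batch(skip=2), limit=2))
```

Here `page` holds the records at offsets 2 and 3. Instantiating `BaseRetriever` directly raises `TypeError`, which is the usual effect of an abstract method.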
- async get_siblings(files: list) → None
We check for already-merged files by looking at the children of the input files. In grandparents mode, however, we want the children of the input files' parents instead. The original children are not needed in that case, so they are replaced with the siblings.
- Parameters:
files – list of file metadata dictionaries to check
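The replacement of children by siblings can be pictured with plain dictionaries. This is a rough sketch under several assumptions that the documentation does not confirm: the keys `did`, `parents`, and `children`, the lookup table `parent_children`, and the decision to exclude a file from its own sibling list are all hypothetical.

```python
def replace_children_with_siblings(files: list, parent_children: dict) -> None:
    """Overwrite each file's children with the children of its parents.

    Hypothetical data layout: each file is a dict with 'did', 'parents'
    and 'children' keys, and `parent_children` maps a parent DID to the
    DIDs of its children.
    """
    for file in files:
        siblings = []
        for parent in file.get("parents", []):
            for child in parent_children.get(parent, []):
                # Assumption: a file is not counted as its own sibling.
                if child != file["did"] and child not in siblings:
                    siblings.append(child)
        file["children"] = siblings  # the original children are discarded


files = [{"did": "ns:c1", "parents": ["ns:p1"], "children": ["ns:old"]}]
parent_children = {"ns:p1": ["ns:c1", "ns:c2", "ns:merged"]}
replace_children_with_siblings(files, parent_children)
```

After the call, the single file lists `ns:c2` and `ns:merged` as its children, and the original child `ns:old` is gone.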
- async input_batches() → AsyncGenerator[InputBatch, None]
Asynchronously retrieve input file metadata in batches.
- Returns:
InputBatch object containing skip index and list of MergeFile objects
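An async generator of this kind is consumed with `async for`. The following sketch shows one plausible batching loop. It is not the module's code: the `batches` function, the dictionary used in place of InputBatch, and the stop condition (an empty page ends the iteration) are assumptions.

```python
import asyncio
from typing import AsyncGenerator


async def batches(records: list, limit: int) -> AsyncGenerator[dict, None]:
    """Yield successive pages of `records`, each tagged with its skip index."""
    skip = 0
    while True:
        files = records[skip : skip + limit]
        if not files:
            break  # assumption: an empty page means the input is exhausted
        yield {"skip": skip, "files": files}
        skip += len(files)


async def collect() -> list:
    # Consume the generator with an async comprehension.
    return [batch async for batch in batches(list(range(5)), limit=2)]


result = asyncio.run(collect())
```

With five records and a limit of two, the loop yields three batches whose skip indices are 0, 2, and 4, and the last batch holds a single record.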
- property namespace: str
Return the default namespace for files without an explicit namespace. Checks the config for input.namespace, then output.namespace, then defaults to ‘usertests’.
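The lookup order for the default namespace can be written as a short fallback chain. This sketch assumes the configuration is available as a nested dictionary with `input` and `output` sections. That layout and the function name `default_namespace` are hypothetical. Only the order of the checks and the final default come from the description above.

```python
def default_namespace(config: dict) -> str:
    """Return input.namespace, else output.namespace, else 'usertests'."""
    for section in ("input", "output"):
        # Treat a missing or empty section as having no namespace.
        namespace = (config.get(section) or {}).get("namespace")
        if namespace:
            return namespace
    return "usertests"


from_input = default_namespace({"input": {"namespace": "ns_in"}, "output": {"namespace": "ns_out"}})
from_output = default_namespace({"input": {}, "output": {"namespace": "ns_out"}})
fallback = default_namespace({})
```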
- class retriever.QueryRetriever(query: str)
Class for retrieving metadata from MetaCat using an MQL query.
Initialize the QueryRetriever with an MQL query.
- Parameters:
query – MQL query to find files
- async get_metadata(batch: InputBatch, limit: int) → list
Asynchronously query MetaCat for a specific batch of files.
- Parameters:
batch – InputBatch object with skip index set
limit – maximum number of files to retrieve
- Returns:
list of file metadata dictionaries
- retriever.get() → MetaRetriever
Create and return a metadata retriever based on the input mode:
files – LocalMetaRetriever if any metadata files were provided, otherwise DidRetriever
dids – DidRetriever
query – QueryRetriever
dataset – QueryRetriever with a query for files in the specified dataset
- Returns:
MetaRetriever object for retrieving file metadata
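The mapping from input mode to retriever class described for retriever.get() can be summarised as a small dispatch function. This sketch returns class names as strings so that it runs without the package installed. The function name `choose_retriever`, its parameters, and the `ValueError` for unknown modes are assumptions, since the real function takes no arguments and presumably reads the mode from the job configuration.

```python
def choose_retriever(mode: str, has_metadata_files: bool = False) -> str:
    """Return the name of the retriever class used for an input mode."""
    if mode == "files":
        # Local metadata files take priority. Otherwise fall back to DIDs.
        return "LocalMetaRetriever" if has_metadata_files else "DidRetriever"
    table = {
        "dids": "DidRetriever",
        "query": "QueryRetriever",
        "dataset": "QueryRetriever",  # built from a query over the dataset
    }
    if mode not in table:
        # Assumption: unknown modes are rejected; the real behaviour may differ.
        raise ValueError(f"unknown input mode: {mode!r}")
    return table[mode]


local = choose_retriever("files", has_metadata_files=True)
by_did = choose_retriever("files")
by_dataset = choose_retriever("dataset")
```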