
MMIF storage API #50

Open
keighrim opened this issue Jun 11, 2024 · 4 comments
Labels
✨N New feature or request

Comments

keighrim (Member) commented Jun 11, 2024

New Feature Summary


Right now, all the prediction/hypothesis/experimental MMIF output from CLAMS pipelines is pushed to this GitHub repo for evaluation, along with back-pointers in the report.md files. That works fine until we hit the repo's storage limit. To future-proof, we'd like to develop a systematic way of storing and retrieving MMIF output files in a more spacious (and maybe more private) storage solution.

storage side

indexing

To hold a large collection of MMIF data, I'm proposing we implement a kind of trie-based indexing system. The actual files can be stored in, say, S3 buckets or on lab servers (we have plenty of HDD space anyway). The envisioned trie is keyed simply on the apps used in the MMIF file, each split into shortname and version. This way, all the necessary "configuration" for the store API is saved inside the data payload itself, and we don't have to come up with an additional configuration scheme for the store API. In other words, a user can call the store API with just the MMIF file itself.

storage API result example

For example, if we use a directory structure for indexing, to store a MMIF file cpb-aacip-xxxx.mmif generated from a pipeline consisting of

  1. swt/v5.0
  2. doctr/v2.0
  3. whisper/v1.6

When the user sends the file

curl -d @cpb-aacip-xxxx.mmif mmif-storage.clams.ai/store

the file is saved as /some_data_root/swt/v5.0/doctr/v2.0/whisper/v1.6/cpb-aacip-xxxx.mmif

Then later if we have a second MMIF file to store, using the same pipeline, except now with whisper 2.0,
the file is saved as /some_data_root/swt/v5.0/doctr/v2.0/whisper/v2.0/cpb-aacip-xxxx.mmif

This will result in a file system-based storage that looks like this at this point

some_data_root/
└── swt
    └── v5.0
        └── doctr
            └── v2.0
                └── whisper
                    ├── v1.6
                    │   └── cpb-aacip-xxxx.mmif
                    └── v2.0
                        └── cpb-aacip-xxxx.mmif
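The path derivation could be sketched in Python like this. The assumption that each view's metadata carries an `app` IRI ending in `/<shortname>/<version>` follows MMIF conventions, but the exact field layout (and the `storage_path` name itself) should be treated as illustrative:

```python
def storage_path(mmif: dict, guid: str, data_root: str = "/some_data_root") -> str:
    """Derive the trie-style storage path from the apps recorded in a MMIF file.

    Assumes each view's metadata carries an ``app`` IRI ending in
    ``/<shortname>/<version>`` (e.g. http://apps.clams.ai/swt/v5.0).
    """
    segments = []
    for view in mmif.get("views", []):
        app_iri = view["metadata"]["app"].rstrip("/")
        # the last two IRI components are the shortname and the version
        shortname, version = app_iri.rsplit("/", 2)[-2:]
        segments += [shortname, version]
    return "/".join([data_root, *segments, guid + ".mmif"])
```

Because the path is computed entirely from the payload, the store endpoint needs no extra configuration from the caller, matching the "just send the MMIF" design above.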

retrieval side

retriever API argument structure

Now on the retrieval side, a retrieval API should expect two string arguments:

  1. the pipeline configuration, concatenated into a single string
  2. the AAPB media GUID

simple retrieval

Then, the retriever can convert the first argument into a directory path, and look for the second argument in the directory.

curl "mmif-storage.clams.ai/retrieve?pipeline=swt/v5.0:doctr/v2.0:whisper/v1.6&guid=cpb-aacip-xxxx"
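On the server side, that lookup is just string-to-path translation. A minimal sketch (the `resolve` name and `data_root` default are hypothetical):

```python
def resolve(pipeline: str, guid: str, data_root: str = "/some_data_root") -> str:
    """Map the retriever's two query arguments onto a storage path.

    ``pipeline`` is the colon-joined serialization, e.g.
    "swt/v5.0:doctr/v2.0:whisper/v1.6"; each ":"-separated element
    is already a "<shortname>/<version>" pair.
    """
    return "/".join([data_root, *pipeline.split(":"), guid + ".mmif"])
```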

retrieval with rewind

However, in addition to simple file retrieval, we can dynamically "rewind" MMIFs whenever a descendant MMIF exists. For example, if the user asked for pipeline=swt/v5.0:doctr/v2.0, even though that exact file is not stored in the storage system, the retriever can keep "walking down" the subdirectories until it finds the first MMIF, then use clamsproject/clams-python#190 to return a partial MMIF that meets the user's request.
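The walk-down part could look like the sketch below; the actual stripping of extra views is delegated to the rewinder (clamsproject/clams-python#190), so this only locates a descendant file. Function name and return shape are assumptions:

```python
import os

def retrieve_with_rewind(pipeline: str, guid: str, data_root: str = "/some_data_root"):
    """Locate a MMIF for the requested pipeline, falling back to descendant
    directories when the exact file is absent.

    Returns (path, needs_rewind); when needs_rewind is True, the caller
    should pass the file through the rewinder before returning it.
    """
    base = os.path.join(data_root, *pipeline.split(":"))
    exact = os.path.join(base, guid + ".mmif")
    if os.path.isfile(exact):
        return exact, False
    # walk down descendant app/version subdirectories for the first match
    for dirpath, _dirnames, filenames in os.walk(base):
        if guid + ".mmif" in filenames:
            return os.path.join(dirpath, guid + ".mmif"), True
    return None, False
```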

automatic garbage collection

Given the power of rewind, we can always delete any intermediate MMIF and keep only the files in the terminal subdirectories. This can be a cronjob (if using a file system), or more sophisticated DB management.
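For the file-system flavor, the cronjob body might be as simple as the sketch below. Note a caveat the text glosses over: a production version should first verify that a descendant with the same GUID actually exists before deleting an intermediate, which this sketch does not do:

```python
import os

def collect_garbage(data_root: str) -> list:
    """Remove MMIF files from non-terminal directories, keeping only files
    in leaf subdirectories (the deleted intermediates are recoverable from
    their descendants via rewind).

    NOTE: assumes every intermediate MMIF has a same-GUID descendant; a
    real implementation should check that before deleting.
    """
    removed = []
    for dirpath, dirnames, filenames in os.walk(data_root):
        if dirnames:  # not a terminal subdirectory
            for fn in filenames:
                if fn.endswith(".mmif"):
                    path = os.path.join(dirpath, fn)
                    os.remove(path)
                    removed.append(path)
    return removed
```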

Related

Alternatives

No response

Additional context

No response

@keighrim keighrim added the ✨N New feature or request label Jun 11, 2024
@clams-bot clams-bot added this to infra Jun 11, 2024
@github-project-automation github-project-automation bot moved this to Todo in infra Jun 11, 2024
keighrim (Member Author) commented Jun 12, 2024

One big piece I missed in the above description is the runtime configuration of the apps. I think we can treat configurations as part of the app identification. Namely, instead of [shortname]/[version] we can have [shortname]/[version]/[param1-val1]/[param2-val2] in the pipeline serialization scheme, where the "params" parts are alphabetically sorted for easy retrieval.
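A sketch of that per-app serialization, using `-` as the key-value joiner named above (function name is illustrative):

```python
def app_segments(shortname: str, version: str, params: dict) -> list:
    """Serialize one app and its runtime parameters into path segments,
    with the parameter parts sorted alphabetically by key."""
    return [shortname, version] + [f"{k}-{v}" for k, v in sorted(params.items())]
```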

A few problems with this implementation using a directory structure:

  1. for a multivalued parameter, we can either flatten the values into many (a series of) subdirectories, or concatenate them into one. For the latter, we need to introduce another arbitrary "syntax" to the serialization scheme.
  2. parameter values can be pretty much anything, including whitespace, newlines, etc., that can't be safely used in a directory name
  3. parameter values can be any length, while file paths have a length limit.

keighrim (Member Author) commented

Another aspect of the problem is that not all pipelines are "serial" (some components in the pipeline can run in parallel), although with some careful consideration and design, we should be able to serialize them into a single string identifier.

For example, say that we want to 1) force-align a transcript, 2) run NER on the transcript, and 3) find NE temporal locations in the video. Component 2 (NER) does not rely on the output from component 1 (FA), and this is what I mean by "in parallel" (they don't need to "run side-by-side" as software).

MrSqually (Contributor) commented

Here are some observations / thoughts / questions at the moment:

Indexing

  • I think the trie structure makes sense, and I support the use of app/version:app/version as the fundamental layout.
  • runtime parameters do complicate things quite a bit. I think the overwhelming majority of use cases will sidestep the problems you listed, but maybe we can figure out some sort of hashing/shortening algorithm for parameters? If we can generate a fixed-length, bidirectional representation of a given param+value pair, we can translate it during both indexing and lookup for retrieval, without otherwise impacting the relationship between the directory layout and the information within a given subtree, and without needing to store a huge map of {parameters:representations}. The con is that it's really bad for readability, arguably worse than introducing another arbitrary syntax, and requires some internal work to actually expand/compress the parameter (i.e., maybe it's overkill).
  • I think there are two ways to handle the "parallel" apps problem. Either we come up with a design solution, or we sidestep the problem entirely by framing the notion of a "mmif pipeline" as not necessarily a connected sequence of annotations, but a series of application outputs, in which case A=>B and B=>A would be understood as different, even if they don't rely on each other and ultimately produce the same views. I'm not entirely sure the former has an elegant solution, and I understand the problems with the latter (why store the output of appB within appA if they're completely disconnected, issues with replication across different pipelines, potential duplicate storage of appA/v1:appB/v1 and appB/v1:appA/v1, etc.), but since the directory structure is approximating a "timeline" of sorts, maybe Occam's razor here is to just treat the apps in the order they appear in the mmif, and then provide that ordering as a standard within the documentation. This one is messy though, and I don't have any solutions I'm completely confident proposing.

Retrieval

No notes here, the proposed implementation of retrieval (using mmif-rewinder) seems like a perfectly valid approach.

keighrim (Member Author) commented

For the "params" part of the directory, instead of enumerating all the k-v pairs in the parameters dict, we can "hash" the entire dict as @MrSqually suggested. To mitigate the hit to readability, we can throw in an additional file (clearly marked as "non-mmif" in the filename) with the original string values from before the hashing. Practically, we do

  1. grab parameters from the parameters field in the view metadata (why not appConfig? Because at retrieval time, we want to query the storage system with the actual parameters that users use, not the refined and saturated set of parameters)
  2. concat each k-v pair into a single string (using : or = as the joiner/delimiter), resulting in a list of strings instead of a dict (note that all user parameters are recorded as strings in the parameter dict, so we don't have to worry about type casting)
  3. sort the concatenated strings alphabetically (so that the order of parameters at retrieval time doesn't matter)
  4. concat the sorted strings into a single string (using , or \n)
  5. compute a hash string (this is not really a security-critical operation, so something that is quick and generates a reasonably long string, but not too long for a directory name, should be fine for the algorithm selection; I'm looking at hashlib.md5)
  6. use the hashed string as part of the directory structure, but also json.dump the original param dict in the directory.
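The six steps could be sketched like this, picking `=` as the pair joiner and `\n` as the concatenator from the options above; the sidecar filename `parameters.non-mmif.json` is just one illustration of the "clearly marked as non-mmif" idea:

```python
import hashlib
import json
import os

def param_hash(params: dict) -> str:
    """Steps 2-5: join each k-v pair with '=', sort the resulting strings,
    concatenate them with newlines, and md5-hash the result."""
    pairs = sorted(f"{k}={v}" for k, v in params.items())
    return hashlib.md5("\n".join(pairs).encode("utf-8")).hexdigest()

def make_param_dir(parent: str, params: dict) -> str:
    """Step 6: use the hash as a directory name and json.dump the original
    dict next to the stored MMIFs, for readability."""
    d = os.path.join(parent, param_hash(params))
    os.makedirs(d, exist_ok=True)
    with open(os.path.join(d, "parameters.non-mmif.json"), "w") as f:
        json.dump(params, f, indent=2)
    return d
```

Because the pair strings are sorted before hashing (step 3), the same parameter dict always yields the same directory name regardless of the order the parameters were given in.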
