MMIF storage API #50
Comments
One big piece I missed in the above description was runtime configurations of the apps. I think we can treat them just as if they were part of the app identification. Namely, instead of indexing on the app shortname and version alone, the runtime parameters would become another level of the index.

A few problems with this implementation using a directory structure:
Another aspect of the problem is that not all pipelines are "serial" (some components in the pipeline can run in parallel), although with some careful consideration and design, we should be able to serialize them into a single string identifier. For example, say that we want to 1) force-align a transcript, 2) run NER on the transcript, and 3) find NE temporal locations in the video. Component 2 (NER) does not rely on the output of component 1 (FA), and this is the case I mean by "in parallel" (they don't need to "run side-by-side" as software). One way to get a deterministic identifier anyway is sketched below.
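A minimal sketch of that serialization, using a deterministic topological sort that orders "parallel" apps lexicographically (the app names and the ":" separator here are hypothetical, not a settled convention):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

def canonical_pipeline_id(dependencies: dict[str, set[str]]) -> str:
    """Serialize a (possibly parallel) pipeline DAG into one deterministic string.

    `dependencies` maps an app identifier (e.g. "ner/v1.2") to the set of app
    identifiers whose output it consumes.
    """
    ts = TopologicalSorter(dependencies)
    ts.prepare()
    parts = []
    while ts.is_active():
        ready = sorted(ts.get_ready())  # fixed order for parallel components
        parts.extend(ready)
        ts.done(*ready)
    return ":".join(parts)

# FA and NER are independent ("in parallel"); NE-temporal-location needs both.
print(canonical_pipeline_id({
    "fa/v1.0": set(),
    "ner/v1.2": set(),
    "netl/v0.9": {"fa/v1.0", "ner/v1.2"},
}))  # -> fa/v1.0:ner/v1.2:netl/v0.9
```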
Here are some observations / thoughts / questions at the moment:

Indexing
Retrieval
No notes here; the proposed implementation of retrieval (using mmif-rewinder) seems like a perfectly valid approach.
For "params" part of the directory, instead of enumerating all the k-v pairs in the parameters dict, we can "hash" the entire dict as @MrSqually suggested. To mitigate the hurt in readability, we can throw in an additional file (cleared marked as "non-mmif" in the filename) with the original string values before the hashing. Practically, we do
|
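A minimal sketch of that idea; the hashing scheme and the sidecar filename convention are assumptions, not a settled design:

```python
import hashlib
import json
from pathlib import Path

def params_dirname(params: dict) -> str:
    """Hash the runtime parameter dict into a short, deterministic directory name."""
    canonical = json.dumps(params, sort_keys=True)  # stable across key orderings
    return hashlib.sha1(canonical.encode()).hexdigest()[:10]

def make_params_dir(app_dir: Path, params: dict) -> Path:
    """Create the hashed params directory plus a human-readable sidecar file."""
    digest = params_dirname(params)
    param_dir = app_dir / digest
    param_dir.mkdir(parents=True, exist_ok=True)
    # keep the original values next to the hash, clearly marked as non-MMIF
    (param_dir / f"{digest}.nonmmif.json").write_text(json.dumps(params, indent=2))
    return param_dir
```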
New Feature Summary
Right now, all the prediction/hypothesis/experimental MMIF output from CLAMS pipelines is pushed to this GitHub repo for evaluation, along with back-pointers in the report.md files. That works fine until we hit the storage limit on the GH repo. To future-proof, we'd like to develop a systematic way of storing and retrieving MMIF output files in a more spacious (and maybe more private) storage solution.
storage side
indexing
To hold a large collection of MMIF data, I'm proposing we implement a kind of trie-based indexing system. Actual files can be stored in, say, S3 buckets, or on lab servers (we have plenty of HDD space anyway). The envisioned trie implementation is simply based on the apps used in the MMIF file, split into shortname and version name. This way, all the necessary "configuration" for the store API is saved inside the data payload itself, and we don't have to come up with an additional configuration scheme for the store API. In other words, a user can call the store API with just the MMIF file itself.
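For a concrete sense of this, here is a minimal sketch of deriving the storage path from the payload alone. It assumes mmif-python's Mmif class, app identifiers of the form http://apps.clams.ai/&lt;shortname&gt;/&lt;version&gt; in each view's metadata, and a hypothetical storage root:

```python
from pathlib import Path
from mmif import Mmif  # pip install mmif-python

DATA_ROOT = Path("/some_data_root")  # hypothetical storage root

def storage_path(mmif_json: str, filename: str) -> Path:
    """Build the trie path purely from app identifiers recorded in the MMIF views."""
    parts = []
    for view in Mmif(mmif_json).views:
        # e.g. "http://apps.clams.ai/swt-detection/v5.0" -> ("swt-detection", "v5.0")
        shortname, version = view.metadata.app.rstrip("/").rsplit("/", 2)[-2:]
        parts += [shortname, version]
    return DATA_ROOT.joinpath(*parts) / filename
```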
storage API result example
For example, suppose we use a directory structure for indexing, and we want to store an MMIF file
cpb-aacip-xxxx.mmif
generated from a pipeline consisting of swt v5.0, doctr v2.0, and whisper v1.6. When the user sends the file,
the file is saved as
/some_data_root/swt/v5.0/doctr/v2.0/whisper/v1.6/cpb-aacip-xxxx.mmif
Then later, if we have a second MMIF file to store, using the same pipeline except now with whisper v2.0,
the file is saved as
/some_data_root/swt/v5.0/doctr/v2.0/whisper/v2.0/cpb-aacip-xxxx.mmif
This will result in a file system-based storage that, at this point, looks like this:
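```
/some_data_root/
└── swt/
    └── v5.0/
        └── doctr/
            └── v2.0/
                └── whisper/
                    ├── v1.6/
                    │   └── cpb-aacip-xxxx.mmif
                    └── v2.0/
                        └── cpb-aacip-xxxx.mmif
```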
retrieval side
retriever API argument structure
Now, on the retrieval side, a retrieval API should expect two string arguments: one specifying the pipeline, and one naming the file to retrieve.
simple retrieval
Then the retriever can convert the first argument into a directory path and look for the second argument in that directory, as in the sketch below.
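A minimal sketch, assuming the pipeline argument joins app/version pairs with ":" as in the example further down, and the same hypothetical storage root as above:

```python
from pathlib import Path

DATA_ROOT = Path("/some_data_root")  # hypothetical storage root

def retrieve(pipeline: str, filename: str) -> bytes:
    """Resolve e.g. pipeline="swt/v5.0:doctr/v2.0:whisper/v1.6" to a
    directory and read the requested MMIF file from it."""
    pipeline_dir = DATA_ROOT.joinpath(*pipeline.split(":"))
    return (pipeline_dir / filename).read_bytes()
```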
retrieval with rewind
However, in addition to the simple file retrieval, we can dynamically "rewind" MMIFs if any descendant MMIF exists. For example, if the user asked for
pipeline=swt/v5.0:doctr/v2.0
, even though the file is not stored in the storage system, the retriever can keep "walking down" the subdirectories until it finds the first MMIF, then use clamsproject/clams-python#190 to return a partial MMIF that meets the user's request.
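A sketch of that walk-down; the rewind() here is only a placeholder for the rewinder from clamsproject/clams-python#190, whose exact API isn't committed to in this sketch:

```python
import os
from pathlib import Path

DATA_ROOT = Path("/some_data_root")  # hypothetical storage root

def rewind(mmif_json: str, n_views: int) -> str:
    """Placeholder for the rewinder from clamsproject/clams-python#190."""
    raise NotImplementedError

def retrieve_with_rewind(pipeline: str, filename: str) -> str | None:
    """Find the nearest descendant MMIF below the requested pipeline
    directory, then rewind away the views added by downstream apps."""
    start = DATA_ROOT.joinpath(*pipeline.split(":"))
    for dirpath, dirnames, filenames in os.walk(start):
        dirnames.sort()  # deterministic walk order
        if filename in filenames:
            # each extra app adds two path segments: <shortname>/<version>
            extra_views = len(Path(dirpath).relative_to(start).parts) // 2
            return rewind((Path(dirpath) / filename).read_text(), extra_views)
    return None
```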
automatic garbage collection
Given the power of rewind, we can always delete any intermediate MMIF and keep only the files in the terminal subdirectories. This can be a cronjob (if using a file system), or more sophisticated DB management.
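As a cronjob, the cleanup could be a minimal sweep like the following, assuming the file-system layout above; it only deletes an MMIF when a same-named descendant exists deeper in the trie, so rewind can always re-derive the deleted file:

```python
from pathlib import Path

DATA_ROOT = Path("/some_data_root")  # hypothetical storage root

def prune_intermediate_mmifs(root: Path = DATA_ROOT) -> None:
    """Delete intermediate MMIFs that a descendant MMIF can be rewound to."""
    for mmif_path in root.rglob("*.mmif"):
        # descendants produced by longer pipelines carry the same filename
        descendants = [p for p in mmif_path.parent.rglob(mmif_path.name)
                       if p != mmif_path]
        if descendants:
            mmif_path.unlink()
```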
Related
Alternatives
No response
Additional context
No response