-
-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Word Alignment engine: #506
Conversation
b547740
to
f72b568
Compare
82c30fe
to
da20548
Compare
41d48d1
to
9b268f2
Compare
@ddaspit - this is finally ready for review. |
c760b43
to
70810a7
Compare
70810a7
to
a6cc484
Compare
8e0d367
to
5630625
Compare
c918168
to
0b9c983
Compare
cc5a66e
to
07f6ca4
Compare
I'll need to update the branch and otherwise might have some small TODOs floating around, but I think this PR is generally ready for review, @johnml1135 @ddaspit 🥳 |
|
One general question - I called the word alignment engine "Statistical" instead of "Nmt". I couldn't come up with a better name. |
Rebasing will be needed - there are a number of changes that have happened that will need to be merged in. |
This should be "8.0" |
I though this was already working. Was something wrong with it? |
Odd - I don't remember needing to redo the Echo translation engine much. Was this to handle the ParallelCorpus changes? |
Do we know why the translation platform service was not enabled before? Was it split out? |
Just trying to remember - the "engine commit service" is all about training pairs and committing the new engine (something done in SMT but not WA). Thot is needed for both, so it is specified separately. |
It's a configuration limitation and duplication (having SMT and WA share the same drive), but good enough for right now. I see no compelling reason to change it at this time. |
So, 3 options: Currently, I am for A. |
I see. What about something like |
I've noticed some changes are needed to the |
Done. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 5 files at r26, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @Enkidu93)
src/Serval/src/Serval.Client/Client.g.cs
line 10555 at r25 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
I see. What about something like
ScoredWordPair
orScoredAlignedWordPair
or is it better just to revert like you said?
This is a bigger issue than just AlignedWordPairDto
. This also impacts ParallelCorpusFilterConfigDto
, ParallelCorpusFilterDto
, TrainingCorpusConfigDto
, and TrainingCorpusDto
. The name collisions happen in the Swagger JSON file as well. This means that we need to make a breaking change to get this right. We have two options:
- Prefix all DTO names with "Translation" or "WordAlignment". This was what I originally was trying to do, but clearly failed, since I allowed some DTO names without the prefix. We can do this automatically using the
ServalSchemaNameGenerator
. We could remove the prefix from DTOs that already have the prefix, i.e.TranslationEngineDto
could be renamed toEngineDto
. - Add custom hooks in NSwag to generate DTOs in "Serval.Client" with different namespaces. For example,
Serval.Translation.AlignedWordPairDto
would be generated in the client asServal.Translation.AlignedWordPair
instead ofServal.Client.AlignedWordPair
. The translation engines client interface would be generated asServal.Translation.ITranslationEnginesClient
. I don't think NSwag supports this out-of-the-box. We would need to create our own console application that calls NSwag. See this issue for more information.
We should probably meet to discuss what approach we want to take.
src/Machine/src/Serval.Machine.Shared/Services/ServalWordAlignmentPlatformOutboxConstants.cs
line 13 at r26 (raw file):
public const string BuildRestarting = "BuildRestarting"; public const string IncrementTrainEngineCorpusSize = "IncrementTrainEngineCorpusSize"; public const string InsertWordAlignmentResults = "InsertWordAlignmentResults";
This can be shortened to InsertWordAlignments
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 229 of 233 files reviewed, 2 unresolved discussions (waiting on @ddaspit)
src/Machine/src/Serval.Machine.Shared/Services/ServalWordAlignmentPlatformOutboxConstants.cs
line 13 at r26 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
This can be shortened to
InsertWordAlignments
.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 229 of 233 files reviewed, 2 unresolved discussions (waiting on @ddaspit)
src/Serval/src/Serval.Client/Client.g.cs
line 10555 at r25 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
This is a bigger issue than just
AlignedWordPairDto
. This also impactsParallelCorpusFilterConfigDto
,ParallelCorpusFilterDto
,TrainingCorpusConfigDto
, andTrainingCorpusDto
. The name collisions happen in the Swagger JSON file as well. This means that we need to make a breaking change to get this right. We have two options:
- Prefix all DTO names with "Translation" or "WordAlignment". This was what I originally was trying to do, but clearly failed, since I allowed some DTO names without the prefix. We can do this automatically using the
ServalSchemaNameGenerator
. We could remove the prefix from DTOs that already have the prefix, i.e.TranslationEngineDto
could be renamed toEngineDto
.- Add custom hooks in NSwag to generate DTOs in "Serval.Client" with different namespaces. For example,
Serval.Translation.AlignedWordPairDto
would be generated in the client asServal.Translation.AlignedWordPair
instead ofServal.Client.AlignedWordPair
. The translation engines client interface would be generated asServal.Translation.ITranslationEnginesClient
. I don't think NSwag supports this out-of-the-box. We would need to create our own console application that calls NSwag. See this issue for more information.We should probably meet to discuss what approach we want to take.
OK. How 'breaking' of a change would that be? Like, Serval v2 kinda breaking? Could we just prefix the WordAlignment data classes with WordAlignment
for now (since Serval = CAT = translation-first, i.e., translation is kinda the default for Serval), and then when we, in the near future, remove the deprecated corpora and make a v2, we could fix this as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 218 of 237 files reviewed, 2 unresolved discussions (waiting on @ddaspit and @johnml1135)
src/Serval/src/Serval.Client/Client.g.cs
line 10555 at r25 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
OK. How 'breaking' of a change would that be? Like, Serval v2 kinda breaking? Could we just prefix the WordAlignment data classes with
WordAlignment
for now (since Serval = CAT = translation-first, i.e., translation is kinda the default for Serval), and then when we, in the near future, remove the deprecated corpora and make a v2, we could fix this as well?
Done. I created an issue to track this. For now, these dtos have been moved to Serval.Shared.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 4 files at r27, 15 of 15 files at r28, all commit messages.
Reviewable status:complete! all files reviewed, all discussions resolved (waiting on @johnml1135)
@Enkidu93 - can you rebase and make sure that Machine is at the latest version? That should enable all the tests to pass again. |
Just pushed to update Machine. Yes, I'll go ahead and rebase. |
b90b997
to
8fc5412
Compare
Finally! 😁 @johnml1135, I think the 3.6.0 release was done before sillsdev/machine#287 made it in |
@Enkidu93 the tests are still failing and there are some conflicts that need to be resolved. |
* Add HTTP API * Add Grpc * Add Echo (as single engine) * Add Serval services * Add webhooks * Update docker-compose * Basic echo works * Machine Engine updates are made * Add Word Alignment tests * Fix Translation Engine - parallel corpus both TextIds and ScriptureRange should cause an * Use matching machine branch for CI tests * Update to MongoDB 8.0 * Add new word alignment platform pipeline * Add some integration tests (broken)
Add cleanup service tests for word alignment Fix flaky test Fix naming Fix more names Passing statistical engine service tests - first pass Add hangfire implementation and tests Refactor alignment data structure; revert to pretranslate/word-align where appropriate Use parallel data when inferencing for word alignment Fix JSON serialization Rebase-related changes: extend executionData, commit-related issues to WA, small rebase mistakes Get rid of WordAlignmentResult Remove TODO
Separate translation and WA alignedwordpair; add score; fix usings
…et; add note to API
04b331a
to
c973ab6
Compare
This change is