Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Word Alignment engine: #506

Merged
merged 13 commits into from
Feb 28, 2025
Merged

Add Word Alignment engine: #506

merged 13 commits into from
Feb 28, 2025

Conversation

johnml1135
Copy link
Collaborator

@johnml1135 johnml1135 commented Oct 9, 2024

  • Add HTTP API
  • Add Grpc
  • Add Echo
  • Add Serval services
  • Add machine level implementation
  • Add tests

This change is Reviewable

@johnml1135 johnml1135 requested a review from ddaspit October 9, 2024 16:38
@codecov-commenter
Copy link

codecov-commenter commented Oct 9, 2024

Codecov Report

Attention: Patch coverage is 68.02105% with 1094 lines in your changes missing coverage. Please review.

Project coverage is 64.66%. Comparing base (739aa19) to head (c973ab6).

Files with missing lines Patch % Lines
....Shared/Configuration/IMachineBuilderExtensions.cs 0.00% 143 Missing ⚠️
...red/Services/ServalWordAlignmentEngineServiceV1.cs 0.00% 133 Missing ⚠️
...red/Services/ServalWordAlignmentPlatformService.cs 0.00% 102 Missing ⚠️
...e.Shared/Services/ThotWordAlignmentModelFactory.cs 0.00% 63 Missing ⚠️
...src/Serval.WordAlignment/Services/EngineService.cs 88.16% 51 Missing and 7 partials ⚠️
...l.Machine.Shared/Services/ClearMLMonitorService.cs 9.43% 48 Missing ⚠️
...ServalWordAlignmentPlatformOutboxMessageHandler.cs 68.57% 40 Missing and 4 partials ⚠️
...achine.Shared/Services/StatisticalTrainBuildJob.cs 69.34% 37 Missing and 5 partials ⚠️
...hine.Shared/Services/TranslationBuildJobService.cs 0.00% 42 Missing ⚠️
...ignment/Services/WordAlignmentPlatformServiceV1.cs 88.56% 22 Missing and 13 partials ⚠️
... and 57 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #506      +/-   ##
==========================================
+ Coverage   64.18%   64.66%   +0.47%     
==========================================
  Files         281      341      +60     
  Lines       14210    18697    +4487     
  Branches     1847     2429     +582     
==========================================
+ Hits         9121    12090    +2969     
- Misses       4427     5731    +1304     
- Partials      662      876     +214     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@johnml1135 johnml1135 force-pushed the word_alignment_try_2 branch 3 times, most recently from b547740 to f72b568 Compare October 14, 2024 16:58
@johnml1135 johnml1135 force-pushed the word_alignment_try_2 branch from 82c30fe to da20548 Compare October 24, 2024 19:38
@johnml1135 johnml1135 force-pushed the word_alignment_try_2 branch 4 times, most recently from 41d48d1 to 9b268f2 Compare November 6, 2024 16:00
@johnml1135
Copy link
Collaborator Author

@ddaspit - this is finally ready for review.

@johnml1135 johnml1135 force-pushed the word_alignment_try_2 branch 2 times, most recently from c760b43 to 70810a7 Compare November 12, 2024 20:03
@johnml1135 johnml1135 force-pushed the word_alignment_try_2 branch 2 times, most recently from c918168 to 0b9c983 Compare December 9, 2024 17:37
@johnml1135 johnml1135 requested a review from Enkidu93 December 16, 2024 12:30
@Enkidu93
Copy link
Collaborator

I'll need to update the branch and otherwise might have some small TODOs floating around, but I think this PR is generally ready for review, @johnml1135 @ddaspit 🥳

@johnml1135
Copy link
Collaborator Author

-- commits line 47 at r4:
When you get to it, can you clean up the commit message and rebase?

@johnml1135
Copy link
Collaborator Author

docker-compose.yml line 26 at r4 (raw file):

      - ASPNETCORE_WordAlignment__Engines__0__Type=EchoWordAlignment
      - ASPNETCORE_WordAlignment__Engines__0__Address=http://echo
      - ASPNETCORE_WordAlignment__Engines__1__Type=Statistical

One general question - I called the word alignment engine "Statistical" instead of "Nmt". I couldn't come up with a better name.

@johnml1135
Copy link
Collaborator Author

Rebasing will be needed - there are a number of changes that have happened that will need to be merged in.

@johnml1135
Copy link
Collaborator Author

.github/workflows/ci.yml line 22 at r4 (raw file):

        uses: supercharge/[email protected]
        with:
          mongodb-version: "6.0"

This should be "8.0"

@johnml1135
Copy link
Collaborator Author

.github/workflows/ci.yml line 30 at r4 (raw file):

        continue-on-error: true
        if: ${{ github.ref_name }} != "main"
        run: cd .. && git clone https://github.com/sillsdev/machine.git --branch ${{ github.ref_name }} && dotnet build machine

I though this was already working. Was something wrong with it?

@johnml1135
Copy link
Collaborator Author

src/Echo/src/EchoEngine/TranslationEngineServiceV1.cs line 81 at r4 (raw file):

                {
                    using (
                        AsyncClientStreamingCall<InsertPretranslationsRequest, Empty> call =

Odd - I don't remember needing to redo the Echo translation engine much. Was this to handle the ParallelCorpus changes?

@johnml1135
Copy link
Collaborator Author

src/Machine/src/Serval.Machine.EngineServer/Program.cs line 14 at r4 (raw file):

    .AddServalTranslationEngineService()
    .AddServalWordAlignmentEngineService()
    .AddServalTranslationPlatformService()

Do we know why the translation platform service was not enabled before? Was it split out?

@johnml1135
Copy link
Collaborator Author

src/Machine/src/Serval.Machine.Shared/Configuration/IMachineBuilderExtensions.cs line 203 at r4 (raw file):

                case EngineType.SmtTransfer:
                    builder.Services.AddSingleton<SmtTransferEngineStateService>();
                    builder.Services.AddHostedService<SmtTransferEngineCommitService>();

Just trying to remember - the "engine commit service" is all about training pairs and committing the new engine (something done in SMT but not WA). Thot is needed for both, so it is specified separately.

@johnml1135
Copy link
Collaborator Author

src/Machine/src/Serval.Machine.Shared/Configuration/IMachineBuilderExtensions.cs line 534 at r4 (raw file):

            throw new InvalidOperationException("SMT Engine and Statistical directory is required");
        if (smtDriveLetter != statisticsDriveLetter)
            throw new InvalidOperationException("SMT Engine and Statistical directory must be on the same drive");

It's a configuration limitation and duplication (having SMT and WA share the same drive), but good enough for right now. I see no compelling reason to change it at this time.

@johnml1135
Copy link
Collaborator Author

src/Machine/src/Serval.Machine.Shared/Models/WordAlignment.cs line 10 at r4 (raw file):

    public required IReadOnlyList<string> SourceTokens { get; set; }
    public required IReadOnlyList<string> TargetTokens { get; set; }
    public required IReadOnlyList<double> Confidences { get; set; } //TODO It seems to me that it'd more natural to have the confidence as part of the word pair object - but I understand that this is currently not the case with the translation result; would it be breaking to change it there too?

So, 3 options:
A. Leave as is
B. Switch both to put confidences in "AlignedWordPair"
C. Make a new structure that just has the pairs, and not the confidences?

Currently, I am for A.

@Enkidu93
Copy link
Collaborator

Reviewed 15 of 15 files at r25, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @johnml1135)

src/Serval/src/Serval.Client/Client.g.cs line 10555 at r25 (raw file):

    [System.CodeDom.Compiler.GeneratedCode("NJsonSchema", "14.1.0.0 (NJsonSchema v11.0.2.0 (Newtonsoft.Json v13.0.0.0))")]
    public partial class AlignedWordPair2

The name of this class is not good. We either need to have different names for the translation and alignment DTOs or just use the same DTO for both and make sure Score is optional. I tried to prefix DTOs with "Translation" in order to avoid this kind of name collision. Unfortunately, I did not do that with AlignedWordPair. I don't want to break the translation API in "Serval.Client", so we probably have to share the DTO for both contexts. Basically, you will need to revert the split of the aligned word pair model/DTO.

I see. What about something like ScoredWordPair or ScoredAlignedWordPair or is it better just to revert like you said?

@Enkidu93
Copy link
Collaborator

I've noticed some changes are needed to the ServalWordAlignmentPlatformOutboxMessageHandler (and accompanying tests), so I'm working on that.

@Enkidu93
Copy link
Collaborator

I've noticed some changes are needed to the ServalWordAlignmentPlatformOutboxMessageHandler (and accompanying tests), so I'm working on that.

Done.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 5 files at r26, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @Enkidu93)


src/Serval/src/Serval.Client/Client.g.cs line 10555 at r25 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

I see. What about something like ScoredWordPair or ScoredAlignedWordPair or is it better just to revert like you said?

This is a bigger issue than just AlignedWordPairDto. This also impacts ParallelCorpusFilterConfigDto, ParallelCorpusFilterDto, TrainingCorpusConfigDto, and TrainingCorpusDto. The name collisions happen in the Swagger JSON file as well. This means that we need to make a breaking change to get this right. We have two options:

  1. Prefix all DTO names with "Translation" or "WordAlignment". This was what I originally was trying to do, but clearly failed, since I allowed some DTO names without the prefix. We can do this automatically using the ServalSchemaNameGenerator. We could remove the prefix from DTOs that already have the prefix, i.e. TranslationEngineDto could be renamed to EngineDto.
  2. Add custom hooks in NSwag to generate DTOs in "Serval.Client" with different namespaces. For example, Serval.Translation.AlignedWordPairDto would be generated in the client as Serval.Translation.AlignedWordPair instead of Serval.Client.AlignedWordPair. The translation engines client interface would be generated as Serval.Translation.ITranslationEnginesClient. I don't think NSwag supports this out-of-the-box. We would need to create our own console application that calls NSwag. See this issue for more information.

We should probably meet to discuss what approach we want to take.


src/Machine/src/Serval.Machine.Shared/Services/ServalWordAlignmentPlatformOutboxConstants.cs line 13 at r26 (raw file):

    public const string BuildRestarting = "BuildRestarting";
    public const string IncrementTrainEngineCorpusSize = "IncrementTrainEngineCorpusSize";
    public const string InsertWordAlignmentResults = "InsertWordAlignmentResults";

This can be shortened to InsertWordAlignments.

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 229 of 233 files reviewed, 2 unresolved discussions (waiting on @ddaspit)


src/Machine/src/Serval.Machine.Shared/Services/ServalWordAlignmentPlatformOutboxConstants.cs line 13 at r26 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This can be shortened to InsertWordAlignments.

Done.

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 229 of 233 files reviewed, 2 unresolved discussions (waiting on @ddaspit)


src/Serval/src/Serval.Client/Client.g.cs line 10555 at r25 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This is a bigger issue than just AlignedWordPairDto. This also impacts ParallelCorpusFilterConfigDto, ParallelCorpusFilterDto, TrainingCorpusConfigDto, and TrainingCorpusDto. The name collisions happen in the Swagger JSON file as well. This means that we need to make a breaking change to get this right. We have two options:

  1. Prefix all DTO names with "Translation" or "WordAlignment". This was what I originally was trying to do, but clearly failed, since I allowed some DTO names without the prefix. We can do this automatically using the ServalSchemaNameGenerator. We could remove the prefix from DTOs that already have the prefix, i.e. TranslationEngineDto could be renamed to EngineDto.
  2. Add custom hooks in NSwag to generate DTOs in "Serval.Client" with different namespaces. For example, Serval.Translation.AlignedWordPairDto would be generated in the client as Serval.Translation.AlignedWordPair instead of Serval.Client.AlignedWordPair. The translation engines client interface would be generated as Serval.Translation.ITranslationEnginesClient. I don't think NSwag supports this out-of-the-box. We would need to create our own console application that calls NSwag. See this issue for more information.

We should probably meet to discuss what approach we want to take.

OK. How 'breaking' of a change would that be? Like, Serval v2 kinda breaking? Could we just prefix the WordAlignment data classes with WordAlignment for now (since Serval = CAT = translation-first, i.e., translation is kinda the default for Serval), and then when we, in the near future, remove the deprecated corpora and make a v2, we could fix this as well?

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 218 of 237 files reviewed, 2 unresolved discussions (waiting on @ddaspit and @johnml1135)


src/Serval/src/Serval.Client/Client.g.cs line 10555 at r25 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

OK. How 'breaking' of a change would that be? Like, Serval v2 kinda breaking? Could we just prefix the WordAlignment data classes with WordAlignment for now (since Serval = CAT = translation-first, i.e., translation is kinda the default for Serval), and then when we, in the near future, remove the deprecated corpora and make a v2, we could fix this as well?

Done. I created an issue to track this. For now, these dtos have been moved to Serval.Shared.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm_strong: Yay! 🎉

Reviewed 4 of 4 files at r27, 15 of 15 files at r28, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @johnml1135)

@johnml1135
Copy link
Collaborator Author

@Enkidu93 - can you rebase and make sure that Machine is at the latest version? That should enable all the tests to pass again.

@Enkidu93
Copy link
Collaborator

@Enkidu93 - can you rebase and make sure that Machine is at the latest version? That should enable all the tests to pass again.

Just pushed to update Machine. Yes, I'll go ahead and rebase.

@Enkidu93 Enkidu93 force-pushed the word_alignment_try_2 branch from b90b997 to 8fc5412 Compare February 27, 2025 21:44
@Enkidu93
Copy link
Collaborator

Enkidu93 commented Feb 27, 2025

:lgtm_strong: Yay! 🎉

Reviewed 4 of 4 files at r27, 15 of 15 files at r28, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @johnml1135)

Finally! 😁

@johnml1135, I think the 3.6.0 release was done before sillsdev/machine#287 made it in

@johnml1135
Copy link
Collaborator Author

@Enkidu93 the tests are still failing and there are some conflicts that need to be resolved.

johnml1135 and others added 13 commits February 28, 2025 10:11
* Add HTTP API
* Add Grpc
* Add Echo (as single engine)
* Add Serval services
* Add webhooks
* Update docker-compose
* Basic echo works
* Machine Engine updates are made
* Add Word Alignment tests
* Fix Translation Engine - parallel corpus both TextIds and ScriptureRange should cause an * Use matching machine branch for CI tests
* Update to MongoDB 8.0
* Add new word alignment platform pipeline
* Add some integration tests (broken)
Add cleanup service tests for word alignment

Fix flaky test

Fix naming

Fix more names

Passing statistical engine service tests - first pass

Add hangfire implementation and tests

Refactor alignment data structure; revert to pretranslate/word-align where appropriate

Use parallel data when inferencing for word alignment

Fix JSON serialization

Rebase-related changes: extend executionData, commit-related issues to WA, small rebase mistakes

Get rid of WordAlignmentResult

Remove TODO
Separate translation and WA alignedwordpair; add score; fix usings
@Enkidu93 Enkidu93 force-pushed the word_alignment_try_2 branch from 04b331a to c973ab6 Compare February 28, 2025 15:12
@Enkidu93 Enkidu93 merged commit 8d4987c into main Feb 28, 2025
2 of 3 checks passed
@Enkidu93 Enkidu93 deleted the word_alignment_try_2 branch February 28, 2025 15:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants