
Project Extension #2

Open
AmitMY opened this issue Nov 20, 2024 · 1 comment
Comments

@AmitMY
Contributor

AmitMY commented Nov 20, 2024

Now that we know that using a translation model is beneficial, we would like to make it more robust.
Specifically:

  1. We find that the model works decently when the input is a single word or a short sentence,
    but not when the input is a long sentence or a paragraph. (In practice, we split text into sentences before translating, but this is not ideal, since it discards context-dependent information.)
  2. The model may not be robust to simple semantic variations (e.g., "desk" vs. "table"), likely because it is trained
    from scratch in a low-data setting.
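The sentence-splitting workaround from point 1 can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: `translate_sentence` is a hypothetical stand-in for the translation model, and the naive regex splitter stands in for a proper segmenter.

```python
import re

def split_sentences(text: str) -> list[str]:
    # Naive split on sentence-final punctuation followed by whitespace.
    # A real pipeline would use a proper sentence segmenter.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def translate_paragraph(text: str, translate_sentence) -> str:
    # Each sentence is translated independently, which is exactly why
    # context that spans sentence boundaries (pronouns, topic) is lost.
    return " ".join(translate_sentence(s) for s in split_sentences(text))
```

Because each call to `translate_sentence` sees only one sentence, any information carried across sentence boundaries is invisible to the model, which is the limitation described above.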

To address these issues, we propose curating multiple data sources and fine-tuning LLMs.

  1. The parallel data from SignBank+ is of good quality (though not perfect).
  2. We can use monolingual data together with language models to generate synthetic sentence-level data.
    This would be similar to this paper, replacing the rule-based approach with a large language model.
  3. Key phrases can be extracted from the SignBank+ data and understood as "template + slots";
    templates that include fingerspelling can be used to generate high-quality synthetic data by replacing the fingerspelled entity.
  4. Large sign language translation datasets can be automatically segmented and transcribed. This will create a large multilingual, parallel, document-level dataset with low-quality SignWriting.
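The "template + slots" idea in point 3 can be sketched as below. Everything here is illustrative: the template format, the `fill_template`/`fingerspell` helpers, and the letter-joining "fingerspelling" are made-up stand-ins for a real SignWriting letter-to-sign mapping extracted from SignBank+.

```python
def fingerspell(word: str) -> str:
    # Placeholder: join per-letter "signs" with dashes. A real system
    # would emit the SignWriting symbol for each letter instead.
    return "-".join(word.upper())

def fill_template(template: dict, entity: str) -> tuple[str, str]:
    """Substitute an entity into both sides of a parallel template."""
    spoken = template["spoken"].replace("{entity}", entity)
    # On the SignWriting side, the slot is re-fingerspelled, so the
    # pair stays parallel for any substituted entity.
    signed = template["signed"].replace("{entity}", fingerspell(entity))
    return spoken, signed

# One extracted template yields many synthetic parallel pairs.
template = {"spoken": "My name is {entity}.", "signed": "NAME {entity}"}
pairs = [fill_template(template, name) for name in ["Anna", "Omar"]]
```

The value of this scheme is that a single verified template multiplies into many correct parallel examples, with the only variation confined to the fingerspelled slot.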

Once the data is collected, we will need to find a training recipe that makes sense with
multiple languages and varying data proportions, for either translation direction.
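One common recipe for mixing sources of very different sizes and qualities is temperature-based sampling, where a temperature below 1.0 upsamples small sources (e.g., the high-quality SignBank+ parallel data) relative to large noisy ones (e.g., automatically transcribed data). The sketch below uses made-up source names and sizes; it is one option, not a decided recipe.

```python
def sampling_weights(sizes: dict[str, int], temperature: float = 0.5) -> dict[str, float]:
    # Raise each source's size to a power < 1, then renormalize.
    # temperature=1.0 reproduces size-proportional sampling;
    # temperature=0.0 samples all sources uniformly.
    scaled = {name: n ** temperature for name, n in sizes.items()}
    total = sum(scaled.values())
    return {name: s / total for name, s in scaled.items()}

# Hypothetical sizes for the data sources proposed above.
sizes = {"signbank_plus": 100_000, "synthetic": 1_000_000, "auto_transcribed": 10_000_000}
weights = sampling_weights(sizes)
```

With these numbers, the SignBank+ source is sampled far more often than its raw share of the data, which keeps the small high-quality source from being drowned out.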

We will treat the existing models as baselines, and evaluate SignWriting output using signwriting-evaluation.

@AmitMY
Contributor Author

AmitMY commented Jan 6, 2025

If the approach relies heavily on linguistic information, similar to this paper, there are some books, such as for ASL or BSL - I own PDF versions.

If we instead rely on examples, this ASL phrase book can be useful. This edges into the territory of sign-gpt, where perhaps we can train a large model on all of this information, and then use the model to generate new data, useful as a synthetic baseline for translation.
