
Automatically batch texts when too long #6

Open
dennlinger opened this issue Dec 14, 2021 · 0 comments
dennlinger commented Dec 14, 2021

We currently have no strategy in place for samples that exceed the 512 subword token limit.
This is both unwanted and relatively easy to improve. There are a few considerations regarding the exact strategy, but a good starting point seems to be approximating sentence boundaries with a lightweight spaCy model and then chunking based on an approximate maximum length, as sketched below.
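A minimal sketch of what this could look like, assuming spaCy's `en_core_web_sm` pipeline for sentence splitting and a Hugging Face tokenizer as a stand-in for the actual subword tokenizer (both model names are placeholders, not decisions for this repository):

```python
from typing import List

import spacy
from transformers import AutoTokenizer

# Lightweight pipeline; only sentence boundaries are needed.
nlp = spacy.load("en_core_web_sm", disable=["ner", "lemmatizer"])
# Placeholder tokenizer used to approximate the subword token count.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")


def chunk_text(text: str, max_tokens: int = 512) -> List[str]:
    """Split text into sentence-aligned chunks of at most max_tokens subwords."""
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0

    for sent in nlp(text).sents:
        sent_len = len(tokenizer.tokenize(sent.text))
        # Start a new chunk if adding this sentence would exceed the limit.
        if current and current_len + sent_len > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent.text)
        current_len += sent_len

    if current:
        chunks.append(" ".join(current))
    return chunks
```

One open question is how to handle a single sentence that already exceeds the limit on its own; the sketch above would emit it as an over-length chunk, so some fallback (e.g. hard truncation) would still be needed. The count from `tokenizer.tokenize` also excludes special tokens, so the effective budget should probably be slightly below 512.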

@dennlinger dennlinger self-assigned this Dec 14, 2021
@dennlinger dennlinger added the bug and enhancement labels Dec 14, 2021