tokenizers/sentencepiece: Improve performance by removing allocations #42

Merged (1 commit) on Jun 8, 2024

@damz damz commented May 30, 2024

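The sentencepiece tokenizer is slow, mostly because it allocates heavily. This PR removes the low-hanging allocations. On the benchmark below, that cuts time per op by about 93%, bytes allocated by about 99%, and allocations per op by about 95%.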

```
goos: linux
goarch: amd64
pkg: github.com/nlpodyssey/cybertron/pkg/tokenizers/sentencepiece/internal/sentencepiece
cpu: Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz
                                      │   old.txt    │               new.txt               │
                                      │    sec/op    │   sec/op     vs base                │
SentencePiece/compose_email_to_joh-36   2235.6µ ± 1%   147.1µ ± 1%  -93.42% (p=0.000 n=10)

                                      │    old.txt     │               new.txt                │
                                      │      B/op      │     B/op      vs base                │
SentencePiece/compose_email_to_joh-36   3289.19Ki ± 0%   27.02Ki ± 0%  -99.18% (p=0.000 n=10)

                                      │   old.txt    │              new.txt               │
                                      │  allocs/op   │ allocs/op   vs base                │
SentencePiece/compose_email_to_joh-36   1830.00 ± 0%   92.00 ± 0%  -94.97% (p=0.000 n=10)
```
@codecov-commenter

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 40.53%. Comparing base (d0c62f8) to head (88b93f4).

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #42      +/-   ##
==========================================
- Coverage   40.65%   40.53%   -0.13%     
==========================================
  Files          16       16              
  Lines        1429     1426       -3     
==========================================
- Hits          581      578       -3     
  Misses        826      826              
  Partials       22       22              


@matteo-grella matteo-grella merged commit a7ba5c1 into nlpodyssey:main Jun 8, 2024
4 checks passed