[TOKENIZER EXAMPLE]
Kye committed Dec 7, 2023
1 parent 0ce0b51 commit 5324ce3
Showing 3 changed files with 36 additions and 10 deletions.
24 changes: 24 additions & 0 deletions README.md
@@ -29,6 +29,7 @@ To implement this model effectively, I intend to initially focus on the image em
- qk norm
- no pos embeds
- kv cache

```python
import torch
from gemini_torch import Gemini
@@ -108,6 +109,29 @@ print(y.shape)
```
------



## Tokenizer
- We use the same tokenizer as LLaMA, with special tokens marking the beginning and end of each modality's tokens.
- It does not yet fully process images, audio, or video; help with this is welcome.

```python
from gemini_torch.tokenizer import MultimodalSentencePieceTokenizer

# Example usage
tokenizer_name = "hf-internal-testing/llama-tokenizer"
tokenizer = MultimodalSentencePieceTokenizer(tokenizer_name=tokenizer_name)

# Encoding and decoding examples
encoded_audio = tokenizer.encode("Audio description", modality="audio")
decoded_audio = tokenizer.decode(encoded_audio)

print("Encoded audio:", encoded_audio)
print("Decoded audio:", decoded_audio)


```
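The idea of wrapping a modality's tokens in begin and end markers can be sketched without any dependencies. This is a toy illustration only, not the `MultimodalSentencePieceTokenizer` implementation: the class name `ToyMultimodalTokenizer`, the marker strings such as `<audio>` and `</audio>`, and the whitespace splitting are all assumptions made for the example.

```python
class ToyMultimodalTokenizer:
    """Whitespace tokenizer that wraps a segment in modality boundary tokens.

    Hypothetical sketch: marker names and vocabulary handling are assumptions,
    not the behaviour of gemini_torch or of the LLaMA tokenizer.
    """

    def __init__(self, modalities=("img", "audio")):
        self.vocab = {}
        self.modality_tokens = {}
        for name in modalities:
            # Reserve one start and one end id per modality.
            start = self._add(f"<{name}>")
            end = self._add(f"</{name}>")
            self.modality_tokens[name] = (start, end)

    def _add(self, piece):
        # Return the existing id, or assign the next free one.
        return self.vocab.setdefault(piece, len(self.vocab))

    def encode(self, text, modality=None):
        ids = [self._add(word) for word in text.split()]
        if modality is not None:
            start, end = self.modality_tokens[modality]
            ids = [start] + ids + [end]
        return ids

    def decode(self, ids):
        inverse = {index: piece for piece, index in self.vocab.items()}
        return " ".join(inverse[index] for index in ids)


toy = ToyMultimodalTokenizer()
encoded = toy.encode("Audio description", modality="audio")
decoded = toy.decode(encoded)
print(encoded, decoded)
```

A real tokenizer would register the markers as special tokens in the underlying vocabulary so they are never split into sub-pieces; the toy version only shows where the boundaries sit in the id sequence.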

### `ImgToTransformer`
- Takes in an image, splits it into patches, and reshapes them to [B, SEQLEN, Dim] to match the transformer's input shape.
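
The patch-and-reshape step can be illustrated with a dependency-free sketch. It is not the `ImgToTransformer` code: the function name `image_to_patch_sequence`, the nested-list `[B][C][H][W]` input layout, and the absence of any learned projection are assumptions made for the example. In practice this would be done with tensor reshapes rather than Python loops.

```python
def image_to_patch_sequence(images, patch_size):
    """Turn images shaped [B][C][H][W] into sequences shaped [B][SEQLEN][DIM].

    SEQLEN = (H // patch_size) * (W // patch_size)
    DIM    = C * patch_size * patch_size
    Hypothetical helper for illustration only.
    """
    batch = []
    for image in images:
        channels = len(image)
        height, width = len(image[0]), len(image[0][0])
        if height % patch_size or width % patch_size:
            raise ValueError("height and width must be divisible by patch_size")
        sequence = []
        # Walk the patch grid row by row; each patch becomes one sequence element.
        for top in range(0, height, patch_size):
            for left in range(0, width, patch_size):
                patch = [
                    image[c][top + dy][left + dx]
                    for c in range(channels)
                    for dy in range(patch_size)
                    for dx in range(patch_size)
                ]
                sequence.append(patch)
        batch.append(sequence)
    return batch


# Two 3-channel 4x4 images filled with zeros, split into 2x2 patches.
images = [[[[0.0] * 4 for _ in range(4)] for _ in range(3)] for _ in range(2)]
tokens = image_to_patch_sequence(images, patch_size=2)
print(len(tokens), len(tokens[0]), len(tokens[0][0]))
```

The three printed numbers are B, SEQLEN, and Dim for the flattened patches, before any projection to the model dimension.
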
```python
10 changes: 0 additions & 10 deletions gemini_torch/tokenizer.py
@@ -150,13 +150,3 @@ def decode(self, tokens: List[int]) -> str:
return self.sp_model.decode(tokens)


# Example usage
tokenizer_name = "hf-internal-testing/llama-tokenizer"
tokenizer = MultimodalSentencePieceTokenizer(tokenizer_name=tokenizer_name)

# Encoding and decoding examples
encoded_audio = tokenizer.encode("Audio description", modality="audio")
decoded_audio = tokenizer.decode(encoded_audio)

print("Encoded audio:", encoded_audio)
print("Decoded audio:", decoded_audio)
12 changes: 12 additions & 0 deletions tokenizer.py
@@ -0,0 +1,12 @@
from gemini_torch.tokenizer import MultimodalSentencePieceTokenizer

# Example usage
tokenizer_name = "hf-internal-testing/llama-tokenizer"
tokenizer = MultimodalSentencePieceTokenizer(tokenizer_name=tokenizer_name)

# Encoding and decoding examples
encoded_audio = tokenizer.encode("Audio description", modality="audio")
decoded_audio = tokenizer.decode(encoded_audio)

print("Encoded audio:", encoded_audio)
print("Decoded audio:", decoded_audio)
