Add normalize_embeddings Argument to SentenceTransformer for Simplified Embedding Normalization #3064

AIMacGyver opened this issue on Nov 17, 2024
Currently, embedding normalization in SentenceTransformer can be achieved in two ways:

  1. Adding a Normalize module to the model pipeline
  2. Manually normalizing embeddings post-encode

Both approaches work but can add complexity and may not align seamlessly with production deployment workflows.
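
For reference, a minimal sketch of both current approaches (the checkpoint name below is just an example):

import torch.nn.functional as F
from sentence_transformers import SentenceTransformer, models

# Approach 1: bake a Normalize module into the model pipeline
word_embedding = models.Transformer("sentence-transformers/all-MiniLM-L6-v2")
pooling = models.Pooling(word_embedding.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding, pooling, models.Normalize()])

# Approach 2: normalize manually after encoding
plain_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = plain_model.encode(["This is a test sentence."], convert_to_tensor=True)
embeddings = F.normalize(embeddings, p=2, dim=-1)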

Feature Request:
Add normalize_embeddings as an argument to SentenceTransformer.__init__ that would be passed through to encode methods, similar to how truncate_dim works. This would provide a cleaner, built-in way to control normalization behavior.
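
To illustrate the intent, here is a rough sketch written as a thin wrapper (the class name is hypothetical, and it simply forwards a stored default to the existing encode-level flag, not how the library would necessarily implement it):

from sentence_transformers import SentenceTransformer


class DefaultNormalizeSentenceTransformer(SentenceTransformer):
    # Hypothetical illustration of the proposed init-level default
    def __init__(self, *args, normalize_embeddings: bool = False, **kwargs):
        super().__init__(*args, **kwargs)
        self._default_normalize = normalize_embeddings

    def encode(self, *args, **kwargs):
        # Apply the stored default only when the caller does not set the flag explicitly
        kwargs.setdefault("normalize_embeddings", self._default_normalize)
        return super().encode(*args, **kwargs)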

Current Workarounds:
To get normalized embeddings today, we need to either:

  1. Subclass SentenceTransformer to add normalization
  2. Apply normalization post-encode
  3. Always include a Normalize module

Example Subclass Workaround

import numpy as np
import torch
import torch.nn.functional as F

from torch import Tensor
from sentence_transformers import SentenceTransformer


class NormalizedSentenceTransformer(SentenceTransformer):
    def encode(self, *args, **kwargs):
        embeddings = super().encode(*args, **kwargs)
        if isinstance(embeddings, np.ndarray):
            # Round-trip through torch to L2-normalize, then convert back to numpy
            embeddings = torch.tensor(embeddings, dtype=torch.float32)
            embeddings = F.normalize(embeddings, p=2, dim=-1)
            return embeddings.numpy()
        elif isinstance(embeddings, list):
            # encode may return a list of 1-D tensors, so normalize along the last dim
            return [F.normalize(embedding, p=2, dim=-1) for embedding in embeddings]
        elif isinstance(embeddings, Tensor):
            return F.normalize(embeddings, p=2, dim=-1)
        else:
            raise ValueError(f"Unsupported type for embeddings: {type(embeddings)}")
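
Usage is then the same as the stock class (checkpoint name is just an example):

model = NormalizedSentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode(["This is a test sentence."])  # already L2-normalized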

Questions for Discussion:

  1. How should this interact with models that already include a Normalize module (a minimal detection sketch follows this list)? Should we:

    • Skip additional normalization if a Normalize module is present?
    • Allow override via the normalize_embeddings parameter?
    • Raise a warning to avoid redundancy?
  2. Should this behavior be configurable in the config_sentence_transformers.json file, like other model options?
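
For question 1, one possible starting point for detecting an existing Normalize module (a sketch only; the checkpoint name is just an example):

from sentence_transformers import SentenceTransformer, models

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # example checkpoint

# SentenceTransformer subclasses nn.Sequential, so the pipeline modules can be iterated;
# has_normalize is True if the loaded pipeline already contains a Normalize module
has_normalize = any(isinstance(module, models.Normalize) for module in model)

With a check like this, the new flag could skip, override, or warn, depending on which behavior is chosen.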

Use Case:
When deploying embedding models to production serving endpoints (e.g., Databricks via MLflow), having normalization as a built-in configurable parameter would:

  1. Eliminate the need to wrap or subclass models.
  2. Reduce the risk of accidentally feeding unnormalized embeddings into similarity or clustering tasks, where it can silently degrade results.
  3. Align with production-friendly design by providing a clean and intuitive API for normalization.

Example Usage

from sentence_transformers import SentenceTransformer

# Proposed API: normalize embeddings for every encode call
model = SentenceTransformer(
    model_name_or_path="jinaai/jina-embeddings-v2-base-code",
    trust_remote_code=True,
    normalize_embeddings=True
)
embeddings = model.encode(["This is a test sentence."])
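
For comparison, and assuming I'm reading the current API correctly, the closest existing behavior is the per-call flag on encode, which has to be repeated at every call site:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jinaai/jina-embeddings-v2-base-code", trust_remote_code=True)

# Today the flag must be passed on every encode call
embeddings = model.encode(["This is a test sentence."], normalize_embeddings=True)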

If this feature makes sense, I'd be happy to contribute a PR and incorporate any guidance or feedback on the request.
