---
id: tsne-visualization
title: t-Distributed Stochastic Neighbor Embedding (t-SNE) Algorithm
sidebar_label: t-SNE Visualization
description: "An overview of t-SNE, a popular technique for visualizing high-dimensional data in two or three dimensions."
tags: [machine learning, data visualization, dimensionality reduction, t-SNE, algorithms]
---

### Definition:
**t-Distributed Stochastic Neighbor Embedding (t-SNE)** is a nonlinear dimensionality reduction technique commonly used for visualizing high-dimensional data. By mapping data points to a lower-dimensional space (typically two or three dimensions), t-SNE preserves the local structure of the data, making patterns and clusters more discernible.

### Characteristics:
- **Nonlinear Dimensionality Reduction**:
Unlike linear techniques such as PCA, t-SNE captures complex, nonlinear relationships between data points, making it suitable for data with intricate structure.

- **Focus on Local Structure**:
t-SNE emphasizes preserving the relative distances of nearby points while de-emphasizing larger pairwise distances. This helps reveal the underlying structure in clusters of data.

### How It Works:
t-SNE minimizes the divergence between two distributions: one that measures pairwise similarities in the original high-dimensional space and another in the lower-dimensional space. The algorithm works as follows (a minimal sketch of the full loop appears after the steps):

1. **Pairwise Similarities**:
Calculate the pairwise similarities between points using a Gaussian distribution in the high-dimensional space.

2. **Low-Dimensional Mapping**:
Initialize the data points randomly in the lower-dimensional space and compute their similarities using a Student’s t-distribution (hence "t-SNE").

3. **Optimization**:
Minimize the Kullback–Leibler divergence between the two similarity distributions using gradient descent.
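
As a concrete illustration of these three steps, here is a minimal NumPy sketch of the core loop. This is a simplification for teaching purposes, not scikit-learn's implementation: it assumes a single fixed `sigma` instead of the per-point perplexity calibration and omits momentum and early exaggeration (the name `tsne_core` and all parameter values are illustrative):

```python
import numpy as np

def pairwise_sq_dists(Z):
    # Squared Euclidean distances between all rows of Z
    s = np.sum(Z ** 2, axis=1)
    return s[:, None] + s[None, :] - 2.0 * Z @ Z.T

def tsne_core(X, n_components=2, sigma=1.0, lr=100.0, n_iter=500, seed=42):
    n = X.shape[0]

    # Step 1: high-dimensional affinities from a Gaussian kernel
    # (full t-SNE tunes a per-point sigma to match a target perplexity)
    P = np.exp(-pairwise_sq_dists(X) / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    P /= P.sum(axis=1, keepdims=True)   # conditional p_{j|i}
    P = (P + P.T) / (2.0 * n)           # symmetrized joint p_{ij}
    P = np.maximum(P, 1e-12)

    # Step 2: random initialization in the low-dimensional space
    rng = np.random.default_rng(seed)
    Y = rng.normal(scale=1e-2, size=(n, n_components))

    # Step 3: gradient descent on KL(P || Q)
    for _ in range(n_iter):
        num = 1.0 / (1.0 + pairwise_sq_dists(Y))   # Student-t kernel
        np.fill_diagonal(num, 0.0)
        Q = np.maximum(num / num.sum(), 1e-12)
        # grad_i = 4 * sum_j (p_ij - q_ij) * num_ij * (y_i - y_j)
        PQ = (P - Q) * num
        Y -= lr * 4.0 * (PQ.sum(axis=1)[:, None] * Y - PQ @ Y)
    return Y
```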

### Problem Statement:
Integrate t-SNE visualization as a feature to aid users in interpreting and analyzing high-dimensional datasets by reducing them to 2D or 3D representations that can reveal clusters, patterns, and anomalies.

### Key Concepts:
- **Perplexity**:
A hyperparameter that balances attention between the local and global structure of the data; it can be read as a smooth measure of the effective number of neighbors each point considers. Typical values range from 5 to 50 (a sketch of how perplexity is matched to a kernel width follows this list).

- **Learning Rate**:
Affects the speed of convergence. Too low a rate can result in poor convergence, while too high a rate can lead to data artifacts.

- **High-Dimensional Similarities**:
Defined using conditional probabilities based on Gaussian kernels.

- **Low-Dimensional Embedding**:
Uses a Student's t-distribution, whose heavy tails counteract the "crowding problem" by allowing moderately distant points to remain well separated in the lower-dimensional space.
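
To make the perplexity knob concrete, here is a minimal sketch of the binary search that t-SNE performs internally: for each point, the Gaussian kernel width is tuned until the entropy of that point's conditional distribution matches the entropy implied by the target perplexity. The function name `calibrate_beta` is illustrative, not a library API:

```python
import numpy as np

def calibrate_beta(sq_dists_i, target_perplexity=30.0, n_steps=50):
    # Binary-search beta = 1 / (2 * sigma_i**2) for one point so that the
    # perplexity 2**H of its conditional distribution matches the target.
    target_entropy = np.log2(target_perplexity)
    lo, hi = 0.0, np.inf
    beta = 1.0
    for _ in range(n_steps):
        # Subtracting the min distance is a numerical-stability trick;
        # the constant factor cancels when p is normalized.
        p = np.exp(-(sq_dists_i - sq_dists_i.min()) * beta)
        p /= p.sum()
        entropy = -np.sum(p * np.log2(p + 1e-12))  # Shannon entropy in bits
        if entropy > target_entropy:
            lo = beta                    # too flat: narrow the kernel
            beta = beta * 2.0 if hi == np.inf else (lo + hi) / 2.0
        else:
            hi = beta                    # too peaked: widen the kernel
            beta = (lo + hi) / 2.0
    return beta

# Usage: squared distances from one point to all others (self excluded)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
sq_d = np.sum((X - X[0]) ** 2, axis=1)[1:]   # point 0 vs. the rest
beta = calibrate_beta(sq_d)
print("calibrated sigma for point 0:", np.sqrt(1.0 / (2.0 * beta)))
```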

### Example Usage:
Consider a dataset containing 1000 samples, each with 50 features:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Example data: a synthetic dataset of 1000 samples with 50 features
# (uniform random data carries no cluster structure, so the plot will show
# a single diffuse blob; structured data would reveal distinct clusters)
X = np.random.rand(1000, 50)

# Apply t-SNE to reduce dimensions to 2D
tsne = TSNE(n_components=2, perplexity=30, learning_rate=200, random_state=42)
X_embedded = tsne.fit_transform(X)

# Plot the 2D t-SNE visualization
plt.figure(figsize=(10, 6))
plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c='blue', alpha=0.6)
plt.title('t-SNE Visualization of High-Dimensional Data')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.show()
```
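
Because the optimization is stochastic and non-convex, fixing `random_state` as above makes the layout reproducible; without it, each run can produce a differently arranged but similarly structured embedding (see Challenges below).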

### Considerations:
- **Computationally Intensive**:
t-SNE can be slow for large datasets due to the pairwise similarity calculations and optimization process. Various optimized implementations (e.g., Barnes-Hut t-SNE) help reduce the runtime.

- **Interpretation**:
While t-SNE is excellent for revealing clusters, the distances between clusters and the apparent sizes of clusters in the embedding are not reliably meaningful; only local, intra-cluster neighbor relationships are faithfully preserved.

- **Preprocessing**:
It’s beneficial to scale the data and, for high-dimensional inputs, apply PCA as an initial reduction step; this improves both the quality of the embedding and the runtime of t-SNE (see the pipeline sketch after this list).
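
As an illustration of that preprocessing advice, a typical scikit-learn pipeline scales the features and applies PCA before t-SNE; the component counts below are common starting points, not tuned values:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X = np.random.rand(1000, 50)  # same shape as the example above

# Standardize features, then reduce to ~30 principal components;
# this denoises the input and speeds up t-SNE's pairwise computations.
X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=30, random_state=42).fit_transform(X_scaled)

tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_embedded = tsne.fit_transform(X_pca)
```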

### Benefits:
- Reveals hidden structures in data that linear methods may miss.
- Suitable for exploring complex datasets such as images, word embeddings, or genomic data.
- Enhances data analysis, pattern recognition, and exploratory data analysis (EDA).

### Challenges:
- Requires careful tuning of hyperparameters like perplexity and learning rate.
- Sensitive to scale; data preprocessing is crucial for optimal results.
- The visualization outcome can vary between runs due to the non-convex optimization.

### Conclusion:
t-SNE has become a powerful tool for visualizing and understanding high-dimensional data, especially in cases where simpler techniques fail to reveal meaningful structures. Integrating t-SNE visualizations into projects provides users with an intuitive way to explore complex datasets, spot clusters, and identify underlying relationships.

---
