-
-
Notifications
You must be signed in to change notification settings - Fork 211
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request #2012 from priyashuu/tsne-visualization
added tsne-visulisations
- Loading branch information
Showing
1 changed file
with
96 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,96 @@ | ||
--- | ||
|
||
id: tsne-visualization | ||
title: t-Distributed Stochastic Neighbor Embedding (t-SNE) Algorithm | ||
sidebar_label: t-SNE Visualization | ||
description: "An overview of t-SNE, a popular technique for visualizing high-dimensional data in two or three dimensions." | ||
tags: [machine learning, data visualization, dimensionality reduction, t-SNE, algorithms] | ||
|
||
--- | ||
|
||
### Definition: | ||
**t-Distributed Stochastic Neighbor Embedding (t-SNE)** is a nonlinear dimensionality reduction technique commonly used for visualizing high-dimensional data. By mapping data points to a lower-dimensional space (typically two or three dimensions), t-SNE preserves the local structure of the data, making patterns and clusters more discernible. | ||
|
||
### Characteristics: | ||
- **Nonlinear Dimensionality Reduction**: | ||
Unlike linear techniques like PCA, t-SNE captures the complex relationships between data points, making it suitable for data with intricate structures. | ||
|
||
- **Focus on Local Structure**: | ||
t-SNE emphasizes preserving the relative distances of nearby points while de-emphasizing larger pairwise distances. This helps reveal the underlying structure in clusters of data. | ||
|
||
### How It Works: | ||
t-SNE minimizes the divergence between two distributions: one that measures pairwise similarities in the original high-dimensional space and another in the lower-dimensional space. The algorithm works as follows: | ||
|
||
1. **Pairwise Similarities**: | ||
Calculate the pairwise similarities between points using a Gaussian distribution in the high-dimensional space. | ||
|
||
2. **Low-Dimensional Mapping**: | ||
Initialize the data points randomly in the lower-dimensional space and compute their similarities using a Student’s t-distribution (hence "t-SNE"). | ||
|
||
3. **Optimization**: | ||
Minimize the Kullback–Leibler divergence between the two similarity distributions using gradient descent. | ||
|
||
### Problem Statement: | ||
Integrate t-SNE visualization as a feature to aid users in interpreting and analyzing high-dimensional datasets by reducing them to 2D or 3D representations that can reveal clusters, patterns, and anomalies. | ||
|
||
### Key Concepts: | ||
- **Perplexity**: | ||
A hyperparameter in t-SNE that balances attention between local and global aspects of the data. Typical values range from 5 to 50. | ||
|
||
- **Learning Rate**: | ||
Affects the speed of convergence. Too low a rate can result in poor convergence, while too high a rate can lead to data artifacts. | ||
|
||
- **High-Dimensional Similarities**: | ||
Defined using conditional probabilities based on Gaussian kernels. | ||
|
||
- **Low-Dimensional Embedding**: | ||
Uses a Student's t-distribution to prevent "crowding" in the lower-dimensional space, where distant points stay separated. | ||
|
||
### Example Usage: | ||
Consider a dataset containing 1000 samples, each with 50 features: | ||
|
||
```python | ||
import numpy as np | ||
import matplotlib.pyplot as plt | ||
from sklearn.manifold import TSNE | ||
|
||
# Example data: synthetic dataset | ||
X = np.random.rand(1000, 50) # 1000 samples, 50 features | ||
|
||
# Apply t-SNE to reduce dimensions to 2D | ||
tsne = TSNE(n_components=2, perplexity=30, learning_rate=200, random_state=42) | ||
X_embedded = tsne.fit_transform(X) | ||
|
||
# Plot the 2D t-SNE visualization | ||
plt.figure(figsize=(10, 6)) | ||
plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c='blue', alpha=0.6) | ||
plt.title('t-SNE Visualization of High-Dimensional Data') | ||
plt.xlabel('t-SNE Component 1') | ||
plt.ylabel('t-SNE Component 2') | ||
plt.show() | ||
``` | ||
|
||
### Considerations: | ||
- **Computationally Intensive**: | ||
t-SNE can be slow for large datasets due to the pairwise similarity calculations and optimization process. Various optimized implementations (e.g., Barnes-Hut t-SNE) help reduce the runtime. | ||
|
||
- **Interpretation**: | ||
While t-SNE is excellent for visualizing clusters, the distances between clusters may not be as meaningful as the intra-cluster distances. | ||
|
||
- **Preprocessing**: | ||
It’s beneficial to scale and preprocess data (e.g., using PCA for initial reduction) to enhance the quality and performance of t-SNE. | ||
|
||
### Benefits: | ||
- Reveals hidden structures in data that linear methods may miss. | ||
- Suitable for exploring complex datasets such as images, word embeddings, or genomic data. | ||
- Enhances data analysis, pattern recognition, and exploratory data analysis (EDA). | ||
|
||
### Challenges: | ||
- Requires careful tuning of hyperparameters like perplexity and learning rate. | ||
- Sensitive to scale; data preprocessing is crucial for optimal results. | ||
- The visualization outcome can vary between runs due to the non-convex optimization. | ||
|
||
### Conclusion: | ||
t-SNE has become a powerful tool for visualizing and understanding high-dimensional data, especially in cases where simpler techniques fail to reveal meaningful structures. Integrating t-SNE visualizations into projects provides users with an intuitive way to explore complex datasets, spot clusters, and identify underlying relationships. | ||
|
||
--- |