OmniLake is a centralized data repository system built to support AI initiatives. It provides a scalable and efficient way to ingest, store, process, and retrieve information, enabling powerful AI-driven querying and analysis.
Features
- Modular, event-driven architecture for scalability and flexibility
- Efficient vector storage and retrieval using LanceDB and S3
- AI-powered data compaction and summarization
- Flexible archive and entry management for organized data storage
- Robust job handling for asynchronous operations
- Semantic search capabilities using vector embeddings
- Integration with Amazon Bedrock for advanced AI functionalities
- Automated vector store management and optimization
Architecture
- Modular design with separate services for API handling, data ingestion, storage management, and response generation
- Event-driven architecture using AWS EventBridge for asynchronous processing
- Serverless implementation leveraging AWS Lambda for scalability and cost-efficiency
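To make the event-driven flow concrete, here is a minimal sketch of one service notifying another through EventBridge; the bus name, source label, and detail-type are illustrative assumptions, not OmniLake's actual event schema:

```python
import json

import boto3

# A producer service publishes a domain event; a subscribed Lambda consumes it.
events = boto3.client("events")

events.put_events(
    Entries=[{
        "EventBusName": "omnilake-events",   # assumed bus name
        "Source": "omnilake.ingestion",      # assumed source label
        "DetailType": "EntryAdded",          # assumed event type
        "Detail": json.dumps({"archive_id": "my_archive", "entry_id": "abc123"}),
    }]
)
```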
Data Ingestion
- Processes incoming data, extracting metadata and insights
- Chunks large text documents for efficient storage and retrieval
- Generates vector embeddings for semantic search capabilities
- Supports various source types (e.g., files, websites, transcripts)
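As an illustration of the chunking step, here is a minimal sketch; the character-based splitting, chunk size, and overlap are assumptions, not OmniLake's actual strategy:

```python
def chunk_text(text: str, max_chars: int = 1000, overlap: int = 100) -> list[str]:
    """Split a long document into overlapping chunks sized for an embedding model."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap  # overlap preserves context across chunk boundaries
    return chunks
```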
Storage
- Uses DynamoDB for metadata storage (archives, entries, jobs, etc.)
- Implements vector storage using LanceDB and S3 for efficient similarity search
- Manages multiple vector stores per archive for optimized performance
- Includes automatic rebalancing of vector stores based on content and usage patterns
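The LanceDB-on-S3 pattern looks roughly like this; the bucket path, table schema, and three-dimensional vectors are placeholders for the example:

```python
import lancedb

# LanceDB can use an S3 prefix as its storage backend (bucket path is a placeholder).
db = lancedb.connect("s3://my-omnilake-bucket/vector_stores/my_archive")

# Each row pairs an embedding with the entry chunk it represents.
table = db.create_table(
    "entries",
    data=[{"vector": [0.1, 0.2, 0.3], "entry_id": "abc123", "text": "sample chunk"}],
)

# Similarity search: find the stored vectors nearest to a query embedding.
results = table.search([0.1, 0.2, 0.25]).limit(5).to_list()
```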
Information Retrieval
- Provides semantic search capabilities using vector embeddings
- Implements a multi-stage compaction process for summarizing large amounts of information
- Generates AI-powered responses to user queries using language models (via Amazon Bedrock)
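The multi-stage compaction can be pictured as repeated batch summarization until a single summary remains; a sketch, with `summarize` standing in for a language-model call via Bedrock:

```python
def compact(passages: list[str], summarize, batch_size: int = 10) -> str:
    """Reduce many passages to one summary by summarizing in successive stages."""
    while len(passages) > 1:
        passages = [
            summarize("\n\n".join(passages[i:i + batch_size]))
            for i in range(0, len(passages), batch_size)
        ]
    return passages[0]
```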
Job Management
- Tracks and manages asynchronous operations throughout the system
- Supports long-running tasks like data ingestion, vector store rebalancing, and response generation
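In practice this means asynchronous calls return a job ID that callers poll until a terminal state; a sketch of the pattern, where `get_job_status` is a hypothetical lookup (e.g., against the DynamoDB jobs table), not a documented OmniLake API:

```python
import time

def wait_for_job(job_id: str, get_job_status, poll_seconds: float = 5.0) -> str:
    """Poll a job until it reaches a terminal state and return that state."""
    while True:
        status = get_job_status(job_id)  # hypothetical lookup of the job record
        if status in ("COMPLETED", "FAILED"):
            return status
        time.sleep(poll_seconds)
```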
API and Client Library
- Offers a REST API for external interactions
- Provides a Python client library for easy integration with other applications
Key Concepts
- Archives: Logical groupings of related data
- Entries: Individual pieces of content within archives
- Sources: Tracking of the origin and provenance of data
- Vector Stores: Efficient storage and retrieval of vector embeddings
- Jobs: Management of asynchronous processing tasks
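Conceptually these nest as follows; a sketch of the data model with illustrative field names, not OmniLake's exact schema:

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    entry_id: str
    content: str
    sources: list[str] = field(default_factory=list)  # provenance of the content

@dataclass
class Archive:
    archive_id: str
    description: str
    entries: list[Entry] = field(default_factory=list)  # related entries grouped together
```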
AI Integration
- Uses Amazon Bedrock for generating vector embeddings
- Leverages large language models for content summarization and response generation
- Implements AI-driven insights extraction from ingested content
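For instance, generating an embedding through Bedrock looks roughly like this; the Titan model ID is an assumption, as the model OmniLake is configured with may differ:

```python
import json

import boto3

bedrock = boto3.client("bedrock-runtime")

# Embed one chunk of text (model ID is an assumption for the example).
response = bedrock.invoke_model(
    modelId="amazon.titan-embed-text-v1",
    body=json.dumps({"inputText": "This is a sample entry."}),
)
embedding = json.loads(response["body"].read())["embedding"]
```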
Scalability and Maintenance
- Includes automated processes for vector store management and optimization
- Implements maintenance modes for archives during large-scale operations
- Provides mechanisms for recalculating and updating metadata and tags
Development and Deployment
- Uses AWS CDK for infrastructure-as-code and deployment
- Implements a development environment setup script (dev.sh) for easy onboarding
- Uses Poetry for Python dependency management
Services
- API Handling: Handles external interactions and manages the public-facing API.
- Data Ingestion: Processes new data entries, extracting metadata and preparing content for storage.
- Storage Management: Manages vector stores and data persistence, including rebalancing and optimization.
- Response Generation: Handles information requests, generating AI-powered responses based on stored data.
Technology Stack
- AWS Services: DynamoDB, S3, EventBridge, Lambda
- Vector Storage: LanceDB
- AI/ML: Amazon Bedrock for embeddings and language model inference
- Infrastructure: AWS CDK for deployment
Prerequisites
- Python 3.12 or higher
- Poetry (Python package manager)
- AWS CLI configured with appropriate credentials
Installation
- Clone the repository:
```bash
git clone https://github.com/your-repo/omnilake.git
cd omnilake
```
- Install dependencies using Poetry:
```bash
poetry install
```
- Set up the development environment:
```bash
./dev.sh
```
This script sets up necessary environment variables and prepares your local development environment.
Usage
Here's a basic example of how to use the OmniLake client library to interact with the system:
```python
from omnilake.client.client import OmniLake
from omnilake.client.request_definitions import AddEntry, CreateArchive, InformationRequest

# Initialize the OmniLake client
omnilake = OmniLake()

# Create a new archive
archive_req = CreateArchive(
    archive_id='my_archive',
    description='My first OmniLake archive'
)
omnilake.create_archive(archive_req)

# Add an entry to the archive
entry_req = AddEntry(
    archive_id='my_archive',
    content='This is a sample entry in my OmniLake archive.',
    sources=['https://example.com/source']
)
result = omnilake.add_entry(entry_req)
print(f"Entry added with ID: {result.response_body['entry_id']}")

# Request information
info_req = InformationRequest(
    archive_id='my_archive',
    goal='Summarize the contents of the archive',
    request='What information is stored in this archive?',
    request_type='INCLUSIVE'
)
response = omnilake.request_information(info_req)
print(f"Information request submitted. Job ID: {response.response_body['job_id']}")
```