
Flare AI Consensus

Flare AI SDK for Consensus Learning.

🚀 Key Features

  • Consensus Learning Implementation: A Python implementation of single-node, multi-model Consensus Learning (CL). CL is a decentralized ensemble learning paradigm introduced in arXiv:2402.16157, which is now being generalized to large language models (LLMs).

  • 300+ LLM Support: Leverages OpenRouter to access over 300 models via a unified interface.

  • Iterative Feedback Loop: Employs an aggregation process in which multiple LLM outputs are refined over configurable iterations (see the sketch below).

  • Modular & Configurable: Easily customize models, conversation prompts, and aggregation parameters through a simple JSON configuration.
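
For intuition, the feedback loop can be pictured as follows. This is a minimal conceptual sketch, not the repository's implementation (see src/flare_ai_consensus/consensus/ for that); send_prompt, run_consensus, and the prompt wording are placeholders.

```python
# Conceptual sketch of the iterative feedback loop (hypothetical helper names;
# the actual implementation lives in src/flare_ai_consensus/consensus/).

def send_prompt(model_id: str, prompt: str) -> str:
    """Placeholder for a chat-completion call routed through OpenRouter."""
    raise NotImplementedError

def run_consensus(models: list[str], aggregator: str, user_prompt: str, iterations: int) -> str:
    # Initial round: every model answers the user prompt independently.
    responses = [send_prompt(m, user_prompt) for m in models]
    for _ in range(iterations):
        # Aggregation step: the aggregator model merges the candidate answers.
        joined = "\n---\n".join(responses)
        aggregated = send_prompt(
            aggregator,
            f"Question: {user_prompt}\n\nCandidate answers:\n{joined}\n\n"
            "Synthesize a single improved answer.",
        )
        # Feedback step: each model refines its answer given the aggregate.
        responses = [
            send_prompt(
                m,
                f"{user_prompt}\n\nAggregated answer so far:\n{aggregated}\n\nImprove it.",
            )
            for m in models
        ]
    # Final aggregation over the last round of responses.
    return send_prompt(aggregator, "Produce the final answer from:\n" + "\n---\n".join(responses))
```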

🎯 Getting Started

Before getting started, ensure you have an OpenRouter API key and either Docker (for the containerized setup) or uv (for the manual setup) installed.

Build & Run Instructions

You can deploy Flare AI Consensus using Docker (recommended) or set up the backend manually.

Environment Setup

  1. Prepare the Environment File: Rename .env.example to .env and update the variables accordingly (e.g. add your OpenRouter API key).
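
A minimal .env might look like the snippet below. OPEN_ROUTER_API_KEY is the variable referenced later in the TEE deployment command; keep any other variables shipped in .env.example as provided.

```bash
# .env (sketch -- keep any additional variables from .env.example)
OPEN_ROUTER_API_KEY=<your-openrouter-api-key>
```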

Build using Docker (Recommended)

  1. Build the Docker Image:

    docker build -t flare-ai-consensus .
  2. Run the Docker Container:

    docker run -p 80:80 -it --env-file .env flare-ai-consensus
  3. Access the Frontend: Open your browser and navigate to http://localhost:80/docs to interact with the Chat UI.

🛠 Build Manually

Flare AI Consensus is a Python-based backend. Follow these steps for manual setup:

  1. Install Dependencies: Use uv to install backend dependencies:

    uv sync --all-extras

    Verify your available credits and list all supported models with:

    uv run python -m tests.credits
    uv run python -m tests.models
  2. Configure the CL instance: Define your CL instance in src/input.json (an illustrative sketch follows these steps), including:

    • Models: Specify each LLM's OpenRouter id, along with parameters like max_tokens and temperature.
    • Aggregator Settings: Define the aggregator model, additional context, aggregation prompt, and specify how aggregated responses are handled.
    • Iterations: Determine the number of iterations for the feedback loop.
  3. Start the Backend: The backend runs by default on 0.0.0.0:8080:

    uv run start-backend
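
For illustration only, src/input.json could be structured along the lines below. The exact field names are defined by the repository's parser (see utils/parser_utils.py), so treat this as a sketch rather than the canonical schema; the model IDs and values are placeholders.

```json
{
  "models": [
    { "id": "google/learnlm-1.5-pro-experimental:free", "max_tokens": 500, "temperature": 0.7 },
    { "id": "<another-openrouter-model-id>", "max_tokens": 500, "temperature": 0.7 }
  ],
  "aggregator": {
    "model": { "id": "<aggregator-openrouter-model-id>", "max_tokens": 600, "temperature": 0.3 },
    "context": "You combine several candidate answers into a single response.",
    "prompt": "Aggregate the following responses into one improved answer."
  },
  "iterations": 2
}
```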

Testing Endpoints

For granular testing, use the following endpoints:

  • Completion Endpoint (Non-Conversational):

    uv run python -m tests.completion --prompt "Who is Ash Ketchum?" --model "google/learnlm-1.5-pro-experimental:free"
  • Chat Completion Endpoint (Conversational):

    uv run python -m tests.chat_completion --mode default

    Tip: In interactive mode, type exit to quit.
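
Once the backend is running, you can also exercise the HTTP API directly, either via the interactive docs (e.g. http://localhost:8080/docs when running the backend manually) or with curl. The route and payload below are hypothetical placeholders; check the definitions under src/flare_ai_consensus/api/routes/ for the actual endpoints.

```bash
# Hypothetical route and payload -- verify against src/flare_ai_consensus/api/routes/.
curl -X POST http://localhost:8080/api/routes/chat/ \
  -H "Content-Type: application/json" \
  -d '{"message": "Who is Ash Ketchum?"}'
```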

📁 Repo Structure

src/flare_ai_consensus/
├── attestation/           # TEE attestation implementation
│   ├── simulated_token.txt
│   ├── vtpm_attestation.py
│   └── vtpm_validation.py
├── api/                   # API layer
│   ├── middleware/        # Request/response middleware
│   └── routes/            # API endpoint definitions
├── consensus/             # Core consensus learning
│   ├── aggregator.py      # Response aggregation
│   └── consensus.py       # Main CL implementation
├── router/                # API routing and model access
│   ├── base_router.py     # Base routing interface
│   └── openrouter.py      # OpenRouter implementation
├── utils/                 # Utility functions
│   ├── file_utils.py      # File operations
│   └── parser_utils.py    # Input parsing
├── input.json             # Configuration file
├── main.py                # Application entry
└── settings.py            # Environment settings

🚀 Deploy on TEE

Deploy on a Confidential Space using AMD SEV.

Prerequisites

You will need the gcloud CLI installed and authenticated, with access to the verifiable-ai-hackathon Google Cloud project used in the commands below.

Environment Configuration

  1. Set Environment Variables: Update your .env file with:

    TEE_IMAGE_REFERENCE=ghcr.io/flare-research/flare-ai-consensus:main  # Replace with your repo build image
    INSTANCE_NAME=<PROJECT_NAME-TEAM_NAME>
  2. Load Environment Variables:

    source .env

    Reminder: Run the above command in every new shell session. On Windows, we recommend using Git Bash to access commands like source.

  3. Verify the Setup:

    echo $TEE_IMAGE_REFERENCE # Expected output: Your repo build image

Deploying to Confidential Space

Run the following command:

gcloud compute instances create $INSTANCE_NAME \
  --project=verifiable-ai-hackathon \
  --zone=us-central1-c \
  --machine-type=n2d-standard-2 \
  --network-interface=network-tier=PREMIUM,nic-type=GVNIC,stack-type=IPV4_ONLY,subnet=default \
  --metadata=tee-image-reference=$TEE_IMAGE_REFERENCE,\
tee-container-log-redirect=true,\
tee-env-OPEN_ROUTER_API_KEY=$OPEN_ROUTER_API_KEY \
  --maintenance-policy=MIGRATE \
  --provisioning-model=STANDARD \
  --service-account=confidential-sa@verifiable-ai-hackathon.iam.gserviceaccount.com \
  --scopes=https://www.googleapis.com/auth/cloud-platform \
  --min-cpu-platform="AMD Milan" \
  --tags=flare-ai,http-server,https-server \
  --create-disk=auto-delete=yes,\
boot=yes,\
device-name=$INSTANCE_NAME,\
image=projects/confidential-space-images/global/images/confidential-space-debug-250100,\
mode=rw,\
size=11,\
type=pd-standard \
  --shielded-secure-boot \
  --shielded-vtpm \
  --shielded-integrity-monitoring \
  --reservation-affinity=any \
  --confidential-compute-type=SEV

Post-deployment

  1. After deployment, you should see an output similar to:

    NAME             ZONE           MACHINE_TYPE    PREEMPTIBLE  INTERNAL_IP  EXTERNAL_IP    STATUS
    consensus-team1  us-central1-c  n2d-standard-2               10.128.0.18  34.41.127.200  RUNNING
    
  2. It may take a few minutes for Confidential Space to complete startup checks. You can monitor progress via the GCP Console logs. Click on Compute Engine → VM Instances (in the sidebar) → Select your instance → Serial port 1 (console).

    When you see a message like:

    INFO:     Uvicorn running on http://0.0.0.0:80 (Press CTRL+C to quit)
    

    the container is ready. Navigate to the external IP of the instance (visible in the VM Instances page) to access the docs (<IP>:80/docs).

🔧 Troubleshooting

If you encounter issues, follow these steps:

  1. Check Logs:

    gcloud compute instances get-serial-port-output $INSTANCE_NAME --project=verifiable-ai-hackathon
  2. Verify API Key(s): Ensure that all API Keys are set correctly (e.g. OPEN_ROUTER_API_KEY).

  3. Check Firewall Settings: Confirm that your instance is publicly accessible on port 80.
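
For example, you can inspect the project's firewall rules and, if you have the required permissions, add a rule allowing inbound TCP on port 80 for the http-server tag used in the deployment command above; the rule name below is illustrative.

```bash
# List existing firewall rules for the project.
gcloud compute firewall-rules list --project=verifiable-ai-hackathon

# Illustrative rule allowing HTTP traffic to instances tagged http-server.
gcloud compute firewall-rules create allow-http-80 \
  --project=verifiable-ai-hackathon \
  --direction=INGRESS \
  --allow=tcp:80 \
  --target-tags=http-server \
  --source-ranges=0.0.0.0/0
```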

💡 Next Steps

  • Security & TEE Integration:
    • Ensure execution within a Trusted Execution Environment (TEE) to maintain confidentiality and integrity.
  • Factual Correctness:
    • In line with the main theme of the hackathon, one important aspect of the outputs generated by the LLMs is their accuracy. In this regard, producing sources/citations with the answers would lead to higher trust in the setup. Sample prompts that can be used for this purpose can be found in the appendices of arXiv:2305.14627, or in James' Coffee Blog.
    • Note: only certain models may be suitable for this purpose, as references generated by LLMs are often inaccurate or not even real!
  • Prompt Engineering:
    • Our approach is very similar to the Mixture-of-Agents (MoA) introduced in arXiv:2406.04692, which uses iterative aggregations of model responses. Their GitHub repository includes other examples of prompts that can be used to provide additional context to the LLMs.
    • New iterations of the consensus learning algorithm could use different prompts for improving the previous responses. In this regard, the few-shot prompting techniques introduced by OpenAI in arXiv:2005.14165 work by providing models with a few examples of similar queries and responses in addition to the initial prompt. (See also previous work by Radford et al.)
    • Chain-of-Thought prompting is a linear problem-solving approach where each step builds upon the previous one. Google's approach in arXiv:2201.11903 augments each prompt with an additional example and a chain of thought for an associated answer. (See the paper for multiple examples.)
  • Dynamic resource allocation and Semantic Filters:
    • An immediate improvement to the current approach would be to use dynamically adjusted parameters. Namely, the number of iterations and the number of models used in the algorithm could be adapted to the input prompt: simple prompts, for example, do not require many resources. For this, a centralized model could be used to assess the complexity of the task before the prompt is sent to the other LLMs.
    • On a similar note, the number of iterations could be adjusted according to how different the model responses are. Semantic entailment for LLM outputs is an active field of research, but a rather quick solution is to rely on embeddings. These are commonly used in RAG pipelines and could also be used here with e.g. cosine similarity (a minimal sketch is included at the end of this section). You can get started with GCloud's text embeddings -- see flare-ai-rag for more details.
    • The use of LLM-as-a-Judge for evaluating other LLM outputs has shown good progress -- see also this Confident AI blogpost.
    • In line with the previously mentioned LLM-as-a-Judge, a model could potentially be used for filtering bad responses. LLM-Blender, for instance, introduced in arXiv:2306.02561, uses a PairRanker that achieves a ranking of outputs through pairwise comparisons via a cross-attention encoder.
  • AI Agent Swarm:
    • The structure of the reference CL implementation can be changed to accommodate swarm-type algorithms, where tasks are broken down and distributed among specialized agents for parallel processing. In this case, a centralized LLM would act as an orchestrator, managing the distribution of tasks -- see e.g. the swarms repo.
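
As a starting point for the semantic-filter idea above, the sketch below checks whether a round of model responses has converged by comparing embedding vectors with cosine similarity. It assumes an embed() callable (e.g. backed by GCloud's text embeddings, as in flare-ai-rag) and uses a hypothetical threshold value.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def responses_converged(responses: list[str], embed, threshold: float = 0.9) -> bool:
    """Return True once every pair of responses is sufficiently similar.

    `embed` is assumed to map a string to a 1-D numpy array; `threshold`
    is a placeholder to tune for your choice of models and prompts.
    """
    vectors = [embed(r) for r in responses]
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine_similarity(vectors[i], vectors[j]) < threshold:
                return False
    return True
```

A consensus run could stop iterating, or flag an outlier response for filtering, as soon as this check passes.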