Thinking Dataset: Leveraging Real-World Data for Strategic Business Insights

Overview

Thinking Dataset leverages real-world data and efficient pipelines for strategic business insights. It integrates robust data workflows (PDF-to-SQL ingestion), test-driven development, and a streamlined SQLite design. A multi-stage AI pipeline then generates STaR case studies that turn that data into actionable strategies.

Features

  • πŸ”„ End-to-End Pipeline: Download, process, transform, and store large datasets.
  • πŸ’Ύ SQLite Backend: Lightweight, fast, and easy to manage, with optional Parquet.
  • βœ… Comprehensive Testing: Thorough TDD coverage and validation checks.
  • πŸ–₯️ Flexible CLI: Modular Click commands for quick execution of tasks.
  • πŸ”€ Data Transformation: Granular pipes for cleaning, merging, and deriving data.
  • πŸ“š STaR Case Studies: Generate synthetic scenarios alongside real data for deeper insights.
  • ⚑ Parallel Execution: Efficiently process big data with optional concurrency.
  • πŸ”’ Secure Config: Manage credentials and environment variables discreetly with .env files.
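The granular-pipe model behind the Data Transformation feature can be sketched as plain functions composed in order. Everything below (the record shape, the pipe names, the runner) is illustrative, not the project's actual API:

```python
# Hypothetical sketch of granular pipes: each pipe takes and returns a list of
# records, so cleaning, merging, and deriving steps compose freely.
from typing import Callable

Record = dict
Pipe = Callable[[list[Record]], list[Record]]

def clean(rows: list[Record]) -> list[Record]:
    # Drop rows with empty text and strip stray whitespace.
    return [{**r, "text": r["text"].strip()} for r in rows if r.get("text", "").strip()]

def derive_length(rows: list[Record]) -> list[Record]:
    # Derive a new column from existing data.
    return [{**r, "length": len(r["text"])} for r in rows]

def run_pipeline(rows: list[Record], pipes: list[Pipe]) -> list[Record]:
    for pipe in pipes:
        rows = pipe(rows)
    return rows

cleaned = run_pipeline([{"text": "  hello "}, {"text": ""}], [clean, derive_length])
```

Real pipes would operate on data frames and be configured from config/config.yaml, but the composition idea is the same.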

Quick Start

Prerequisites

  • Python 3.12 or later
  • Git
  • A cloud-based account (e.g., Hugging Face) for remote processing, a GPU (RTX 3090 or better) for local processing, or both

Setup

  1. Clone the repository:

    git clone https://github.com/MultiTonic/thinking-dataset.git
    cd thinking-dataset
  2. Install uv package manager:

    First, install the package into the global environment:

    pip install uv

    Then add the uv tools directory to your PATH*:

    uv tool update-shell
  3. Set up the project:

    uv run setup

    *You may need to restart your terminal session for the changes to update.

This will create a virtual environment, install the project dependencies, and activate the virtual environment.

  4. Set up environment variables:

    Copy the .env.sample file to .env and change the values as needed:

    cp .env.sample .env

    Update the .env file with your credentials:

    # Required settings
    HF_ORG="my_huggingface_organization"
    HF_USER="my_huggingface_username"
    HF_READ_TOKEN="my_huggingface_read_access_token"
    HF_WRITE_TOKEN="my_huggingface_write_access_token"
    
    # Required configuration
    CONFIG_PATH="config/config.yaml"
    
    # One or more providers
    OLLAMA_SERVER_URL="http://localhost:11434"
    OPENAI_API_TOKEN="your_openai_api_token"
    RUNPOD_API_TOKEN="your_runpod_api_token"
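Since missing credentials are a common setup failure, it can help to validate the required settings at startup. A minimal stdlib-only sketch (the variable names come from the sample above; the helper itself is hypothetical, not the project's actual code):

```python
import os

# Variable names taken from the sample .env above; the helper is a sketch,
# not the project's actual startup logic.
REQUIRED = ["HF_ORG", "HF_USER", "HF_READ_TOKEN", "HF_WRITE_TOKEN", "CONFIG_PATH"]

def check_env(required: list[str] = REQUIRED) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not os.environ.get(name)]

if __name__ == "__main__":
    missing = check_env()
    if missing:
        print(f"Missing required settings in .env: {', '.join(missing)}")
```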

Usage

For complete usage instructions and examples, see the Usage Guide.

Running the Download Command

To download all Parquet files from the Cablegate dataset hosted on the Hugging Face Hub:

thinking-dataset download
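Conceptually, this fetches the dataset's Parquet shards from the Hugging Face Hub. A hedged sketch of that idea using huggingface_hub (the repo id, helper names, and local path are assumptions, not the project's actual internals):

```python
from fnmatch import fnmatch

def parquet_only(filenames: list[str]) -> list[str]:
    """Filter a repository file listing down to its Parquet shards."""
    return [f for f in filenames if fnmatch(f, "*.parquet")]

def download_dataset(repo_id: str, local_dir: str) -> None:
    # Requires `huggingface_hub`; not invoked here to avoid network access.
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",
        allow_patterns=["*.parquet"],  # skip everything but the Parquet files
        local_dir=local_dir,
    )
```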

Running All CLI Commands

To execute all CLI commands for the project:

python assets/scripts/run_cli_commands.py
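A script like that typically just invokes each CLI command in sequence and stops at the first failure; a minimal sketch (the command list and order are assumed, not the project's actual sequence):

```python
import subprocess

# Hypothetical command order; the real script defines its own sequence.
COMMANDS = ["download", "process", "load", "generate", "export", "upload", "clean"]

def build_invocations(commands: list[str]) -> list[list[str]]:
    """Build one `thinking-dataset <command>` argv per CLI command."""
    return [["thinking-dataset", cmd] for cmd in commands]

def run_all() -> None:
    for argv in build_invocations(COMMANDS):
        subprocess.run(argv, check=True)  # raises on the first failing command
```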

Project Structure

The following directory structure provides an overview of how the project is organized:

thinking-dataset/
β”œβ”€β”€ config/                 # Configuration files
β”œβ”€β”€ assets/                 # Assets directory for external resources
β”‚   β”œβ”€β”€ prompts/            # Prompt templates
β”‚   β”œβ”€β”€ scripts/            # Utility scripts
β”‚   β”œβ”€β”€ resources/          # External project data
β”‚   β”œβ”€β”€ templates/          # JSON prompt templates
β”œβ”€β”€ data/                   # Data directory
β”œβ”€β”€ docs/                   # Project documentation
β”œβ”€β”€ reports/                # Generated reports
β”œβ”€β”€ tests/                  # Test files
β”œβ”€β”€ thinking_dataset/       # Core project code
β”‚   β”œβ”€β”€ commands/           # CLI command implementations
β”‚   β”œβ”€β”€ connectors/         # Data connectors
β”‚   β”œβ”€β”€ config/             # Configuration loaders and management
β”‚   β”œβ”€β”€ datasets/           # Dataset definitions and processing
β”‚   β”‚   β”œβ”€β”€ operations/     # Data operations and transformations
β”‚   β”œβ”€β”€ db/                 # Database support
β”‚   β”‚   β”œβ”€β”€ operations/     # Database operations and transactions
β”‚   β”œβ”€β”€ dto/                # Data Transfer Objects (DTO)
β”‚   β”œβ”€β”€ io/                 # File I/O operations
β”‚   β”œβ”€β”€ pipeworks/          # Pipelines and pipes for data processing
β”‚   β”‚   β”œβ”€β”€ pipelines/      # Pipeline management and control
β”‚   β”‚   β”œβ”€β”€ pipes/          # Pipes used for data frame processing
β”‚   β”œβ”€β”€ providers/          # AI data providers
β”‚   β”œβ”€β”€ tonics/             # Data utility functions and helpers
β”‚   β”œβ”€β”€ utils/              # General-purpose utility helpers
β”‚   └── main.py             # Main execution file
β”œβ”€β”€ setup.py                # Project setup
└── .env                    # Private environment variables file
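To illustrate the role of the db/ layer above, here is a stdlib-only sketch of how ingested documents might land in the SQLite backend (the table schema and column names are assumptions, not the project's actual design):

```python
import sqlite3

def init_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the SQLite database and ensure the documents table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "  id INTEGER PRIMARY KEY,"
        "  source TEXT NOT NULL,"
        "  text TEXT NOT NULL)"
    )
    return conn

def insert_documents(conn: sqlite3.Connection, rows: list[dict]) -> None:
    # Named-parameter substitution keeps the inserts safe and fast in bulk.
    conn.executemany(
        "INSERT INTO documents (source, text) VALUES (:source, :text)", rows
    )
    conn.commit()

conn = init_db()
insert_documents(conn, [{"source": "cablegate", "text": "example cable"}])
count = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
```

A lightweight schema like this is what makes the SQLite backend easy to manage while still supporting an optional Parquet export.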

Contributing

Contributions are welcome! Fork the repository, make your changes, and create a pull request. Ensure your code follows the project's standards and includes tests. See Contributing for guidelines.

License

This dataset is licensed under the MIT License.

Citations

Please use the following BibTeX entry to cite this dataset:

@software{thinking-dataset,
  author = {Rawson, Kara and Pollack, Joseph and others},
  title = {Thinking-Dataset: Leveraging Real-World Data for Strategic Business Insights and STaR Case Study Generation},
  year = {2025},
  howpublished = {\url{https://github.com/MultiTonic/thinking-dataset}},
  note = {Accessed: 2025-01-25}
}

Acknowledgements

Special thanks to our contributors:

  • Kara Rawson - Lead Engineer
  • Joseph Pollack - Creator & Business Leader
  • MultiTonic Team - Support and Collaboration
  • Hugging Face - Robust tools and infrastructure for dataset management

Contact

For questions or support, please contact us at: