- Overview
- Features
- Installation
- Usage
- Quick Start
- Project Structure
- Contributing
- Resources
- License
- Citations
- Acknowledgements
- Contact
Thinking Dataset leverages real-world data and efficient pipelines for strategic business insights. It integrates robust data workflows (PDF-to-SQL ingestion), test-driven development, and a streamlined SQLite design, while a multi-stage AI pipeline generates STaR case studies alongside real data to produce actionable strategies.
- End-to-End Pipeline: Download, process, transform, and store large datasets.
- SQLite Backend: Lightweight, fast, and easy to manage, with optional Parquet export.
- Comprehensive Testing: Thorough TDD coverage and validation checks.
- Flexible CLI: Modular Click commands for quick execution of tasks (a minimal sketch follows this list).
- Data Transformation: Granular pipes for cleaning, merging, and deriving data.
- STaR Case Studies: Generate synthetic scenarios alongside real data for deeper insights.
- Parallel Execution: Process large datasets efficiently with optional concurrency.
- Secure Config: Manage environment variables discreetly with .env files.
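
As a rough illustration of the CLI style described above, here is a minimal, self-contained Click command group; the command name, option, and body are placeholders rather than the project's actual implementation:

```python
# Illustrative sketch only: a minimal Click command group in the style the
# Features list describes. The names and behavior below are placeholders,
# not the project's real commands.
import click


@click.group()
def cli():
    """Entry point that groups the dataset commands."""


@cli.command()
@click.option("--dataset", default="cablegate", help="Name of the dataset to fetch.")
def download(dataset):
    """Pretend to download the requested dataset."""
    click.echo(f"Downloading dataset: {dataset}")


if __name__ == "__main__":
    cli()
```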
- Python 3.12 or later
- Git
- A cloud account (e.g., Hugging Face) for hosted processing, a local GPU (RTX 3090 or better), or both
- Clone the repository:

  ```bash
  git clone https://github.com/MultiTonic/thinking-dataset.git
  cd thinking-dataset
  ```
- Install the uv package manager. First install it into the global environment:

  ```bash
  pip install uv
  ```

  Then add the uv tools directory to your PATH*:

  ```bash
  uv tool update-shell
  ```
- Set up the project:

  ```bash
  uv run setup
  ```

  This creates a virtual environment, installs the project dependencies, and activates the environment.

  *You may need to restart your terminal session for the PATH changes to take effect.
- Set up environment variables:

  Copy the `.env.sample` file to `.env` and change the values as needed:

  ```bash
  cp .env.sample .env
  ```

  Update the `.env` file with your credentials:

  ```bash
  # Required settings
  HF_ORG="my_huggingface_organization"
  HF_USER="my_huggingface_username"
  HF_READ_TOKEN="my_huggingface_read_access_token"
  HF_WRITE_TOKEN="my_huggingface_write_access_token"

  # Required configuration
  CONFIG_PATH="config/config.yaml"

  # One or more providers
  OLLAMA_SERVER_URL="http://localhost:11434"
  OPENAI_API_TOKEN="your_openai_api_token"
  RUNPOD_API_TOKEN="your_runpod_api_token"
  ```
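
To confirm the variables are visible to Python before running anything heavier, a minimal check like the following can help; it assumes the python-dotenv package, and the project's own configuration loaders may work differently:

```python
# Minimal sanity check: load .env and report whether the required settings are set.
# Assumes python-dotenv is installed; the project's own config loaders may differ.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

required = ("HF_ORG", "HF_USER", "HF_READ_TOKEN", "HF_WRITE_TOKEN", "CONFIG_PATH")
for key in required:
    print(f"{key}: {'set' if os.getenv(key) else 'MISSING'}")
```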
For complete usage instructions and examples, see the Usage Guide.
To download all parquet files from the Cablegate dataset using the Hugging Face CLI:

```bash
thinking-dataset download
```
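
For a sense of what such a download involves, here is a rough, standalone sketch using the huggingface_hub library; the repository id is a placeholder, and the CLI command above remains the supported path:

```python
# Standalone sketch: fetch only the parquet files from a Hugging Face dataset repo.
# The repo_id is a placeholder, not necessarily the dataset this project targets.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="your-org/cablegate",    # placeholder dataset repository
    repo_type="dataset",
    allow_patterns=["*.parquet"],    # skip everything except parquet files
)
print(f"Parquet files downloaded to: {local_dir}")
```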
To execute all CLI commands for the project:

```bash
python assets/scripts/run_cli_commands.py
```
The following directory structure provides an overview of how the project is organized:
```
thinking-dataset/
├── config/                  # Configuration files
├── assets/                  # Assets directory for external resources
│   ├── prompts/             # Prompt templates
│   ├── scripts/             # Utility scripts
│   ├── resources/           # External project data
│   └── templates/           # JSON prompt templates
├── data/                    # Data directory
├── docs/                    # Project documentation
├── reports/                 # Generated reports
├── tests/                   # Test files
├── thinking_dataset/        # Core project code
│   ├── commands/            # CLI command implementations
│   ├── connectors/          # Data connectors
│   ├── config/              # Configuration loaders and management
│   ├── datasets/            # Dataset definitions and processing
│   │   └── operations/      # Data operations and transformations
│   ├── db/                  # Database support
│   │   └── operations/      # Database operations and transactions
│   ├── dto/                 # Data Transfer Objects (DTO)
│   ├── io/                  # File I/O operations
│   ├── pipeworks/           # Pipelines and pipes for data processing
│   │   ├── pipelines/       # Pipeline management and control
│   │   └── pipes/           # Pipes used for data frame processing
│   ├── providers/           # AI data providers
│   ├── tonics/              # Data utility functions and helpers
│   ├── utils/               # General-purpose utility helpers
│   └── main.py              # Main execution file
├── setup.py                 # Project setup
└── .env                     # Private environment variables file
```
Contributions are welcome! Fork the repository, make your changes, and create a pull request. Ensure your code follows the project's standards and includes tests. See Contributing for guidelines.
This dataset is licensed under the MIT License.
Please use the following BibTeX entry to cite this dataset:
```bibtex
@software{thinking-dataset,
  author       = {Kara Rawson and Joseph Pollack and others},
  title        = {Thinking-Dataset: Leveraging Real-World Data for Strategic Business Insights and STaR Case Study Generation},
  year         = {2025},
  howpublished = {\url{https://github.com/MultiTonic/thinking-dataset}},
  note         = {Accessed: 2025-01-25}
}
```
Special thanks to our contributors:
- Kara Rawson - Lead Engineer
- Joseph Pollack - Creator & Business Leader
- MultiTonic Team - Support and Collaboration
- Hugging Face - Robust tools and infrastructure for dataset management
For questions or support, please contact us at:
- Email: [email protected]
- Discord: Join our Discord