Thinking Dataset: Leveraging Real-World Data for Strategic Business Insights

Overview

Thinking Dataset leverages real-world data and efficient pipelines for strategic business insights. It integrates robust data workflows (PDF-to-SQL ingestion), test-driven development, and a streamlined SQLite design. A multi-stage AI pipeline then generates STaR case studies that turn that data into actionable strategies.

Features

  • πŸ”„ End-to-End Pipeline: Download, process, transform, and store large datasets.
  • πŸ’Ύ SQLite Backend: Lightweight, fast, and easy to manage, with optional Parquet.
  • βœ… Comprehensive Testing: Thorough TDD coverage and validation checks.
  • πŸ–₯️ Flexible CLI: Modular Click commands for quick execution of tasks.
  • πŸ”€ Data Transformation: Granular pipes for cleaning, merging, and deriving data.
  • πŸ“š STaR Case Studies: Generate synthetic scenarios alongside real data for deeper insights.
  • ⚑ Parallel Execution: Efficiently process big data with optional concurrency.
  • πŸ”’ Secure Config: Manage credentials and environment variables discreetly with .env files.
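The granular-pipe model behind the Data Transformation feature can be sketched as plain functions composed in order. Everything below (the record shape, the pipe names, the runner) is illustrative, not the project's actual API:

```python
# Hypothetical sketch of granular pipes: each pipe takes and returns a list of
# records, so cleaning, merging, and deriving steps compose freely.
from typing import Callable

Record = dict
Pipe = Callable[[list[Record]], list[Record]]

def clean(rows: list[Record]) -> list[Record]:
    # Drop rows with empty text and strip stray whitespace.
    return [{**r, "text": r["text"].strip()} for r in rows if r.get("text", "").strip()]

def derive_length(rows: list[Record]) -> list[Record]:
    # Derive a new column from existing data.
    return [{**r, "length": len(r["text"])} for r in rows]

def run_pipeline(rows: list[Record], pipes: list[Pipe]) -> list[Record]:
    for pipe in pipes:
        rows = pipe(rows)
    return rows

cleaned = run_pipeline([{"text": "  hello "}, {"text": ""}], [clean, derive_length])
```

Real pipes would operate on data frames and be configured from config/config.yaml, but the composition idea is the same.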

Quick Start

Prerequisites

  • Python 3.12 or later
  • Git
  • A cloud-based account (e.g., Hugging Face) for remote processing, a GPU (RTX 3090 or better) for local processing, or both

Setup

  1. Clone the repository:

    git clone https://github.com/MultiTonic/thinking-dataset.git
    cd thinking-dataset
  2. Install uv package manager:

    First, install the package into the global environment:

    pip install uv

    Then add the uv tools directory to your PATH*:

    uv tool update-shell
  3. Set up the project:

    uv run setup

    *You may need to restart your terminal session for the changes to update.

This will create a virtual environment, install the project dependencies, and activate the virtual environment.

  4. Set up environment variables:

    Copy the .env.sample file to .env and change the values as needed:

    cp .env.sample .env

    Update the .env file with your credentials:

    # Required settings
    HF_ORG="my_huggingface_organization"
    HF_USER="my_huggingface_username"
    HF_READ_TOKEN="my_huggingface_read_access_token"
    HF_WRITE_TOKEN="my_huggingface_write_access_token"
    
    # Required configuration
    CONFIG_PATH="config/config.yaml"
    
    # One or more providers
    OLLAMA_SERVER_URL="http://localhost:11434"
    OPENAI_API_TOKEN="your_openai_api_token"
    RUNPOD_API_TOKEN="your_runpod_api_token"
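Since missing credentials are a common setup failure, it can help to validate the required settings at startup. A minimal stdlib-only sketch (the variable names come from the sample above; the helper itself is hypothetical, not the project's actual code):

```python
import os

# Variable names taken from the sample .env above; the helper is a sketch,
# not the project's actual startup logic.
REQUIRED = ["HF_ORG", "HF_USER", "HF_READ_TOKEN", "HF_WRITE_TOKEN", "CONFIG_PATH"]

def check_env(required: list[str] = REQUIRED) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not os.environ.get(name)]

if __name__ == "__main__":
    missing = check_env()
    if missing:
        print(f"Missing required settings in .env: {', '.join(missing)}")
```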

Usage

For complete usage instructions and examples, see the Usage Guide.

Running the Download Command

To download all Parquet files from the Cablegate dataset hosted on the Hugging Face Hub:

thinking-dataset download
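Conceptually, this fetches the dataset's Parquet shards from the Hugging Face Hub. A hedged sketch of that idea using huggingface_hub (the repo id, helper names, and local path are assumptions, not the project's actual internals):

```python
from fnmatch import fnmatch

def parquet_only(filenames: list[str]) -> list[str]:
    """Filter a repository file listing down to its Parquet shards."""
    return [f for f in filenames if fnmatch(f, "*.parquet")]

def download_dataset(repo_id: str, local_dir: str) -> None:
    # Requires `huggingface_hub`; not invoked here to avoid network access.
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",
        allow_patterns=["*.parquet"],  # skip everything but the Parquet files
        local_dir=local_dir,
    )
```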

Running All CLI Commands

To execute all CLI commands for the project:

python assets/scripts/run_cli_commands.py
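A script like that typically just invokes each CLI command in sequence and stops at the first failure; a minimal sketch (the command list and order are assumed, not the project's actual sequence):

```python
import subprocess

# Hypothetical command order; the real script defines its own sequence.
COMMANDS = ["download", "process", "load", "generate", "export", "upload", "clean"]

def build_invocations(commands: list[str]) -> list[list[str]]:
    """Build one `thinking-dataset <command>` argv per CLI command."""
    return [["thinking-dataset", cmd] for cmd in commands]

def run_all() -> None:
    for argv in build_invocations(COMMANDS):
        subprocess.run(argv, check=True)  # raises on the first failing command
```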

Project Structure

The following directory structure provides an overview of how the project is organized:

thinking-dataset/
β”œβ”€β”€ config/                 # Configuration files
β”œβ”€β”€ assets/                 # Assets directory for external resources
β”‚   β”œβ”€β”€ prompts/            # Prompt templates
β”‚   β”œβ”€β”€ scripts/            # Utility scripts
β”‚   β”œβ”€β”€ resources/          # External project data
β”‚   β”œβ”€β”€ templates/          # JSON prompt templates
β”œβ”€β”€ data/                   # Data directory
β”œβ”€β”€ docs/                   # Project documentation
β”œβ”€β”€ reports/                # Generated reports
β”œβ”€β”€ tests/                  # Test files
β”œβ”€β”€ thinking_dataset/       # Core project code
β”‚   β”œβ”€β”€ commands/           # CLI command implementations
β”‚   β”œβ”€β”€ connectors/         # Data connectors
β”‚   β”œβ”€β”€ config/             # Configuration loaders and management
β”‚   β”œβ”€β”€ datasets/           # Dataset definitions and processing
β”‚   β”‚   β”œβ”€β”€ operations/     # Data operations and transformations
β”‚   β”œβ”€β”€ db/                 # Database support
β”‚   β”‚   β”œβ”€β”€ operations/     # Database operations and transactions
β”‚   β”œβ”€β”€ dto/                # Data Transfer Objects (DTO)
β”‚   β”œβ”€β”€ io/                 # File I/O operations
β”‚   β”œβ”€β”€ pipeworks/          # Pipelines and pipes for data processing
β”‚   β”‚   β”œβ”€β”€ pipelines/      # Pipeline management and control
β”‚   β”‚   β”œβ”€β”€ pipes/          # Pipes used for data frame processing
β”‚   β”œβ”€β”€ providers/          # AI data providers
β”‚   β”œβ”€β”€ tonics/             # Data utility functions and helpers
β”‚   β”œβ”€β”€ utils/              # General-purpose utility helpers
β”‚   └── main.py             # Main execution file
β”œβ”€β”€ setup.py                # Project setup
└── .env                    # Private environment variables file
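To illustrate the role of the db/ layer above, here is a stdlib-only sketch of how ingested documents might land in the SQLite backend (the table schema and column names are assumptions, not the project's actual design):

```python
import sqlite3

def init_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the SQLite database and ensure the documents table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "  id INTEGER PRIMARY KEY,"
        "  source TEXT NOT NULL,"
        "  text TEXT NOT NULL)"
    )
    return conn

def insert_documents(conn: sqlite3.Connection, rows: list[dict]) -> None:
    # Named-parameter substitution keeps the inserts safe and fast in bulk.
    conn.executemany(
        "INSERT INTO documents (source, text) VALUES (:source, :text)", rows
    )
    conn.commit()

conn = init_db()
insert_documents(conn, [{"source": "cablegate", "text": "example cable"}])
count = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
```

A lightweight schema like this is what makes the SQLite backend easy to manage while still supporting an optional Parquet export.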

Contributing

Contributions are welcome! Fork the repository, make your changes, and create a pull request. Ensure your code follows the project's standards and includes tests. See Contributing for guidelines.

License

This dataset is licensed under the MIT License.

Citations

Please use the following BibTeX entry to cite this dataset:

@software{thinking-dataset,
  author = {Rawson, Kara and Pollack, Joseph and others},
  title = {Thinking-Dataset: Leveraging Real-World Data for Strategic Business Insights and STaR Case Study Generation},
  year = {2025},
  howpublished = {\url{https://github.com/MultiTonic/thinking-dataset}},
  note = {Accessed: 2025-01-25}
}

Acknowledgements

Special thanks to our contributors:

  • Kara Rawson - Lead Engineer
  • Joseph Pollack - Creator & Business Leader
  • MultiTonic Team - Support and Collaboration
  • Hugging Face - Robust tools and infrastructure for dataset management

Contact

For questions or support, please contact us at: