TON Copilot Dataset is an initiative to create a comprehensive and continuously updated dataset for training Large Language Models (LLMs) specialized in the Telegram Open Network (TON) ecosystem. This dataset aims to serve as a foundation for developing AI-powered assistants and tools that can effectively support developers, users, and enthusiasts within the TON ecosystem.
Before you begin, make sure you have the following installed:

- Node.js (version 14 or later)
- npm (usually comes with Node.js)
- Git
To set up the project:

1. Clone the repository:

   ```bash
   git clone https://github.com/your-username/ton-copilot-dataset.git
   cd ton-copilot-dataset
   ```

2. Install dependencies:

   ```bash
   npm install
   ```

3. Build the project:

   ```bash
   npm run build
   ```
This project uses environment variables for configuration. Follow these steps to set up your environment:

1. Create a `.env` file in the root directory of the project.

2. Add your GitHub personal access token to the `.env` file:

   ```
   GITHUB_TOKEN=your_github_personal_access_token_here
   ```

   Replace `your_github_personal_access_token_here` with your actual GitHub token.
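Inside the project's scripts, the token is then read from `process.env`. Below is a minimal sketch of that loading step, assuming the project uses the `dotenv` package; the actual loading code may differ:

```typescript
import * as dotenv from "dotenv";

// Load variables from the .env file into process.env.
dotenv.config();

const token = process.env.GITHUB_TOKEN;
if (!token) {
  throw new Error("GITHUB_TOKEN is not set; add it to your .env file.");
}
```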
To list repositories for GitHub accounts:

1. Create a file named `github_account_list.txt` in the project root directory and add GitHub account names (users or organizations), one per line.

2. Run the following command:

   ```bash
   npm run list-repos
   ```

This will generate a list of repositories in `./misc/github_repos_list.json`.
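Under the hood, listing repositories amounts to paging through the GitHub REST API for each account. The sketch below shows the general shape; it is an assumption rather than the project's actual code, and it uses the global `fetch` available in Node 18+. The `/users/{account}/repos` endpoint returns public repositories for both users and organizations:

```typescript
// List repositories for a GitHub user or organization via the REST API.
// The token raises rate limits and grants access to repos you can see.
async function listRepos(account: string, token: string): Promise<string[]> {
  const names: string[] = [];
  for (let page = 1; ; page++) {
    const res = await fetch(
      `https://api.github.com/users/${account}/repos?per_page=100&page=${page}`,
      {
        headers: {
          Authorization: `Bearer ${token}`,
          Accept: "application/vnd.github+json",
        },
      },
    );
    if (!res.ok) throw new Error(`GitHub API error: ${res.status}`);
    const repos = (await res.json()) as { full_name: string }[];
    if (repos.length === 0) break; // no more pages
    names.push(...repos.map((r) => r.full_name));
  }
  return names;
}
```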
To clone the repositories:

1. Ensure you have a JSON file with the repository information (e.g., `github_repos_list.json`).

2. Run the following command:

   ```bash
   npm run clone-repos -- /path/to/your/github_repos_list.json -o ./cloned_repos
   ```

   Replace `/path/to/your/github_repos_list.json` with the actual path to your JSON file.

This will clone the repositories into the `./cloned_repos` directory, organized by account name.
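Conceptually, the clone step shells out to `git clone` once per `account/repo` entry and groups checkouts by account. A simplified sketch of that idea follows; the function name and the shallow-clone choice are assumptions, not the project's actual implementation:

```typescript
import { execFileSync } from "node:child_process";
import { mkdirSync } from "node:fs";
import * as path from "node:path";

// Clone each "account/repo" entry into <outDir>/<account>/<repo>.
function cloneRepos(fullNames: string[], outDir: string): void {
  for (const fullName of fullNames) {
    const [account, repo] = fullName.split("/");
    const accountDir = path.join(outDir, account);
    mkdirSync(accountDir, { recursive: true });
    execFileSync(
      "git",
      [
        "clone",
        "--depth", "1", // shallow clone: dataset building only needs a snapshot
        `https://github.com/${fullName}.git`,
        path.join(accountDir, repo),
      ],
      { stdio: "inherit" },
    );
  }
}
```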
Additional development commands:

- Format code:

  ```bash
  npm run format
  ```

- Check code formatting:

  ```bash
  npm run format:check
  ```

For more detailed information about each command, refer to the source code or run the command with the `--help` flag.
Key features:

- Multi-source Data Collection
  - Official documentation
  - GitHub repositories
  - Telegram group messages
- Continuous Updates
  - Automated data ingestion from various sources
  - Regular refresh cycles to maintain relevance
- Domain-specific Focus
  - Tailored for TON ecosystem expertise
  - Covers smart contracts, blockchain architecture, and TON-specific protocols
The dataset is built according to the following methodology:

- Source Identification
  - Identify authoritative and relevant data sources
  - Ensure diverse representation of content types
- Data Collection
  - Implement web scraping for documentation and GitHub repos (see the scraping sketch after this list)
  - Utilize the Telegram API for group message extraction
  - Establish data collection frequency and update mechanisms
- Data Preprocessing
  - Clean and normalize text data
  - Remove duplicates and irrelevant content (see the deduplication sketch after this list)
  - Standardize formatting across different sources
- Data Annotation
  - Develop a tagging system for content categorization
  - Implement named entity recognition for TON-specific terms (see the glossary-tagging sketch after this list)
  - Create a glossary of domain-specific terminology
- Quality Assurance
  - Establish manual review processes for data accuracy
  - Implement automated checks for data integrity
  - Regularly validate the dataset against expert knowledge
- Version Control
  - Maintain dataset versioning for traceability
  - Document changes and updates between versions
- Ethical Considerations
  - Ensure compliance with data privacy regulations
  - Obtain necessary permissions for data usage
  - Anonymize personal information in Telegram messages (see the anonymization sketch after this list)
- Scalability Planning
  - Design the data pipeline to handle increasing data volumes
  - Implement efficient storage and retrieval mechanisms
- Evaluation Metrics
  - Develop benchmarks for assessing dataset quality (see the benchmark sketch after this list)
  - Create test sets for evaluating LLM performance on TON-specific tasks
- Community Engagement
  - Establish feedback loops with TON developers and users
  - Incorporate community contributions and corrections
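The sketches below make some of these phases concrete. They are illustrative assumptions written in TypeScript, not the project's actual implementation. First, scraping a documentation page: fetching HTML and reducing it to plain text. A real scraper should use a proper HTML parser and respect robots.txt; the regex stripping here is only a placeholder:

```typescript
// Fetch a documentation page and strip markup down to plain text.
// Illustrative only: production scraping needs a real HTML parser.
async function fetchDocPage(url: string): Promise<string> {
  const res = await fetch(url); // global fetch, Node 18+
  if (!res.ok) throw new Error(`Failed to fetch ${url}: ${res.status}`);
  const html = await res.text();
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, "") // drop embedded scripts
    .replace(/<style[\s\S]*?<\/style>/gi, "") // drop stylesheets
    .replace(/<[^>]+>/g, " ") // strip remaining tags
    .replace(/\s+/g, " ") // collapse whitespace
    .trim();
}
```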
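For preprocessing, a common way to remove exact duplicates is to normalize each document and hash it, keeping only the first occurrence. The normalization rules here (lowercasing, whitespace collapsing) are assumptions; near-duplicate detection would need fuzzier techniques on top of the same loop:

```typescript
import { createHash } from "node:crypto";

// Normalize a document for comparison: lowercase and collapse whitespace.
function normalize(text: string): string {
  return text.toLowerCase().replace(/\s+/g, " ").trim();
}

// Keep only the first occurrence of each normalized document.
function dedupe(docs: string[]): string[] {
  const seen = new Set<string>();
  const unique: string[] = [];
  for (const doc of docs) {
    const digest = createHash("sha256").update(normalize(doc)).digest("hex");
    if (!seen.has(digest)) {
      seen.add(digest);
      unique.push(doc);
    }
  }
  return unique;
}
```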
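For annotation, a glossary-driven tagger is one lightweight starting point for recognizing TON-specific terms. The glossary entries and categories below are examples only, not the project's actual glossary:

```typescript
// Example glossary entries; the real project glossary would be far larger.
const glossary: Record<string, string> = {
  ton: "ecosystem",
  func: "language", // FunC, TON's smart-contract language
  jetton: "token", // TON's fungible-token standard
  workchain: "architecture",
};

// Tag every glossary term that appears in a document (whole words only).
function tagTerms(text: string): { term: string; category: string }[] {
  const tags: { term: string; category: string }[] = [];
  for (const [term, category] of Object.entries(glossary)) {
    const pattern = new RegExp(`\\b${term}\\b`, "i");
    if (pattern.test(text)) tags.push({ term, category });
  }
  return tags;
}
```

Note that glossary terms containing regex special characters would need escaping before being interpolated into the pattern.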
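For anonymizing Telegram messages, a rule-based pass can mask obvious identifiers before anything is stored. These regex patterns are illustrative and would need review for real privacy compliance:

```typescript
// Mask common personal identifiers in a Telegram message.
// Rule-based and illustrative; real compliance work needs broader coverage.
function anonymize(message: string): string {
  return message
    .replace(/@[A-Za-z0-9_]{5,32}/g, "@[user]") // Telegram usernames
    .replace(/\+?\d[\d\s().-]{7,}\d/g, "[phone]") // phone numbers
    .replace(/[\w.+-]+@[\w-]+\.[\w.]+/g, "[email]"); // email addresses
}
```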
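Finally, for evaluation, even a small held-out set of TON questions with reference answers supports a basic exact-match benchmark; richer scoring (semantic similarity, expert review) would build on the same loop. The harness below is hypothetical, with `ask` standing in for any model under test:

```typescript
interface EvalCase {
  question: string;
  reference: string;
}

// Score a model (any async question-answering function) by exact match.
async function exactMatchScore(
  ask: (q: string) => Promise<string>,
  cases: EvalCase[],
): Promise<number> {
  let correct = 0;
  for (const c of cases) {
    const answer = (await ask(c.question)).trim().toLowerCase();
    if (answer === c.reference.trim().toLowerCase()) correct++;
  }
  return correct / cases.length;
}
```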
By following this methodology, the TON Copilot Dataset project aims to create a high-quality, domain-specific dataset that will enable the development of advanced AI models tailored to the TON ecosystem.