GTT (Git to Text) extracts GitHub repository content into a single text file for easy analysis by large language models. It also extracts images into a separate folder.
- Clone and extract GitHub repo content.
- Combine README and code files into one text file.
- Exclude files larger than a specified character limit (default: 50,000).
- Extract images into a separate folder.
- Print a summary with the word count and an estimated token count.
- (Optional) Create and activate a virtual environment:
python3 -m venv venv
source venv/bin/activate
- Install the required libraries:
pip install GitPython tqdm
To extract contents from a GitHub repo:
python main.py <github_url> [options]
Example:
python main.py https://github.com/torvalds/linux.git -df 0 -maxchar 100000
- <github_url>: URL of the GitHub repository.
- -df, --delete-folder: (Optional) Delete the downloaded repository folder after extraction. Default is 1 (delete the folder). Set to 0 to retain.
- -maxchar: (Optional) Maximum character count per file. Default: 50000.
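The options above could be parsed with argparse roughly as follows. This is a minimal sketch, not the actual contents of main.py: the function name build_parser and the help strings are assumptions, while the flag names and defaults are taken from the option list.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical helper; main.py may structure its argument parsing differently.
    parser = argparse.ArgumentParser(
        description="Extract a GitHub repository into a single text file."
    )
    parser.add_argument("github_url", help="URL of the GitHub repository")
    parser.add_argument(
        "-df",
        "--delete-folder",
        type=int,
        choices=[0, 1],
        default=1,
        help="1 deletes the cloned folder after extraction, 0 keeps it",
    )
    parser.add_argument(
        "-maxchar",
        type=int,
        default=50000,
        help="maximum character count per file",
    )
    return parser


if __name__ == "__main__":
    # Mirrors the example invocation shown earlier.
    args = build_parser().parse_args(
        ["https://github.com/torvalds/linux.git", "-df", "0", "-maxchar", "100000"]
    )
    print(args.github_url, args.delete_folder, args.maxchar)
```

With this layout, omitting both flags leaves delete_folder at 1 and maxchar at 50000, matching the documented defaults.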
- Combined Text File: <repo_name>_combined.txt (e.g., linux_combined.txt) containing:
  - Repo structure, README content, and relevant files.
- Images Folder: Extracted images saved to an images folder.
- Summary: Word count and estimated token count printed at the end.
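One way the combined text file might be produced is sketched below. The function name combine_files, the "--- path ---" separator, and the rule of skipping files that fail UTF-8 decoding are all assumptions for illustration; only the character limit and the skipped .git folder come from this README.

```python
import os


def combine_files(repo_dir: str, out_path: str, max_chars: int = 50000) -> int:
    """Concatenate readable text files under repo_dir into out_path.

    Returns the number of files written. out_path should live outside
    repo_dir so the output file is not picked up by the walk.
    """
    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for root, dirs, files in os.walk(repo_dir):
            # Prune .git in place so os.walk never descends into it.
            dirs[:] = sorted(d for d in dirs if d != ".git")
            for name in sorted(files):
                path = os.path.join(root, name)
                try:
                    with open(path, "r", encoding="utf-8") as fh:
                        text = fh.read()
                except (UnicodeDecodeError, OSError):
                    continue  # assumed behavior: skip binary or unreadable files
                if len(text) > max_chars:
                    continue  # file exceeds the per-file character limit
                rel = os.path.relpath(path, repo_dir)
                out.write(f"\n--- {rel} ---\n{text}\n")
                written += 1
    return written
```

Calling combine_files("linux", "linux_combined.txt", max_chars=100000) would then write every text file of at most 100,000 characters from a local linux checkout into linux_combined.txt.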
- Token count is estimated as 2x the word count.
- The script skips the .git folder.
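The token estimate can be illustrated with a short sketch. It assumes words are counted by splitting on whitespace, which this README does not specify; the function name summarize is also hypothetical. Only the 2x multiplier is taken from the note above.

```python
def summarize(text: str) -> tuple:
    """Return (word_count, estimated_token_count) for the combined text."""
    # Assumption: a "word" is any whitespace-separated chunk.
    words = len(text.split())
    # Rough heuristic from the README: tokens are estimated as 2x the words.
    return words, 2 * words


if __name__ == "__main__":
    word_count, token_estimate = summarize("int main(void) { return 0; }")
    print(f"Words: {word_count}, estimated tokens: {token_estimate}")
```

Because the multiplier is a fixed heuristic, the result is only a rough guide; a real tokenizer may report a noticeably different count for the same text.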
This project is licensed under the MIT License.