gh_llm_loader
is a package designed to clone GitHub repositories and prepare their data for ingestion for LLMs.
- Prepare Repository Data: Compiles the contents of repositories into a single, clean file by excluding specified folders and files, or including only specified folders and files. This streamlined format is more accessible for LLM ingestion.
- Flexible File Filtering: Filter files based on extensions, filenames, or custom functions for maximum control over the included content.
- CLI and Library Integration: Flexibility for various use cases and workflows.
To install gh_llm_loader
, make sure you have Python installed on your system, then run the following command:
git clone https://github.com/mewmix/gh_llm_loader
cd gh_llm_loader
pip install .
Prerequisites:
- Python 3.6 or newer
- Git installed on your system
gh_llm_loader
can be easily integrated into Python scripts. Here's an example of how to use it:
from gh_llm_loader import clone_and_prepare_repo
# Define the GitHub repository URL
git_url = "https://github.com/yourusername/yourrepository.git"
# Clone and prepare the repository, specifying folders and files to ignore or include
clone_and_prepare_repo(git_url, ignored_folders={'node_modules', '.git'}, ignored_files={'README.md'}, included_folders={'.teamcity'}, file_filter=lambda f: f.endswith('.xml'))
This function will clone the specified GitHub repository and prepare its data by compiling the files into a single file, excluding any folders or files specified, and including only the specified folders and files that match the filter criteria.
If you wish to simply curate a non github folder with the same methods the core function is available for import and use-
import os
from gh_llm_loader import compile_files_to_single_file
# Specify the path to your project directory
source_path = "/path/to/your/project"
# Define the name for the output file
output_filename = "project_compiled.txt"
# Specify any folders or files you want to ignore during compilation
ignored_folders = {'node_modules', '.git', 'build'}
ignored_files = {'README.md', 'LICENSE'}
# Compile the project files into a single file
compile_files_to_single_file(source_path, output_filename, ignored_folders, ignored_files)
print(f"Compilation complete. The output is saved in {output_filename}")
If you wish to simply curate a non github folder with the same methods the core function is available for import and use-
import os
from gh_llm_loader import compile_files_to_single_file
# Specify the path to your project directory
source_path = "/path/to/your/project"
# Define the name for the output file
output_filename = "project_compiled.txt"
# Specify any folders or files you want to ignore during compilation
ignored_folders = {'node_modules', '.git', 'build'}
ignored_files = {'README.md', 'LICENSE'}
# Compile the project files into a single file
compile_files_to_single_file(source_path, output_filename, ignored_folders, ignored_files)
print(f"Compilation complete. The output is saved in {output_filename}")
For those preferring to use the command line, here are some examples:
a) Only use local folders, not Github
gh-llm-loader --base-dir test
b) Include only Python files (with github):
gh-llm-loader --git-url https://github.com/psf/requests --file-filter "lambda f: f.endswith('.py')"
c) Include only Markdown and text files (with github):
gh-llm-loader --git-url https://github.com/tensorflow/models --file-filter "lambda f: f.endswith('.md') or f.endswith('.txt')"
d) Include only files with "test" in the filename (with github):
gh-llm-loader --git-url https://github.com/django/django --file-filter "lambda f: 'test' in f"
e) Include only JavaScript files and the "package.json" file (with github):
gh-llm-loader --git-url https://github.com/facebook/react --file-filter "lambda f: f.endswith('.js') or f == 'package.json'"
f) Include only files in the "src" and "docs" folders (with github):
gh-llm-loader --git-url https://github.com/vuejs/vue --included-folders src docs
g) Exclude the "tests" folder and include only Python files (with github):
gh-llm-loader --git-url https://github.com/pallets/flask --ignored-folders tests --file-filter "lambda f: f.endswith('.py')"
Contributions to gh_llm_loader
are highly encouraged and appreciated.
gh_llm_loader
is made available under the MIT License. For more details, see the LICENSE file in the repository.