This is a tool designed to simplify the aggregation and preprocessing of various data sources for ingestion into large language models (LLMs). GitHub repositories, local directories, academic papers, YouTube transcripts, and web pages are processed into LLM-ready text through a single command-line interface.
- Sources: Extract text from GitHub repositories, local repo directories, webpages, YouTube transcripts, and arXiv papers.
- Integration: Supports Jupyter Notebook (.ipynb) and PDF file formats.
- Web Crawling: Extract data from web sources by following links to a specified depth.
- Preprocessing: Outputs are generated in both compressed and uncompressed formats. Compressed output removes stopwords and whitespace and converts the text to lowercase to minimize token usage (see the sketch after this feature list).
- Clipboard: Uncompressed text is automatically copied to the clipboard, ready for pasting into an LLM.
- Token Counts: Token counts provided for compressed and uncompressed outputs.
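The compressed output is produced by a simple normalization pass over the extracted text. The following is a minimal sketch of that idea, assuming a small hard-coded stopword list; onefilellm's actual implementation may source its stopwords differently (for example, from NLTK).

import re

# Illustrative compression step: lowercase, collapse whitespace, drop stopwords.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are"}

def compress_text(text: str) -> str:
    text = text.lower()                       # convert to lowercase
    text = re.sub(r"\s+", " ", text).strip()  # remove extra whitespace
    return " ".join(w for w in text.split(" ") if w not in STOPWORDS)

print(compress_text("The  quick brown fox is in the yard"))  # quick brown fox yard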
Before using, ensure you have the following dependencies installed:
pip install -U -r requirements.txt
You may also wish to create a virtual environment to manage dependencies:
python -m venv .venv
source .venv/bin/activate
pip install -U -r requirements.txt
For accessing private repositories on GitHub, generate a GitHub personal access token as outlined in the 'Obtaining a GitHub Personal Access Token' section.
Clone the repository or download the source code. No additional installation is required.
python onefilellm.py
Enter the path or URL for ingestion:
The tool supports various input options, including:
- GitHub repository URL (e.g., https://github.com/jimmc414/onefilellm)
- arXiv abstract URL (e.g., https://arxiv.org/abs/2401.14295)
- Local folder path (e.g., C:\python\PipMyRide)
- YouTube video URL (e.g., https://www.youtube.com/watch?v=KZ_NlnmPQYk)
- Webpage URL (e.g., https://llm.datasette.io/en/stable/)
- uncompressed_output.txt: Full text output, automatically copied to the clipboard.
- compressed_output.txt: Cleaned and compressed text (all lowercase, whitespace and stop words removed).
- processed_urls.txt: List of all URLs processed during web crawling.
- To console: Token counts for both output files (see the sketch below).
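The console token counts can be reproduced with a tokenizer library such as tiktoken; the encoding name below is an assumption for illustration, not necessarily the one used by onefilellm.py.

import tiktoken

# Count tokens in both output files (cl100k_base is an assumed encoding).
enc = tiktoken.get_encoding("cl100k_base")
for path in ("uncompressed_output.txt", "compressed_output.txt"):
    with open(path, encoding="utf-8") as f:
        print(path, len(enc.encode(f.read())), "tokens")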
A GitHub Personal Access Token (PAT) is required to authenticate with the GitHub API and access private repositories. Follow these steps to generate a token:
Log in to your GitHub account and navigate to the Settings page by clicking on your profile picture in the top-right corner and selecting Settings.
In the left sidebar, click on Developer settings.
Click on Personal access tokens in the left sidebar.
Click the Generate new token button.
Enter a name for the token in the Note field (e.g., "Repo-Prep").
Select the appropriate scopes for the token. For the onefilellm.py script, the minimum required scope is repo (which grants full control of private repositories). You may need to select additional scopes depending on your use case.
Click the Generate token button at the bottom of the page.
In the onefilellm.py script, replace the GITHUB_TOKEN placeholder with your actual token, or set the GITHUB_TOKEN environment variable as described below so the script can pull the token from your environment automatically.
- Add the GitHub Personal Access Token to the environment variable GITHUB_TOKEN
- Windows:
setx GITHUB_TOKEN "YourGitHubToken"
- Linux:
echo 'export GITHUB_TOKEN="YourGitHubToken"' >> ~/.bashrc
source ~/.bashrc
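As a rough sketch of how the script can pick the token up from the environment (the exact handling in onefilellm.py may differ), it can fall back to the placeholder when the variable is unset:

import os
import requests

# Prefer the GITHUB_TOKEN environment variable; fall back to the placeholder.
GITHUB_TOKEN = os.environ.get("GITHUB_TOKEN", "YourGitHubToken")

# Illustrative authenticated request against the GitHub API.
headers = {"Authorization": f"token {GITHUB_TOKEN}"}
response = requests.get("https://api.github.com/user/repos", headers=headers)
print(response.status_code)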
- For repositories, modify this line of code to add or remove the file types that are processed:
allowed_extensions = ['.py', '.txt', '.js', '.rst', '.sh', '.md', '.pyx', '.html', '.yaml','.json', '.jsonl', '.ipynb', '.h', '.c', '.sql', '.csv']
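For illustration only, a file filter based on this list might look like the snippet below (the helper name is hypothetical, not taken from onefilellm.py):

import os

allowed_extensions = ['.py', '.txt', '.js', '.rst', '.sh', '.md', '.pyx', '.html',
                      '.yaml', '.json', '.jsonl', '.ipynb', '.h', '.c', '.sql', '.csv']

def is_allowed(filename: str) -> bool:
    # Keep a file only if its extension appears in the allowed list.
    return os.path.splitext(filename)[1].lower() in allowed_extensions

print(is_allowed("model.py"))     # True
print(is_allowed("weights.bin"))  # False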
- For web scraping, modify this line of code to change how many links deep from the starting URL are included:
max_depth = 2
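Conceptually, max_depth bounds a breadth-first crawl starting at the given URL: depth 0 is the starting page, depth 1 its links, and so on. The sketch below (using requests and BeautifulSoup, with a hypothetical crawl function) illustrates the idea rather than onefilellm's exact crawler:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def crawl(start_url, max_depth=2):
    # Breadth-first crawl limited to max_depth links from the starting URL.
    visited, frontier = set(), [(start_url, 0)]
    while frontier:
        url, depth = frontier.pop(0)
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        for link in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            frontier.append((urljoin(url, link["href"]), depth + 1))
    return visited

print(len(crawl("https://llm.datasette.io/en/stable/", max_depth=2)))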
Please note that the main script has been renamed from 1filellm.py to onefilellm.py. This change was made to adhere to Python naming conventions and improve clarity. The functionality of the script remains the same; use onefilellm.py instead of 1filellm.py in all the commands and instructions mentioned in this README.