- Scraping raw code from GitHub
- Filtering out irrelevant code (see the sketch after this list)
- Preparation for the HuggingFace Dataset (tokenization, deduplication, etc.)
- Generation of QA pairs
- Pushing to the HuggingFace Hub
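
The filtering and deduplication steps above are implemented by the pipeline scripts in `dataset_creation`. As a rough sketch of what such a step could look like (not the project's actual code), the helpers below filter cloned repositories down to likely UI source files and drop exact duplicates; the extensions, excluded directories, and function names are illustrative assumptions.

```python
# Hypothetical sketch of the "filtering" and "dedupe" stages; all heuristics
# and names here are assumptions, not the real pipeline implementation.
import hashlib
from pathlib import Path

RELEVANT_EXTENSIONS = {".ts", ".tsx", ".js", ".jsx", ".css"}  # assumed filter

def filter_relevant_files(repo_dir: str) -> list[Path]:
    """Keep only likely UI/source files; skip dependencies and build output."""
    files = []
    for path in Path(repo_dir).rglob("*"):
        if path.is_file() and path.suffix in RELEVANT_EXTENSIONS:
            if "node_modules" not in path.parts and ".next" not in path.parts:
                files.append(path)
    return files

def dedupe_exact(snippets: list[str]) -> list[str]:
    """Drop exact duplicates by content hash (near-duplicates need e.g. MinHash)."""
    seen, unique = set(), []
    for text in snippets:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique
```
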
- Go to https://github.com/settings/tokens and create a new token by clicking the `Generate New Token` button. Give it read access to public repositories.
- Copy the access token and set the env variable via `export GH_ACCESS_TOKEN=<copied access token>`.
- `cd dataset_creation` and run `python clone_hf_repos.py` (a hedged sketch of this script follows the list).
- The data in `synth_source_repos` should look like this:

  ```
  alifarooq9/rapidlaunch
  DarkInventor/easy-ui
  horizon-ui/shadcn-nextjs-boilerplate
  ixartz/SaaS-Boilerplate
  lucide-icons/lucide
  moinulmoin/chadnext
  nobruf/shadcn-landing-page
  shadcn-ui/taxonomy
  ```
- Download the NLTK `punkt` tokenizer:

  ```python
  import nltk
  nltk.download('punkt')
  ```
- Run the data pipeline on a machine with 16 CPUs:

  ```
  python pipeline.py
  ```
- Collate and push to the HF Hub:

  ```
  python prepare_hf_dataset.py
  ```
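
For orientation, here is a minimal sketch of what `clone_hf_repos.py` might do, assuming it reads `GH_ACCESS_TOKEN` from the environment and clones each source repository into `synth_source_repos` with plain `git clone`. The repo list shown is abbreviated and the whole function is an assumption, not the script's actual implementation.

```python
# Illustrative sketch of clone_hf_repos.py; the clone strategy and directory
# layout are assumptions based on the README, not the real script.
import os
import subprocess
from pathlib import Path

REPOS = [
    "alifarooq9/rapidlaunch",
    "DarkInventor/easy-ui",
    "lucide-icons/lucide",
    # ... remaining repos from the lists further below
]

def clone_repos(target_dir: str = "synth_source_repos") -> None:
    token = os.environ["GH_ACCESS_TOKEN"]  # set via `export GH_ACCESS_TOKEN=...`
    for repo in REPOS:
        dest = Path(target_dir) / repo  # e.g. synth_source_repos/lucide-icons/lucide
        dest.parent.mkdir(parents=True, exist_ok=True)
        if not dest.exists():
            url = f"https://{token}@github.com/{repo}.git"
            subprocess.run(["git", "clone", "--depth", "1", url, str(dest)], check=True)

if __name__ == "__main__":
    clone_repos()
```
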
- Go to https://huggingface.co/settings/tokens and create a new token by clicking the `Create new token` button. Select the `read` scope and click `Create token`.
- Copy the access token and set the env variable via `export HF_TOKEN=<copied access token>`.
- `cd dataset_creation` and run `python generate_qa_pairs.py` (this will take a while).
- Run `push_qa_pairs.py` to push the dataset to the HuggingFace Hub (see the sketch after this list).
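
A hedged sketch of what pushing the QA pairs to the Hub could look like with the `datasets` library follows; the `qa_pairs.jsonl` file name and the target repo id are placeholders, and `push_qa_pairs.py` may work differently.

```python
# Illustrative only: the input file, fields, and repo id are assumptions.
import os
from datasets import load_dataset

def push_qa_pairs() -> None:
    token = os.environ["HF_TOKEN"]  # set via `export HF_TOKEN=<copied access token>`
    dataset = load_dataset("json", data_files="qa_pairs.jsonl", split="train")
    dataset.push_to_hub("your-username/SynthUI-Code-Instruct", token=token)  # placeholder repo id

if __name__ == "__main__":
    push_qa_pairs()
```
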
The source repositories used for dataset creation are grouped by type (license noted in the comments):

```python
ICON_REPOS = [
"lucide-icons/lucide" # ISC
]
UI_REPOS = [
"shadcn-ui/ui", # MIT
"DarkInventor/easy-ui" # MIT
]
CODE_REPOS = [
"moinulmoin/chadnext", # MIT
"shadcn-ui/taxonomy", # MIT
"horizon-ui/shadcn-nextjs-boilerplate", # MIT
"alifarooq9/rapidlaunch", # MIT
"ixartz/SaaS-Boilerplate", # MIT
"nobruf/shadcn-landing-page" # None
]
```

The datasets (especially the QA dataset) are meant to fine-tune instruction models. Great examples of resulting models are Octocoder or StarChat.
Note: Since the resources for this project were limited, the fine-tuning is optimized to run on a single A100 GPU with the highest RAM settings on Colab.
- Codebase: The codebase uses Flash Attention V2 support in Transformers (a hedged loading sketch follows this list).
- Colab Notebook: At least one A100 GPU is required.
- Model to be fine-tuned: bigcode/starcoderplus
- Dataset: JulianAT/SynthUI-Code-Instruct-2k-v1
- Trained Model: JulianAT/StarCoder-Plus-SynthUI-Code-Instruct-2k-v1
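
The exact training configuration lives in the Colab notebook. As a hedged sketch only, the snippet below shows how the base model and dataset named above can be loaded with Flash Attention 2 enabled in a recent Transformers version; the `attn_implementation` argument, `bfloat16` dtype, and split name are assumptions about the setup, not a copy of the notebook.

```python
# Sketch only: loading options and dtype are assumptions, not the notebook's code.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

dataset = load_dataset("JulianAT/SynthUI-Code-Instruct-2k-v1", split="train")

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoderplus")
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoderplus",
    torch_dtype=torch.bfloat16,                # helps fit a single A100
    attn_implementation="flash_attention_2",   # requires the flash-attn package
    device_map="auto",
)
```
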
This project was done as a proof of concept for the potential of fine-tuning instruction models with synthetic data. For the scope of this project, the synthetic data was derived from code scraped from GitHub; however, the approach can easily be replicated with other data sources.
The resulting model was used for code generation in the Synth UI web app (a usage sketch follows).
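
As an illustration of using the trained model for code generation, here is a minimal sketch with the Transformers `pipeline` API; the prompt format and decoding settings are assumptions and may not match the instruction template used by Synth UI.

```python
# Minimal generation sketch; prompt template and sampling settings are assumptions.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="JulianAT/StarCoder-Plus-SynthUI-Code-Instruct-2k-v1",
    device_map="auto",
)

prompt = "Question: Build a pricing section with three tiers using shadcn/ui cards.\n\nAnswer:"
result = generator(prompt, max_new_tokens=256, do_sample=True, temperature=0.2)
print(result[0]["generated_text"])
```
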