This project is a text generator built with PyTorch Lightning. It uses the BookCorpus dataset from Hugging Face and a custom tokenizer to train a model that generates text.
text-generator/
├── __pycache__/
├── checkpoints/
├── lightning_logs/
├── app.py
├── bookcorpus.txt
├── data.py
├── model.py
├── tokenizer.json
├── requirements.txt
└── README.md
The dataset used in this project is BookCorpus, available on Hugging Face. BookCorpus is a large-scale dataset of 11,038 books written by unpublished authors, and it is widely used for NLP tasks such as language modeling and text generation.
For more information on the BookCorpus dataset, visit the Hugging Face BookCorpus page.
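As a rough orientation, the snippet below shows how the dataset could be downloaded with the Hugging Face datasets library and written out to bookcorpus.txt. It is only a sketch of what data.py might do; the line limit and preprocessing details are assumptions, not the repository's actual logic.

```python
# Sketch only: download BookCorpus via the Hugging Face `datasets` library and
# write it to bookcorpus.txt. The real data.py may differ in its preprocessing.
from datasets import load_dataset

def download_bookcorpus(out_path: str = "bookcorpus.txt", limit: int = 100_000) -> None:
    # `limit` caps the number of lines written, since the full corpus is very large.
    # Newer versions of `datasets` may require trust_remote_code=True here.
    dataset = load_dataset("bookcorpus", split="train")
    with open(out_path, "w", encoding="utf-8") as f:
        for i, example in enumerate(dataset):
            if i >= limit:
                break
            f.write(example["text"].strip() + "\n")

if __name__ == "__main__":
    download_bookcorpus()
```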
To set up the environment and install the required dependencies, follow these steps (the Anaconda platform is used to manage the environment):
- Clone the repository:
git clone https://github.com/bhuvi-ai/Text-Generator-Using-Pytorch.git
cd Text-Generator-Using-Pytorch
- Create and activate a Conda environment:
conda create -n text-generator python=3.10.0
conda activate text-generator
- Install the required packages:
pip install -r requirements.txt
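For reference, the dependency set implied by this README (PyTorch Lightning, Hugging Face datasets, the tokenizers library, and Streamlit) would give a requirements.txt roughly like the one below; the actual file and its version pins in the repository may differ.

```text
torch
pytorch-lightning
datasets
tokenizers
streamlit
```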
How to Run
- Prepare the dataset:
Use data.py to download and preprocess the dataset:
python data.py
Ensure that the bookcorpus.txt file is placed in the project directory. This file should contain the preprocessed text data from the BookCorpus dataset.
- Train the model:
To train the text generator model, run the following command:
python model.py
- Run the Streamlit app: To launch the Streamlit app and interact with the text generator, run:
streamlit run app.py
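For orientation, app.py presumably loads the trained checkpoint and the tokenizer and wraps them in a small Streamlit UI. The sketch below only illustrates that shape: the TextGenerator class name, the checkpoint path, and the generate() call are hypothetical placeholders and would need to match the actual model.py.

```python
# Illustrative sketch of a Streamlit front end; TextGenerator, the checkpoint
# path, and generate() are hypothetical placeholders, not the repo's actual API.
import streamlit as st
from tokenizers import Tokenizer

from model import TextGenerator  # hypothetical class name; match model.py

@st.cache_resource
def load_artifacts():
    tokenizer = Tokenizer.from_file("tokenizer.json")
    model = TextGenerator.load_from_checkpoint("checkpoints/last.ckpt")  # assumed path
    model.eval()
    return tokenizer, model

st.title("Text Generator")
prompt = st.text_input("Prompt", "once upon a time")
max_tokens = st.slider("Tokens to generate", 10, 200, 50)

if st.button("Generate"):
    tokenizer, model = load_artifacts()
    st.write(model.generate(tokenizer, prompt, max_tokens))  # hypothetical method
```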
Explanation of Files
__pycache__/: Directory containing Python bytecode cache files.
checkpoints/: Directory where model checkpoints are saved during training.
lightning_logs/: Directory where PyTorch Lightning logs are stored.
app.py: Streamlit app for interacting with the text generator.
bookcorpus.txt: Preprocessed text data from the BookCorpus dataset.
data.py: Script for data preprocessing and loading.
model.py: Script defining the text generator model and training process (an illustrative sketch is given below).
tokenizer.json: Custom tokenizer configuration file.
requirements.txt: File containing the list of dependencies required for the project.
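To make the training setup concrete, the following is a minimal sketch of what the LightningModule and training entry point in model.py could look like. It is not the repository's implementation: the LSTM architecture, the hyperparameters, and the TextDataModule import from data.py are assumptions; only the PyTorch Lightning API calls themselves are standard.

```python
# Minimal sketch of a LightningModule for next-token prediction. The LSTM
# architecture, hyperparameters, and TextDataModule are assumptions, not the
# repository's actual code.
import torch
import torch.nn as nn
import pytorch_lightning as pl

class TextGenerator(pl.LightningModule):
    def __init__(self, vocab_size: int = 30_000, embed_dim: int = 256, hidden_dim: int = 512):
        super().__init__()
        self.save_hyperparameters()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, token_ids):
        # token_ids: LongTensor of shape [batch, seq_len]
        output, _ = self.lstm(self.embedding(token_ids))
        return self.head(output)

    def training_step(self, batch, batch_idx):
        # Standard language-modelling shift: predict token t+1 from tokens up to t.
        inputs, targets = batch[:, :-1], batch[:, 1:]
        logits = self(inputs)
        loss = self.loss_fn(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

if __name__ == "__main__":
    from data import TextDataModule  # hypothetical DataModule assumed to live in data.py

    model = TextGenerator()
    trainer = pl.Trainer(
        max_epochs=5,
        callbacks=[pl.callbacks.ModelCheckpoint(dirpath="checkpoints/")],  # matches checkpoints/
    )
    trainer.fit(model, datamodule=TextDataModule("bookcorpus.txt", "tokenizer.json"))
```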