This project implements a next-word prediction model for Thai text using Long Short-Term Memory (LSTM) networks and natural language processing techniques. The model is trained to predict the next word given a sequence of words from Thai text.
The project includes a Python script, `model.py`, that performs the following tasks:
- Data Preparation: Reads Thai text data, tokenizes it, and prepares sequences for training.
- Model Training: Builds and trains an LSTM-based model for predicting the next word in a sequence.
- Prediction: Uses the trained model to predict the next word based on a given input text.
- `model.py`: The main script that contains the code for data preparation, model training, and prediction.
- `data/language.txt`: The text file containing the Thai language data used for training.
Ensure you have the necessary libraries installed. You can install them using pip:
```bash
pip install numpy tensorflow pythainlp
```
- Prepare Data: Ensure that `data/language.txt` contains the Thai text data you want to use for training.
- Run the Script: Execute the `model.py` script to train the model and make predictions:

  ```bash
  python model.py
  ```

- Input Text Prediction: The script will print the predicted next word for the input text "วันนี้".
The script reads the text from `data/language.txt`, tokenizes it using `pythainlp`, and creates sequences of a specified length to be used for training.
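As a reference, here is a minimal sketch of how this preparation step could look with PyThaiNLP and Keras. The sequence length, variable names, and use of the Keras `Tokenizer` are assumptions for illustration and may differ from what `model.py` actually does.

```python
import numpy as np
from pythainlp.tokenize import word_tokenize
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical

SEQ_LEN = 4  # assumed sequence length (context words plus the target word)

# Read the raw Thai text.
with open("data/language.txt", encoding="utf-8") as f:
    text = f.read()

# Tokenize the Thai text into words with PyThaiNLP.
words = word_tokenize(text, keep_whitespace=False)

# Map words to integer indices.
tokenizer = Tokenizer()
tokenizer.fit_on_texts([words])
encoded = tokenizer.texts_to_sequences([words])[0]
vocab_size = len(tokenizer.word_index) + 1

# Build fixed-length sequences: the first SEQ_LEN - 1 words are the input,
# the last word in each sequence is the prediction target.
sequences = np.array([encoded[i - SEQ_LEN:i] for i in range(SEQ_LEN, len(encoded))])
X = sequences[:, :-1]
y = to_categorical(sequences[:, -1], num_classes=vocab_size)
```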
- Embedding Layer: Converts words into dense vectors of fixed size.
- LSTM Layer: A Long Short-Term Memory layer to capture dependencies in sequences.
- Dense Layer: A fully connected layer with a softmax activation function to output probabilities for each word in the vocabulary.
The model is compiled with categorical crossentropy loss and the Adam optimizer. It is trained for 10 epochs.
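A minimal Keras sketch of that architecture is shown below. The embedding dimension and number of LSTM units are assumed values, and `vocab_size`, `X`, and `y` refer to the hypothetical names from the data-preparation sketch above.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = Sequential([
    # Embedding: map each word index to a dense vector of fixed size.
    Embedding(input_dim=vocab_size, output_dim=100),
    # LSTM: capture dependencies across the input sequence.
    LSTM(128),
    # Dense + softmax: probability distribution over the vocabulary.
    Dense(vocab_size, activation="softmax"),
])

model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(X, y, epochs=10, verbose=1)
```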
The trained model predicts the next word based on an input text. The predicted word is displayed in the console.
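A minimal sketch of this prediction step for the input text "วันนี้", reusing the hypothetical `tokenizer`, `model`, and `SEQ_LEN` from the sketches above:

```python
import numpy as np
from pythainlp.tokenize import word_tokenize
from tensorflow.keras.preprocessing.sequence import pad_sequences

seed_text = "วันนี้"
seed_tokens = word_tokenize(seed_text, keep_whitespace=False)
encoded_seed = tokenizer.texts_to_sequences([seed_tokens])[0]

# Pad or truncate to the model's input length before predicting.
padded = pad_sequences([encoded_seed], maxlen=SEQ_LEN - 1)
probs = model.predict(padded, verbose=0)[0]

# Look up the word that corresponds to the most probable index.
predicted_index = int(np.argmax(probs))
predicted_word = tokenizer.index_word.get(predicted_index, "")
print(predicted_word)
```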
This project is licensed under the MIT License.
- PyThaiNLP for Thai language tokenization.
- TensorFlow for the machine learning framework.