Create cool and awkward names with Language Models!
Predicted the name: RUNNATAIDENILOS
Prefix: RU
Context Size: 2
Seed: 1
Language models assign a probability to a word or even a whole sentence. They correct the misspelled words you type on your phone and help your personal assistant understand you.
In this fun project, I used them to make a probabilistic model of the characters of Brazilian names using data from the 2010 census. Then, I used these models to generate new names.
It works by guessing the next letter based on the previous ones. For instance, what is the most probable name given that it starts with Pau...? In English it will probably be Paul, while in Portuguese it will be Paulo. However, if we use a small enough context size (i.e., the number of previous letters used to infer the next one), awkward and cool names start to appear =)
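The idea above can be sketched as a character-level n-gram model. This is only a minimal illustration, not the code in this repo: the toy name list, the `$` end-of-name marker, and the `train` and `generate` helpers are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

# Hypothetical toy data; the project itself uses names from the 2010 census.
NAMES = ["PAULO", "PAULA", "PAOLA", "PEDRO", "CARLA", "CARLOS", "MARIA", "MARIO"]
END = "$"  # assumed end-of-name marker, not taken from the repo


def train(names, context_size):
    """Count which character follows each context of up to `context_size` letters."""
    counts = defaultdict(Counter)
    for name in names:
        padded = name + END
        for i, char in enumerate(padded):
            # Near the start of a name the context is simply shorter.
            context = padded[max(0, i - context_size):i]
            counts[context][char] += 1
    return counts


def generate(counts, context_size, prefix="", seed=None, max_length=20):
    """Sample one name, letter by letter, in proportion to the observed counts."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    name = prefix.upper()
    while len(name) < max_length:
        context = name[-context_size:] if context_size > 0 else ""
        options = counts.get(context)
        if not options:
            break  # unseen context: stop rather than guess
        chars = list(options)
        weights = [options[c] for c in chars]
        char = rng.choices(chars, weights=weights)[0]
        if char == END:
            break
        name += char
    return name


model = train(NAMES, context_size=4)
print(generate(model, 4, prefix="pau", seed=0))  # PAULO or PAULA with this toy data
```

With a large context the sketch can only reproduce names it has seen. Lowering `context_size` lets it stitch together fragments of different names, which is where the unusual ones come from.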
- Cookiecutter Data Science Project Structure
- Python Data Science Tools (Pandas, Numpy, etc)
You can use this project with Docker or install it locally on your machine.
- Docker
or
- Linux/WSL
- Conda
- Clone the repo
git clone https://github.com/renan-cunha/NameGeneratorBR
cd NameGeneratorBR/
- Create environment
make create_environment
conda activate NameGeneratorBR
- Install requirements
make requirements
The repo has five trained models, with context sizes from 0 (i.e., the next letter is predicted only by how often it appears in the dataset) to 4 (i.e., the previous four letters are used to infer the next one).
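To make the difference between context sizes concrete, here is a small sketch that estimates next-letter probabilities from counts. The four names and the `next_char_probs` helper are hypothetical and exist only for this illustration; they are not part of the repo, and end-of-name handling is left out for brevity.

```python
from collections import Counter

# Hypothetical toy data, standing in for the census names.
NAMES = ["PAULO", "PAULA", "PEDRO", "MARIA"]


def next_char_probs(names, context_size, context):
    """Estimate P(next letter | context) by counting letters that follow `context`."""
    counts = Counter()
    for name in names:
        for i, char in enumerate(name):
            # The context of a letter is the (up to) `context_size` letters before it.
            if name[max(0, i - context_size):i] == context:
                counts[char] += 1
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()} if total else {}


# Context size 0: every letter shares the empty context, so this is plain letter frequency.
print(next_char_probs(NAMES, 0, ""))

# Context size 2: only letters seen right after "AU" are counted.
print(next_char_probs(NAMES, 2, "AU"))  # {'L': 1.0}
```

In this toy data "AU" is always followed by "L", so a model with context size 2 has no freedom there, while the context-size-0 model can pick any letter at any position.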
If you just want to generate a new name, run src/models/predict_model.py with the following options:
Usage: predict_model.py [OPTIONS]

Options:
  -cs, --context_size INTEGER  How much context to use for the language model,
                               The pre-trained models go from 0 to 4
  -p, --prefix TEXT            The beginning of the name to be predicted (OPTIONAL)
  -s, --seed INTEGER           Seed to reproduce experiments (OPTIONAL)
  --help                       Show this message and exit.
Example:
(NameGeneratorBR) renan@DESKTOP-AD25DOI:~/git/NameGeneratorBR$ python src/models/predict_model.py -cs 4 -p pau -s 0
Predicted the name: PAULO
Prefix: PAU
Context Size: 4
Seed: 0
To reproduce the training, run:
make train_model
Pull the image
docker pull renancunha97/name-generator-br
And make new names
renan@DESKTOP-AD25DOI:~$ docker run renancunha97/name-generator-br -cs 4 -p pau -s 0
Predicted the name: PAULO
Prefix: PAU
Context Size: 4
Seed: 0
Distributed under the MIT License. See LICENSE for more information.
Renan Cunha - renancunhafonseca@gmail.com
If you are curious about Language Models and Natural Language Processing in general, I highly recommend Jurafsky's drafts of Speech and Language Processing 3rd edition and his classes.