Just a fun project to teach myself a bit about how to train and deploy LLMs. Myrtle is a GPT2 model finetuned on quotes by non-male notable people. Find Myrtle here. She only speaks every 5 seconds.
/----------------------------------------------------------------------------\
| The truth is that we are all flawed, and it takes courage to be ourselves. |
\----------------------------------------------------------------------------/
\
\
\
_
.~O`,
{__, \
\' \
\ \
\ \
\ `._ __.__
\ ~-._ _.==~~ ~~--.._
\ ' ~-.
\ _- -_ `.
\ / } .- . \
`. | / } ( ; \
`| / / ( : '\
\ | / | / \
| /`-.______.\ |~-. \
| |/ ( | `. \_
| || ~\ \ '._ `-.._____..----..___
| |/ _\ \ ~-.__________.-~~~~~~~~~'''
.o'___/ .o______}
This project originated in the CorrelAid Slack workspace. A colleague, Frie, started a thread recommending terminal tools that are useless but spark joy, for example lolcat combined with cowsay. Cowsay draws a cow in your terminal that says whatever text you give it. From that I got the idea of randomizing what the cow says and found fortune, a program that prints random quotes from a local database.
However, when trying fortune, I found some quotes to be pretty weird and even sexist. People have observed this before (read this thread). My next idea was to use an API to retrieve a random quote by a famous person from the internet. While working on a bash script to combine this API with cowsay, I researched cowsay flags and cow alternatives and again found some weird stuff (take a look yourself if you are interested). That's why I decided to use a wholesome, child-friendly Python package called dinosay instead. The result was a bash script that uses a random quote from the internet as input for dinosay.
Not satisfied with this, I thought it would be funny to train an LLM to generate quotes that sound realistic. I wanted to learn how to do this anyway, so that's what I did. As training data, I used a large dataset of quotes originally found on GitHub and built for a paper (see below). As there are enough quotes by men circulating on the internet and LLMs tend to be gender biased, I tried to only use quotes by non-male people (I know that it's not that easy). I also like to imagine Myrtle as a wise old female dinosaur. As you can't detect a person's gender from their name, I generated a list of names of non-male notable people and only used quotes by authors with names in that list. I used this data to finetune a GPT2 model, because it's small enough to not cost that much to run (no GPU necessary for inference) but good enough to sound realistic. Processing and training was done remotely with modal.
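The name-based filtering described above can be sketched roughly like this. It is a minimal illustration, not the project's actual code: the record layout (dicts with `author` and `quote` keys), the function names, and the normalization rule are all assumptions made for the example.

```python
def normalize(name):
    """Lowercase a name and collapse whitespace so lookups are less brittle."""
    return " ".join(name.casefold().split())


def filter_quotes(quotes, allowed_names):
    """Keep only quotes whose author appears in the allowed-name list.

    `quotes` is assumed to be a list of dicts with "author" and "quote" keys
    (a hypothetical layout); `allowed_names` is the list of notable people
    whose quotes should be kept.
    """
    allowed = {normalize(name) for name in allowed_names}
    return [q for q in quotes if normalize(q["author"]) in allowed]


if __name__ == "__main__":
    # Placeholder records, not real data.
    quotes = [
        {"author": "Author A", "quote": "placeholder quote one"},
        {"author": "Author B", "quote": "placeholder quote two"},
        {"author": "  author a ", "quote": "placeholder quote three"},
    ]
    kept = filter_quotes(quotes, ["Author A"])
    print(len(kept), "of", len(quotes), "quotes kept")
```

Exact matching on normalized names is deliberately strict: it drops quotes whose author is spelled differently in the two datasets, which loses some training data but avoids pulling in quotes from people who are not on the list.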
This worked relatively well, and I was surprised and had to laugh at how pseudo-wise some quotes sound. Try it yourself :)
I wrote a simple bash script to bring Myrtle's wisdom to your terminal.
As Myrtle is a finetuned GPT2 model, it is similarly biased. Information about this here. Please bear in mind that I am not the one writing the quotes. While I tried to moderate the model output somewhat using various tools, including a list of forbidden words, I can't guarantee non-offensive text output.
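A forbidden-word check like the one mentioned above could look something like the following. This is only a sketch under assumptions: the function names, the whole-word matching rule, and the idea of picking the first clean candidate out of several generated quotes are illustrative, and the other moderation tools are not shown.

```python
import re


def contains_forbidden(text, forbidden_words):
    """Return True if any forbidden word occurs as a whole word in `text`."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return any(word.lower() in tokens for word in forbidden_words)


def first_clean(candidates, forbidden_words):
    """Return the first candidate without forbidden words, or None if all fail."""
    for candidate in candidates:
        if not contains_forbidden(candidate, forbidden_words):
            return candidate
    return None


if __name__ == "__main__":
    # Placeholder word list and candidates, purely for illustration.
    forbidden = ["badword"]
    candidates = ["This has a BadWord in it.", "This one is fine."]
    print(first_clean(candidates, forbidden))
```

Matching whole words rather than substrings avoids rejecting harmless words that merely contain a forbidden one, at the cost of missing creative spellings, which is one reason a word list alone can't guarantee inoffensive output.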
- Laouenan, M., Bhargava, P., Eyméoud, J.B., Gergaud, O., Plique, G., & Wasmer, E. (2022). A cross-verified database of notable people, 3500BC-2018AD. Scientific Data, 9(1), 290.
- Goel, S., Madhok, R., & Garg, S. (2018). Proposing Contextually Relevant Quotes for Images. Advances in Information Retrieval. Springer. doi: 10.1007/978-3-319-76941-7_49