
Data-Driven Characters

Generate character chatbots from existing corpora with LangChain. [Blog]

TLDR: This repo enables you to create data-driven characters in three steps:

  1. Upload a corpus
  2. Name a character
  3. Enjoy

About

The purpose of data-driven-characters is to serve as a minimal hackable starting point for creating your own data-driven character chatbots. It provides a simple library built on top of LangChain for processing any text corpus, creating character definitions, and managing memory, with various examples and interfaces that make it easy to spin up and debug your own character chatbots.

Features

This repo provides three ways to interact with your data-driven characters:

  1. Export to character.ai
  2. Debug locally in the command line or with a Streamlit interface
  3. Host a self-contained Streamlit app in the browser

Example chatbot architectures provided in this repo include:

  1. character summary
  2. retrieval over transcript
  3. retrieval over summarized transcript
  4. character summary + retrieval over transcript
  5. character summary + retrieval over summarized transcript
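The retrieval-based architectures above boil down to nearest-neighbor lookup over transcript chunks. Here is a minimal, dependency-free sketch of that idea, using word-overlap cosine similarity as a stand-in for real embeddings (the function names are illustrative, not this repo's API):

```python
from collections import Counter
import math

def similarity(a: str, b: str) -> float:
    """Cosine similarity over bag-of-words counts (a stand-in for embeddings)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k transcript chunks most similar to the query."""
    return sorted(chunks, key=lambda c: similarity(query, c), reverse=True)[:k]

chunks = [
    "Evelyn learns to verse jump between universes.",
    "Waymond serves divorce papers at the laundromat.",
    "Jobu Tupaki builds the everything bagel.",
]
print(retrieve("How does Evelyn verse jump?", chunks, k=1))
# → ['Evelyn learns to verse jump between universes.']
```

A real chatbot would replace similarity with vector-store embeddings and feed the retrieved chunks into the LLM prompt alongside the character summary.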

Export to character.ai

  1. Put the corpus into a single .txt file inside the data/ directory.
  2. Run either generate_single_character.ipynb to generate the definition of a specific character, or generate_multiple_characters.ipynb to generate the definitions of multiple characters.
  3. Export character definitions to character.ai to create a character or create a room and enjoy!

Example

Here is how to generate the description of "Evelyn" from the movie Everything Everywhere All At Once (2022).

from dataclasses import asdict
import json

from data_driven_characters.character import generate_character_definition
from data_driven_characters.corpus import generate_corpus_summaries, load_docs

# copy the transcript into this text file
CORPUS = 'data/everything_everywhere_all_at_once.txt'

# the name of the character we want to generate a description for
CHARACTER_NAME = "Evelyn"

# split corpus into a set of chunks
docs = load_docs(corpus_path=CORPUS, chunk_size=2048, chunk_overlap=64)

# generate character.ai character definition
character_definition = generate_character_definition(
    name=CHARACTER_NAME,
    corpus_summaries=generate_corpus_summaries(docs=docs))

print(json.dumps(asdict(character_definition), indent=4))

gives

{
    "name": "Evelyn",
    "short_description": "I'm Evelyn, a Verse Jumper exploring universes.",
    "long_description": "I'm Evelyn, able to Verse Jump, linking my consciousness to other versions of me in different universes. This unique ability has led to strange events, like becoming a Kung Fu master and confessing love. Verse Jumping cracks my mind, risking my grip on reality. I'm in a group saving the multiverse from a great evil, Jobu Tupaki. Amidst chaos, I've learned the value of kindness and embracing life's messiness.",
    "greeting": "Hey there, nice to meet you! I'm Evelyn, and I'm always up for an adventure. Let's see what we can discover together!"
}

Now you can chat with Evelyn on character.ai.
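The chunk_size and chunk_overlap arguments to load_docs above control how the corpus is split: overlapping windows ensure that context straddling a chunk boundary is not lost. A simplified character-based sketch of the idea (the actual load_docs uses LangChain's text splitters):

```python
def split_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters, where each window
    begins chunk_overlap characters before the previous one ends."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "abcdefghij" * 10  # a 100-character toy corpus
chunks = split_text(text, chunk_size=30, chunk_overlap=5)
print(len(chunks))                      # → 4
print(chunks[0][-5:] == chunks[1][:5])  # → True (shared overlap)
```

With the values in the example (chunk_size=2048, chunk_overlap=64), adjacent chunks share 64 characters, so a sentence cut at a boundary still appears whole in one of them.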

Creating your own chatbots

Beyond generating character.ai character definitions, this repo gives you tools to easily create, debug, and run your own chatbots grounded in your own corpora.

Why create your own chatbot?

If you are primarily interested in accessibility and open-ended entertainment, character.ai is the better choice. But if you want more control over the design of your chatbots, such as how they use memory, how they are initialized, and how they respond, data-driven-characters may be the better option.

Compare the conversation with the Evelyn chatbot on character.ai with our own Evelyn chatbot designed with data-driven-characters. The character.ai Evelyn appears to simply latch onto the local concepts present in the conversation, without bringing in new information from its backstory. In contrast, our Evelyn chatbot stays in character and grounds its dialogue in real events from the transcript.

Features

This repo implements the following tools for packaging information for your character chatbots:

  1. character summary
  2. retrieval over the transcript
  3. retrieval over a summarized version of the transcript

To summarize the transcript, one has the option to use LangChain's map_reduce or refine chains. Generated transcript summaries and character definitions are cached in the output/<corpus> directory.
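Because summarizing a long transcript requires many LLM calls, caching the results is worthwhile. A sketch of the cache-on-disk pattern described above, assuming a hypothetical summarize callable and an illustrative file layout (see the repo's code for the exact paths):

```python
import json
from pathlib import Path

def cached_summaries(corpus_name, summarize, output_dir="output"):
    """Load corpus summaries from <output_dir>/<corpus_name>/ if cached;
    otherwise compute them with `summarize` and write them to the cache."""
    cache = Path(output_dir) / corpus_name / "summaries.json"
    if cache.exists():
        return json.loads(cache.read_text())
    summaries = summarize()  # expensive: runs the LLM summarization chain
    cache.parent.mkdir(parents=True, exist_ok=True)
    cache.write_text(json.dumps(summaries))
    return summaries
```

On a second call with the same corpus_name, summarize is never invoked; the summaries come straight from disk, which is what makes generating multiple characters from the same corpus cheap.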

Debug locally

Command Line Interface

Example command:

python chat.py --corpus data/everything_everywhere_all_at_once.txt --character_name Evelyn --chatbot_type retrieval --retrieval_docs raw

Streamlit Interface

Example command:

python -m streamlit run chat.py -- --corpus data/everything_everywhere_all_at_once.txt --character_name Evelyn --chatbot_type retrieval --retrieval_docs summarized --interface streamlit

This produces a UI based on the official Streamlit chatbot example. It uses the map_reduce summarization chain to generate corpus summaries by default.

Host on Streamlit

Run the following command:

python -m streamlit run app.py

This will launch the app in your browser.

Interact with the hosted app here.

Installation

To install the data_driven_character_chat package, you need to clone the repository and install the dependencies.

You can clone the repository using the following command:

git clone https://github.com/mbchang/data-driven-characters.git

Then, navigate into the cloned directory:

cd data-driven-characters

Install the package and its dependencies with:

pip install -e .

Store your OpenAI API key as an environment variable, for example by adding the following line to your .bashrc or .zshrc:

export OPENAI_API_KEY=<your_openai_api_key>

Data

The examples in this repo are movie transcripts taken from Scraps from the Loft. However, any text corpus can be used, including books and interviews.

Character.ai characters that have been generated with this repo:

Contributing

Contribute your characters with a pull request by adding a link to the character above, along with a link to the text corpus you used to generate it.

Other pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Roadmap

General points for improvement:

  • better prompt engineering for embodying the speaking style of the character
  • new summarization techniques
  • more customizable UI than what Streamlit provides

Concrete features to add:

  • Add the option to summarize the raw corpus from the character's perspective. This would be more expensive, because corpus summaries could not be reused for other characters, but it could make the character's personality more realistic.
  • recursive summarization
  • calculate token expenses

Known issues:

  • In the hosted app, clicking "Rerun" does not reset the conversation. Streamlit re-executes the entire app script (in this case app.py) from top to bottom every time a user interacts with the app, which means we need st.session_state to cache previous messages in the conversation. However, st.session_state also persists when the user clicks "Rerun". To reset the conversation, click the "Reset" button instead.
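The rerun behavior is easier to see with a toy model: every interaction re-executes the script from the top, while a persistent dict (playing the role of st.session_state) survives across runs, so "Rerun" alone cannot clear it. A pure-Python sketch, not actual Streamlit code:

```python
session_state = {}  # persists across reruns, like st.session_state

def run_app():
    # In Streamlit, this whole script body re-executes on every interaction.
    session_state.setdefault("messages", [])
    return session_state["messages"]

def on_user_message(text):
    run_app()
    session_state["messages"].append(text)

def on_reset():
    # What the "Reset" button does: explicitly clear the cached state.
    session_state.clear()

on_user_message("hello")
run_app()  # a "Rerun": the script re-executes, but the messages persist
print(session_state["messages"])  # → ['hello']
on_reset()
print(run_app())  # → []
```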

License

MIT