Zamia Brain

The Zamia Brain project provides infrastructure useful to create natural language processing systems based on transformer networks (see https://arxiv.org/abs/1706.03762).

This project is still highly experimental, everything is subject to change without prior notice. The current approach is to generate training corpora for pre-training as well as (multi-)domain refinement. The goal is to train networks that are very robust (i.e. avoid brittleness present in traditional rule-based systems) in their natural language processing capabilities (pretraining) while allowing for a certain amount of control of their behavior (refinement).

For this, you will find these components:

scripts to generate pre-training corpora, typically using web scraping techniques as well as scripts that adapt scientific copora for training (https://github.com/gooofy/zbrain)
scripts that generate corpora from patterns ("skills") for refinement (https://github.com/gooofy/zbrain)
A GPT-2 implementation along with tokenization, training and inference tools (https://github.com/gooofy/transformer-lm)
A TransformerXL implementation along with tokenization, training and inference tools (https://github.com/gooofy/transformer-xl)
Pre-trained models (https://goofy.zamia.org/zamia-speech/brain/)

Corpora

Twitter

You will need to provide a list of accounts as input:

./twitterscrape.py -l de -s 2019-08-01 twitter_de_201908 -U user_stats_de.json twitter_sources_de.txt

Heise

Important: adapt hard-coded paths firs!

./qa_extract_heise.py

Parole

Important: adapt hard-coded paths firs!

./qa_extract_parole.py

Wikipedia

Important: adapt hard-coded paths firs!

./qa_extract_wikipedia.py -l de

Export for pretraining

This will work for GPT-2 as well as TransformerXL:

./qa_export_transformer-lm.py -o base_de heise parole twitter_de_2010 twitter_de_201907 wikipedia_de

Next, encode corpus the corpus using a sentencepiece tokenization model and run the pretraining.

Extract skills from Zamia AI

qa_extract_skills.py -l de skill_personal_de personal.xml

QA finetuning

-q option is important here to include dialog samples

./qa_export_transformer-lm.py -q -o qa_de twitter_de_2010 twitter_de_201907 skill_personal_de

Next, encode corpus the corpus using a sentencepiece tokenization model and continue the training for finetuning.

TODO: KB encoding / indexing / lookup

FIXME

Architecture

This is just a sketch of what a full QA chat bot with associative memory could look like:

                        +------------------+                                                        +------------------------------+
                        | Dialog Ctx     1 |                                                        |       Knowledge Base         |
                        | Dialog Ctx     2 |           +--------------------+                       +------------------------------+
                        | ...              |---------> | DeepNN_Context_Vec |                       | 0.1 0.2 0.01 ... | KB Line 1 |
                        | Dialog Ctx     n |           +--------------------+                       | ...              |           |
                        +------------------+                     |                                  | 0.33 0.1 0.5 ... | KB Line m |
                        | Current Input    |                     |                                  +------------------------------+
                        +------------------+                     |                                                 |
                                 |                               |                                                 |
                                 |                               |        +----------------------------------------+
                                 |                               |        |
                                 |                               |        |
                                 |                               v        v
                                 |                     +---------------------------+
                                 |                     | Nearest Neighbour search  |
                                 |                     +---------------------------+
                                 |                                   |
                                 |                                   |
                                 |                                   v
                                 |                  +--------------------------------+
                                 |                  |         KB Context Lines       |
                                 |                  +--------------------------------+
                                 |                  | 0.4 0.2 0.9 ... Info    line 1 |
                                 |                  | ...                            |
                                 |                  | 0.8 0.2 0.4 ... Info    line k |
                                 |                  +--------------------------------+
                                 |                                   |
                                 |                                   |
                                 |      +----------------------------+
                                 |      |
                                 |      |
                                 v      v
                          +-----------------------+
                          | Info    Line 1        |
                          | ...                   |
                          | Info    Line k        |
                          +-----------------------+
                          | Ctx     Line 1        |
                          | ...                   |
                          | Ctx     Line n        |
                          +-----------------------+
                          | Current Input         |
                          +-----------------------+
                                     |
                                     |
                                     v
                              +-------------+
                              |  DeepNN_QA  |
                              +-------------+
                                     |
                                     |
                              +-------------+
                              |   Response  |
                              +-------------+


[ Knowledge + Dialog History + Current Input ] -> [ Response ]

Knowledge Base

Dialog

<pre> DS_i → data/qa_src/DS_i/#.json \ DS_j → data/qa_src/DS_j/.json | . \ data/qa_enc/train/.json . / data/qa_enc/val/.json . | DS_n → data/qa_src/DS_n/##.json / </pre>

Datasets

Dialog Corpora

TWITTER
personachat
slashdot
reddit
SQuAD 2.0
CoQA
SQuAD 1.1
MC Test
DeepMind CNN/DM http://www.github.com/deepmind/rc-data/
MS MARCO http://www.msmarco.org/
TriviaQA
NewsQA
NarrativeQA https://github.com/deepmind/narrativeqa
HotpotQA
natural_questions
QuestionBank
WebQuestions
wordrobe20140627.csv.gz
YahooAnswers
CommonsenseQA
ComplexWebQuestions
bAbI

Chat Corpora

Zamia AI
74M AIML bots

142M chat_corpus https://github.com/Marsan-Ma-zz/chat_corpus https://github.com/Marsan-Ma/twitter_scraper

           34M open subtitles
           21M twitter_en
*   41M    cornell_movie_dialogs_corpus
*   33M    cornell_movie_quotes_corpus.zip
*    0.2M  Microsoft Research Social Media Conversation Corpus
*    4.3M  swb1_dialogact_annot.tar.gz
* 7800M    The Ubuntu Dialogue Corpus v1.0
*          NPS Chat Corpus (NLTK)
*          Internet archive Twitter stream https://archive.org/search.php?query=collection%3Atwitterstream&sort=-publicdate&page=2
*   58M    chatterbot-logs

Knowledge

WikiData
conceptnet5
framenet_v15
HappyDB
linkedgeodata
nell
opencyc
SemLink
SUMO
UMBEL
weather
wordnet

AI Architecture Survey

scalable:
XL-Net
Transformer XL
OpenAI GPT-2
How to build a State-of-the-Art Conversational AI with Transfer Learning https://medium.com/huggingface/how-to-build-a-state-of-the-art-conversational-ai-with-transfer-learning-2d818ac26313
python newspaper article extractor https://github.com/codelucas/newspaper
OpenWebText https://github.com/jcpeterson/openwebtext https://pushshift.io/ http://files.pushshift.io/reddit/
BERT https://arxiv.org/pdf/1901.08634.pdf
How does TF’s universal sentence encoder work?
Transformer https://arxiv.org/pdf/1706.03762.pdf https://www.tensorflow.org/alpha/tutorials/sequences/transformer
SDNet https://arxiv.org/pdf/1812.03593.pdf

bho9668/zbrain

Zamia Brain

Corpora

Twitter

Heise

Parole

Wikipedia

Export for pretraining

Extract skills from Zamia AI

QA finetuning

TODO: KB encoding / indexing / lookup

Architecture

Knowledge Base

Dialog

Datasets

Dialog Corpora

Chat Corpora

Knowledge

AI Architecture Survey