The Zamia Brain project provides infrastructure useful to create natural language processing systems based on transformer networks (see https://arxiv.org/abs/1706.03762).
This project is still highly experimental, everything is subject to change without prior notice. The current approach is to generate training corpora for pre-training as well as (multi-)domain refinement. The goal is to train networks that are very robust (i.e. avoid brittleness present in traditional rule-based systems) in their natural language processing capabilities (pretraining) while allowing for a certain amount of control of their behavior (refinement).
For this, you will find these components:
-
scripts to generate pre-training corpora, typically using web scraping techniques as well as scripts that adapt scientific copora for training (https://github.com/gooofy/zbrain)
-
scripts that generate corpora from patterns ("skills") for refinement (https://github.com/gooofy/zbrain)
-
A GPT-2 implementation along with tokenization, training and inference tools (https://github.com/gooofy/transformer-lm)
-
A TransformerXL implementation along with tokenization, training and inference tools (https://github.com/gooofy/transformer-xl)
-
Pre-trained models (https://goofy.zamia.org/zamia-speech/brain/)
You will need to provide a list of accounts as input:
./twitterscrape.py -l de -s 2019-08-01 twitter_de_201908 -U user_stats_de.json twitter_sources_de.txt
This will work for GPT-2 as well as TransformerXL:
./qa_export_transformer-lm.py -o base_de heise parole twitter_de_2010 twitter_de_201907 wikipedia_de
Next, encode corpus the corpus using a sentencepiece tokenization model and run the pretraining.
-q option is important here to include dialog samples
./qa_export_transformer-lm.py -q -o qa_de twitter_de_2010 twitter_de_201907 skill_personal_de
Next, encode corpus the corpus using a sentencepiece tokenization model and continue the training for finetuning.
This is just a sketch of what a full QA chat bot with associative memory could look like:
+------------------+ +------------------------------+ | Dialog Ctx 1 | | Knowledge Base | | Dialog Ctx 2 | +--------------------+ +------------------------------+ | ... |---------> | DeepNN_Context_Vec | | 0.1 0.2 0.01 ... | KB Line 1 | | Dialog Ctx n | +--------------------+ | ... | | +------------------+ | | 0.33 0.1 0.5 ... | KB Line m | | Current Input | | +------------------------------+ +------------------+ | | | | | | | +----------------------------------------+ | | | | | | | v v | +---------------------------+ | | Nearest Neighbour search | | +---------------------------+ | | | | | v | +--------------------------------+ | | KB Context Lines | | +--------------------------------+ | | 0.4 0.2 0.9 ... Info line 1 | | | ... | | | 0.8 0.2 0.4 ... Info line k | | +--------------------------------+ | | | | | +----------------------------+ | | | | v v +-----------------------+ | Info Line 1 | | ... | | Info Line k | +-----------------------+ | Ctx Line 1 | | ... | | Ctx Line n | +-----------------------+ | Current Input | +-----------------------+ | | v +-------------+ | DeepNN_QA | +-------------+ | | +-------------+ | Response | +-------------+ [ Knowledge + Dialog History + Current Input ] -> [ Response ]
<pre> DS_i → data/qa_src/DS_i/#.json \ DS_j → data/qa_src/DS_j/.json | . \ data/qa_enc/train/.json . / data/qa_enc/val/.json . | DS_n → data/qa_src/DS_n/##.json / </pre>
-
TWITTER
-
personachat
-
slashdot
-
reddit
-
SQuAD 2.0
-
CoQA
-
SQuAD 1.1
-
MC Test
-
DeepMind CNN/DM http://www.github.com/deepmind/rc-data/
-
MS MARCO http://www.msmarco.org/
-
TriviaQA
-
NewsQA
-
NarrativeQA https://github.com/deepmind/narrativeqa
-
HotpotQA
-
natural_questions
-
QuestionBank
-
WebQuestions
-
wordrobe20140627.csv.gz
-
YahooAnswers
-
CommonsenseQA
-
ComplexWebQuestions
-
bAbI
-
Zamia AI
-
74M AIML bots
-
142M chat_corpus https://github.com/Marsan-Ma-zz/chat_corpus https://github.com/Marsan-Ma/twitter_scraper
34M open subtitles 21M twitter_en * 41M cornell_movie_dialogs_corpus * 33M cornell_movie_quotes_corpus.zip * 0.2M Microsoft Research Social Media Conversation Corpus * 4.3M swb1_dialogact_annot.tar.gz * 7800M The Ubuntu Dialogue Corpus v1.0 * NPS Chat Corpus (NLTK) * Internet archive Twitter stream https://archive.org/search.php?query=collection%3Atwitterstream&sort=-publicdate&page=2 * 58M chatterbot-logs
-
WikiData
-
conceptnet5
-
framenet_v15
-
HappyDB
-
linkedgeodata
-
nell
-
opencyc
-
SemLink
-
SUMO
-
UMBEL
-
weather
-
wordnet
-
scalable:
-
XL-Net
-
Transformer XL
-
OpenAI GPT-2
-
How to build a State-of-the-Art Conversational AI with Transfer Learning https://medium.com/huggingface/how-to-build-a-state-of-the-art-conversational-ai-with-transfer-learning-2d818ac26313
-
python newspaper article extractor https://github.com/codelucas/newspaper
-
OpenWebText https://github.com/jcpeterson/openwebtext https://pushshift.io/ http://files.pushshift.io/reddit/
-
How does TF’s universal sentence encoder work?
-
Transformer https://arxiv.org/pdf/1706.03762.pdf https://www.tensorflow.org/alpha/tutorials/sequences/transformer