cross-words
is a python module that allows you to easily create a corpus of documents with parameterized entities.
The main goal of cross-words
is to offer an easy way to create either sentences or stories for use in chat bot training.
As of May 2018, it is mostly designed to be used with Rasa NLU/Core
You can install it with pip:
pip install cross-words
Or directly from github if you want the latest development version
pip install git+https://github.com/data-chirps/cross-words.git
cross-words
is based on a simple yet powerful Domain Specific Language.
When used along with Rasa NLU/Core, it uses 3 concepts:
- intents: the objective of the chatbot's user (e.g. ask to book a restaurant, confirm a chatbot inquiry etc.)
- entities: specific parts of a sentence containing key information (e.g. which restaurant to book, how many people etc.)
- aliases: lists of synonyms that can be used interchangeably
More details are available at Rasa NLU
Given a configuration file (.txt) containing all of the above, cross-words
is able to generate many training sentences/conversations using combinations of sentence parts.
cross-words
configuration files look like this:
Could I have the number of @[subject_filter] ~[owners] in @[geo_filter] @[time_filter]?
@[time_filter]
this month
this year
LTD
life to date
up to date
since release
since launch
since beginning of fiscal year
@[geo_filter]
France
Germany
US
United States
America
Canada
Italy
@[subject_filter]
birds
parrots
owl
dogs
cats
persian
~[owners]
owners
possessors
If asked for sentences, cross-words
will generate a .md file whose first lines will be :
- Could I have the number of [birds](subject_filter) possessors in [Canada](geo_filter) [life to date](time_filter)?
- Could I have the number of [parrots](subject_filter) possessors in [United States](geo_filter) [since release](time_filter)?
- Could I have the number of [owl](subject_filter) possessors in [Italy](geo_filter) [up to date](time_filter)?
- Could I have the number of [owl](subject_filter) possessors in [Italy](geo_filter) [since release](time_filter)?
- Could I have the number of [dogs](subject_filter) owners in [United States](geo_filter) [LTD](time_filter)?
- Could I have the number of [dogs](subject_filter) owners in [Canada](geo_filter) [this year](time_filter)?
- Could I have the number of [cats](subject_filter) owners in [France](geo_filter) [this year](time_filter)?
- Could I have the number of [cats](subject_filter) owners in [US](geo_filter) [since release](time_filter)?
- Could I have the number of [cats](subject_filter) owners in [America](geo_filter) [this month](time_filter)?
- Could I have the number of [cats](subject_filter) owners in [Canada](geo_filter) [life to date](time_filter)?
This file is then ready to use as training input to Rasa NLU.
If asked for stories:
## Genereated Story 815310784239368
* acquisition{}
- utter_ask_time_filter
* acquisition{"time_filter": "since beginning of fiscal year"}
- slot{"time_filter": "since beginning of fiscal year"}
- utter_ask_geo_filter
* acquisition{"geo_filter": "America"}
- slot{"geo_filter": "America"}
- utter_ask_subject_filter
* acquisition{"subject_filter": "dogs"}
- slot{"subject_filter": "dogs"}
- action_acquisition
## Genereated Story 257661587723758
* acquisition{"time_filter": "since release", "geo_filter": "Germany"}
- slot{"time_filter": "since release"}
- slot{"geo_filter": "Germany"}
- utter_ask_subject_filter
* acquisition{"subject_filter": "owl"}
- slot{"subject_filter": "owl"}
- action_acquisition
## Genereated Story 877699493192194
* acquisition{"subject_filter": "parrots"}
- slot{"subject_filter": "parrots"}
- utter_ask_time_filter
* acquisition{"time_filter": "LTD"}
- slot{"time_filter": "LTD"}
- utter_ask_geo_filter
* acquisition{"geo_filter": "France"}
- slot{"geo_filter": "France"}
- action_acquisition
This file is then ready to use for training with Rasa Core.
cross-words
mainly comes with 2 functions: parse_input and generate. All other functions are implementation details.
generate(input_path, output_path="./xwords/outputs/", intent_string=None, output_prefix='', training_ratio=1.0, for_story=False, n_sub=None)
This is the main function of `cross-words'.
Given an input configuration file, it outputs all combinations of intents x entities x aliases into a .md file ready for training.
A few arguments allow to tune its behavior:
- input_path: path to the configuration file (string)
- output_path: path to the output folder where train/test files will be written (string)
- intent_string string to specify intent at the beginning of sentence files (for Rasa NLU) or inside genereated stories (for Rasa Core) (string)
- output_prefix string to specify beginning of names of files that are written (string)
- training_ratio: ratio between train and test sets. If .7, 30% of all generated combinations will be reserved into a test file. If 1.0, no test file will be created. (float)
- for_story: whether to generate sentences (for Rasa NLU) or stories (for Rasa Core) (bool)
- n_sub: number of sentences/stories (incl. test) to be taken as a subsample of all possible combinations of intents x entities x aliases (int) (required when generating stories for Rasa Core)
This function is provided as a facilitator for experimentation purposes. It is the first function called by generate.
Given an input configuration file, generates:
- a list of intents in the form
['intent_sentence_0', 'intent_sentence_1', ...]
e.g. from above:
['Could I have the number of @[subject_filter] ~[owners] in @[geo_filter] @[time_filter]?']
- a dictionnary of entitites in the form
{'entity_0': ['alternative_00', 'alternative_01', ...],
'entity_1': ['alternative_10', 'alternative_11', ...], ...}
e.g. from above:
{'time_filter': ['this month', 'this year', ...],
'geo_filter': ['France', 'Germany', ...], ...}
- a dictionnary of synonyms in the form
{'alias_0': ['alternative_00', 'alternative_01', ...],
'alias_1': ['alternative_10', 'alternative_11', ...], ...}
e.g. from above:
{'owners': ['owners', 'possessors']}
cross-words
is designed to compute sentences by placing all entities and alias alternative into all intents.
As a rule of thumb, the overall maximum number of generated sentences is in the order of:
nbintent sentences × avg. nbentity placeholders per intent sentence × avg. nbalternatives per entity × avg. nbalias placeholders per intent sentence × avg. nbalternatives per alias
As such, the created training files grow exponentially, hence the available n_sub parameter in generate
In the specific case of stories (Rasa Core), cross-words
will also use information availability as an additional combination dimension.
For example, the two stories below are based on a different initially available information set given by the user:
## Genereated Story 257661587723758
* acquisition{"time_filter": "since release", "geo_filter": "Germany"}
- slot{"time_filter": "since release"}
- slot{"geo_filter": "Germany"}
- utter_ask_subject_filter
* acquisition{"subject_filter": "owl"}
- slot{"subject_filter": "owl"}
- action_acquisition
## Genereated Story 877699493192194
* acquisition{"time_filter": "since release"}
- slot{"time_filter": "since release"}
- utter_ask_subject_filter
* acquisition{"subject_filter": "owl"}
- slot{"subject_filter": "owl"}
- utter_ask_geo_filter
* acquisition{"geo_filter": "Germany"}
- slot{"geo_filter": "Germany"}
- action_acquisition