KnowledgeGraphGenerator

Use Kore.ai's Knowledge Graph Generator to automatically extract terms from FAQs, define the hierarchy between these terms, and also associate the FAQs to the right terms.

Overview
Prerequisites
Configuration Steps
Run KnowledgeGraph Generator
Command Options
Option: language
Option: type
Output details
Using Synonym Generator
Graph analyzer
Troubleshooting

Overview

Kore.ai KnowledgeGraph Generator reduces the effort of building an ontology in the Knowledge Collection section by automating the process.

Output generated by this engine can be imported directly into KnowledgeCollection, and you can use the FAQs after training the KnowledgeCollection. However, synonyms must be added manually if you want them, since the engine does not support this yet.

If you have customized your stopwords, the engine uses only those stopwords when generating the graph, provided that you pass the JSON or CSV export directly. If the input is a CSV file in the extraction format, or you have not modified the stopword collection, the engine uses its predefined set of stopwords.

Prerequisites

Configuration Steps

Configuring KnowledgeGraph Generator involves the following major steps:

  • Step 1: Download the KnowledgeGraphGenerator from GitHub: The repository is available at https://github.com/Koredotcom/KnowledgeGraphGenerator

  • Step 2: Activate virtual environment: Execute the following command, adjusting the path as required, to activate the virtual environment
    source virtual_environments_folder_location/virtualenv_name/bin/activate
    Once the virtual environment is activated, its name appears at the start of every prompt in the console, for example:
    [Image: console prompt showing the virtual environment name]

  • Step 3: Install requirements for the project: Run the following command from your project root directory (KnowledgeGraphGenerator) to install the requirements
    pip install -r requirements.txt
    To verify that all the requirements are installed, run
    pip list

    Note -

    For Windows Operating System -

    1. Windows 10 users should install the Windows 10 SDK, which is available for download from Microsoft.
    2. The operating system should be up to date for seamless installation of the requirements. Some libraries, such as scipy (an internal dependency), need specific DLLs that are available only in the latest updates. Skipping the update may lead to a lot of troubleshooting.


We verified installation with build version 1903.

  • Step 4: Download the spaCy English model: Run the following command to download the model
    python -m spacy download en
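
The verification in Step 3 can also be scripted. The following is a minimal sketch, not part of the project: it assumes Python 3.8 or later (for importlib.metadata) and a requirements.txt made of simple name-and-version lines. The function names are hypothetical.

```python
import re
from importlib import metadata  # standard library in Python 3.8+


def requirement_names(requirements_text):
    """Return the package names listed in the text of a requirements file."""
    names = []
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blank lines and pip options
            continue
        # Keep only the name: cut at the first version, extras, or marker character.
        names.append(re.split(r"[<>=!~;\[\s]", line, maxsplit=1)[0])
    return names


def missing_packages(names):
    """Return the names that are not installed in the active environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


def check_requirements(path="requirements.txt"):
    """Print and return the packages from the requirements file that are missing."""
    with open(path, encoding="utf-8") as handle:
        missing = missing_packages(requirement_names(handle.read()))
    print("Missing packages:", ", ".join(missing) if missing else "none")
    return missing
```

Calling check_requirements() from the project root, with the virtual environment active, lists anything that pip install did not provide.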

Run KnowledgeGraph Generator

Ubuntu

python KnowledgeGraphGenerator.py --file_path 'INPUT_FILE_PATH' --type 'INPUT_TYPE' --language 'LANGUAGE_CODE' --v true

Windows

python KnowledgeGraphGenerator.py --file_path INPUT_FILE_PATH --type INPUT_TYPE --language LANGUAGE_CODE --v true
Note: do not put quotes around command arguments on Windows.

The command that generates the KnowledgeGraph takes options that must be passed when executing it. The options are listed below.

Command Options

Note: Options listed as mandatory must be given with the command. For options listed as optional, the default value shown is used when the option is omitted.

Option name | Description | Mandatory / Optional | Default value
--file_path | Input file location | Mandatory |
--language | The language code for the language of the input data | Optional | en (English)
--type | The type of input file: json_export, csv, or csv_export | Mandatory |
--v | Run the command in verbose mode to see intermediate progress steps | Optional | false
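
When the generator is launched from a script, the options can be assembled as an argument list, which avoids the quoting difference between Ubuntu and Windows. This is a minimal sketch under that assumption: build_command and run_generator are hypothetical helper names, and only the flags documented above are used.

```python
import subprocess
import sys

# Input types documented for the --type option.
INPUT_TYPES = ("json_export", "csv", "csv_export")


def build_command(file_path, input_type, language="en", verbose=False):
    """Build the argument list for KnowledgeGraphGenerator.py."""
    if input_type not in INPUT_TYPES:
        raise ValueError("type must be one of: " + ", ".join(INPUT_TYPES))
    command = [
        sys.executable, "KnowledgeGraphGenerator.py",
        "--file_path", file_path,
        "--type", input_type,
        "--language", language,
    ]
    if verbose:
        command += ["--v", "true"]  # omitted otherwise, so the default applies
    return command


def run_generator(file_path, input_type, language="en", verbose=False):
    """Run the generator from the project root; raises if it exits with an error."""
    subprocess.run(build_command(file_path, input_type, language, verbose), check=True)
```

For example, run_generator("faqs.csv", "csv", verbose=True) corresponds to the commands shown earlier, where faqs.csv is a placeholder file name.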

Option: language

The following languages are currently supported; others will be added incrementally. Create an issue if you need a language on priority.

Language Language Code
English en

Option: type

Type specifies the type of input file. Currently only three formats are supported and those are listed below:

  1. json_export -

    You can generate input in this format from the Kore.ai bot builder tool by exporting the KnowledgeCollection and selecting JSON as the export format

  2. csv_export -

    You can generate input in this format from the Kore.ai bot builder tool by exporting the KnowledgeCollection and selecting CSV as the export format

  3. csv -

    This format supports input from KnowledgeExtraction. To build an input file in this format, put all the questions in the first column and their respective answers in the second column, and save the file as .csv
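
An input file for the csv type can be produced with Python's csv module. The sketch below assumes no header row, since none is mentioned above. The function names and the sample question are illustrative only.

```python
import csv
import io


def faq_csv_text(faqs):
    """Return CSV text with each question in column 1 and its answer in column 2."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for question, answer in faqs:
        writer.writerow([question, answer])  # quoting of commas is handled by csv
    return buffer.getvalue()


def write_faq_csv(path, faqs):
    """Write the (question, answer) pairs to a .csv file at the given path."""
    # newline="" keeps the line endings produced by the csv module unchanged.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        handle.write(faq_csv_text(faqs))
```

For example, write_faq_csv("faqs.csv", [("How do I reset my password?", "Use the reset link on the login page.")]) creates a file that can be passed with --type csv.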

Output details

The generated output JSON file is saved in the project root directory as ao_output.json

The output JSON file can be imported directly into the bot's KnowledgeCollection in JSON format

Using Synonym Generator

The synonym generator is an add-on tool that helps the bot developer derive synonyms for the nodes in the KG. Using it involves the following basic steps:

  • Step 1: Run KG generator and create an ontology for the given questions.
  • Step 2: Run synonym generator, giving this ontology as input.
  • Step 3: Take the synonyms file that is generated, edit it as required, and re-run KG generator with it to create the final ontology.

The synonym generator has the following modes of operation:

  • Using the answers from the knowledge graph to generate synonyms.
  • Using a given PDF document or ZIP of PDF documents to generate synonyms.
  • Using a pre-trained word2vec model to generate synonyms.

If there are a substantial number of voluminous answers in the KG, the first option will give a closed-domain set of synonyms. In the event that the KG is smaller or does not have enough content, one can use a collection of PDF documents to provide the corpus. The third option gives a way to generate open-domain synonyms as it can use any pretrained word2vec model.

Setting up Synonym Generator

  • Step 1: Download the GoogleNews model from https://github.com/mmihaltz/word2vec-GoogleNews-vectors. Alternatively, any other word2vec model can also be used.
  • Step 2: Change to the synonym_generator folder.
  • Step 3: Run synonym generator using the following command: python synonym_generator.py --file_path 'INPUT_FILE_PATH' --training_data_path 'TRAINING_FILE_PATH' --training_data_type 'INPUT_TYPE'
  • Step 4: The output is saved to a file called generated_synonyms.csv in that directory itself.

These parameters take the following values:

Option name | Description | Mandatory / Optional | Default value
--file_path | Input file location | Mandatory |
--training_data_path | The path to the training data or a pretrained word2vec model. This can be a PDF file, a ZIP containing PDFs, or the path to a pretrained model. | Optional | None
--training_data_type | The type of training file: pdf, zip, or pretrained | Optional | pdf

Example Usage

The following is an example of how the synonym generator is to be used:

python synonym_generator.py --file_path ao_output.json

Using generated synonym file with KG generator

Use the following command to run the KnowledgeGraph generator with the generated synonyms file.

python KnowledgeGraphGenerator.py --file_path 'INPUT_FILE_PATH' --type 'INPUT_TYPE' --language 'LANGUAGE_CODE' --v true --synonyms_file_path 'path_to_synonyms_file/synonyms_file.csv'

The generated graph export will include, under its synonyms section, both the graph-level synonyms from the input file (if present) and the synonyms from the synonyms file.
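
The three-step workflow described in this section can be sketched as a list of commands. The file locations are assumptions drawn from the text: the ontology is taken to be ao_output.json in the project root, and the synonyms file is taken to be generated_synonyms.csv inside the synonym_generator folder. The helper names are hypothetical.

```python
import os
import subprocess
import sys


def workflow_commands(input_file, input_type, language="en"):
    """Return (working_directory, argv) pairs for the three-step workflow."""
    kg_base = [
        sys.executable, "KnowledgeGraphGenerator.py",
        "--file_path", input_file,
        "--type", input_type,
        "--language", language,
    ]
    # Assumed location of the synonym generator's output, relative to the root.
    synonyms = os.path.join("synonym_generator", "generated_synonyms.csv")
    return [
        # Step 1: build the initial ontology in the project root.
        (".", kg_base),
        # Step 2: derive synonyms from the answers in that ontology.
        ("synonym_generator",
         [sys.executable, "synonym_generator.py",
          "--file_path", os.path.join("..", "ao_output.json")]),
        # Step 3: after editing the synonyms file, rebuild the ontology with it.
        (".", kg_base + ["--synonyms_file_path", synonyms]),
    ]


def run_step(step):
    """Run one (working_directory, argv) pair, raising if the command fails."""
    working_directory, argv = step
    subprocess.run(argv, cwd=working_directory, check=True)
```

Running the steps one at a time leaves room to edit the synonyms file by hand between Step 2 and Step 3.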

Graph analyzer

The graph generated by the tool may not always match what a human would build. After the graph is generated, our analyzer tool inspects it and helps identify errors that stem from the input data. It reports the two kinds of issues described in the following subsections.
The report is saved in the project root directory as analyzer_report.csv. Each new report is appended to the existing file, so, as with a log file, the latest report is always the last one in the file and carries the latest timestamp.
Developers can clear or delete the file to remove previous reports.

Unreachable Questions

First, alternate questions that belong to a primary question in the input export are mapped to that same primary question again. This preserves the relation between a primary question and its alternate questions as given in the input. It may reduce path coverage for the alternate questions, because the terms built for the primary question are also applied to its alternate questions.

Questions at root node

Second, questions that are very dissimilar from the rest of the corpus may not get grouped. These questions are placed at the root node. It is the bot developer's responsibility to group them as needed.

The output from the analyzer is a CSV file that lists each error type and the questions under it. The path to reach each question is also included in the CSV. A sample analyzer CSV:

[Image: sample analyzer CSV]
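
Because each run appends to analyzer_report.csv, it can help to set the old file aside before generating a new graph instead of deleting it. The following is a minimal sketch of that housekeeping. The function name and the timestamp suffix are assumptions, not part of the tool.

```python
import os
import time


def archive_report(path="analyzer_report.csv"):
    """Rename the report with a timestamp suffix; return the new path, or None."""
    if not os.path.exists(path):
        return None  # nothing to archive yet
    root, extension = os.path.splitext(path)
    archived = "{}_{}{}".format(root, time.strftime("%Y%m%d_%H%M%S"), extension)
    os.rename(path, archived)
    return archived
```

After archive_report() runs in the project root, the next analyzer run starts a fresh report file.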

Troubleshooting

Windows Operating system

Cannot open include file: 'basetsd.h': No such file or directory

C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\BIN\amd64\cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG -IC:\Python27\include -IC:\Python27\PC /Tchello.c /Fobuild\temp.win-amd64-2.7\Release\hello.obj
hello.c
C:\Python27\include\pyconfig.h(227) : fatal error C1083: Cannot open include file: 'basetsd.h': No such file or directory
error: command '"C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\BIN\amd64\cl.exe"' failed with exit status 2

LNK1158 cannot run rc.exe x64 Visual Studio