This repository contains a Starter Kit (SK) designed to show you how to create your own answer retrieval application for StackExchange using the Retrieve and Rank (R&R) service, a cognitive API from the Watson Developer Cloud. Information retrieval applications enable users to search for content in specific information sources. Creating an answer retrieval system has historically been a complex task that requires extensive configuration and expert tuning. This starter kit uses the Retrieve and Rank API to support the entire process of creating such a system, from uploading your data, through training your answer retrieval system, to evaluating results.
You will be able to deploy the application to Bluemix only after completing the steps listed in the table of contents below:
- How this app works
- Getting started
- Running the notebooks
- Exploring with the UI
- Using your own data
- Improving relevance
This starter kit uses Jupyter Notebook, a web application that allows you to create and share documents that contain code, visualizations, and explanatory text. (Jupyter Notebook was formerly known as iPython Notebook.) Jupyter Notebook executes the sections of Python code that are embedded in a notebook and displays the results of those commands in a highlighted section below each code block. The Jupyter notebooks in this SK show you the process of building a custom ranker for the data on Travel Stack Exchange.
This SK has three primary components:
- Two Jupyter notebooks, which show you the process of building an answer retrieval system using the Watson Retrieve and Rank service. These notebooks are:
  - Answer Retrieval, which shows how to create a basic SOLR collection and enhance it with a ranker
  - Custom Scorer, which shows how to add custom features to your ranker

  Once you have completed the notebooks, you can launch a simple application that shows how the answer retrieval system performs. Specifically, the application compares SOLR, a common open-source Information Retrieval system, with Retrieve and Rank's ranker, which reranks search results to be in an order more salient to a particular user. It also shows how you can extend the basic ranker with custom features that consider domain-specific information.
- Bash shell scripts that enable you to train a ranker on data from any StackExchange question and answer site.
- Python code to help you extract other content from StackExchange and pre-process it for use with the Retrieve and Rank service.
Once you complete the notebooks and understand the format of data expected by the Retrieve and Rank API, you should be able to train Retrieve and Rank on any dataset.
Before diving into the Jupyter notebooks, you should make sure you have all the prerequisites installed, and are familiar with the directory structure of the git repository that contains this SK.
You will need the following in order to use this SK:
- A Unix-based OS (or Cygwin)
- Git
- Node.js
- Python
- Anaconda - Installing this package also installs the Jupyter notebook package, which includes iPython (now referred to as jupyter)
- A Bluemix account
- An instance of the Retrieve and Rank service
If you are using a Linux system, the git, anaconda, python, and node.js packages may be installable through your system's package manager.
Use git to clone the repository for this SK to your local machine. For example, using a command-line version of git, you would execute the following command:
git clone git@github.com:watson-developer-cloud/answer-retrieval.git
The directory that you created when cloning the git repository for this SK contains the following subdirectories:
- bin/ contains various bash and python scripts for interacting with the R&R API
- config/ contains a configuration that tells SOLR how the StackExchange data is structured
- custom-scorer/ contains the code necessary to train scorers for R&R that use custom features
- data/ contains sample StackExchange data that is pre-processed for use by the Retrieve and Rank service. This data will be automatically uploaded to the Retrieve and Rank service as part of the Python code in the Config section of the Answer Retrieval notebook.
- notebooks/ contains the iPython notebooks
- static/ contains the static website assets (css, js, html)
- Install the dependencies using pip.
  pip install -r requirements.txt
  pip install -r notebooks/requirements.txt
- Create a .env file, using .env.example as an example. You will need credentials for the Retrieve and Rank service.
- Start the application.
  python server.py
The Jupyter notebooks show you the process of creating an information retrieval system, step-by-step, automatically executing specified sections of Python code. We used Jupyter notebooks because they encourage experimentation, which is an important part of developing any machine learning system.
You will need credentials in order to use R&R. You can obtain them by creating an account in Bluemix (http://bluemix.net) and then creating an instance of the service in that account. Once both are done, click the "Service Credentials" entry in the left-hand navigation for that service in Bluemix to see your R&R credentials.
Before starting the notebook, add the username and password from the credentials for the instance of the Retrieve and Rank service that you created to the json file credentials.json. This file is located in the config directory of this SK's repository. This enables the notebooks to use these values throughout all of the code blocks in the notebook.
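As a rough illustration of how code might read those values, the sketch below parses a credentials file and checks that both entries are present. The flat layout with "username" and "password" keys is an assumption, as are the function names; the notebooks may read the file differently.

```python
import json
from pathlib import Path


def parse_credentials(text):
    """Return (username, password) from the JSON text of a credentials file.

    Assumption: the file is a flat object with "username" and "password" keys.
    """
    data = json.loads(text)
    missing = [key for key in ("username", "password") if not data.get(key)]
    if missing:
        raise ValueError("credentials are missing: " + ", ".join(missing))
    return data["username"], data["password"]


def load_credentials(path="config/credentials.json"):
    """Read the credentials file from disk and parse it."""
    return parse_credentials(Path(path).read_text(encoding="utf-8"))
```

Failing early on an empty username or password makes a forgotten edit to credentials.json visible before any call to the service is attempted.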
To start the notebooks, make sure you are in the root directory of your git checkout of the SK repository, and execute the command jupyter notebook. This will start the Jupyter notebook server and open a browser window. Once the browser window is open, click on notebooks, and then open the notebook labeled Answer-Retrieval.ipynb. Follow the instructions there to create your own ranker.
The Populate the Collection section of the Answer Retrieval notebook loads sample data into the Solr collection that was created by previous code blocks in that notebook. This sample data is located in the config directory of the repository for this SK.
Important: The Sample Application UI will not have any rankers to display results for until you have stepped through the iPython notebooks.
Now that you have completed the iPython notebooks, you have two ways to search and compare the results of your experiments: basic Solr versus the default ranker, and basic Solr versus a ranker with custom scoring features. To explore how these rankers perform, modify a few properties in your local environment file (.env).
- To use the UI with the default ranker, set the SHOW_DEFAULT_RANKER property to "TRUE" and set the RANKER_ID property to the default ranker id created at the end of the Answer Retrieval notebook.
- To use the UI with the custom ranker, set the SHOW_DEFAULT_RANKER property to "FALSE" and set the RANKER_ID property to the custom ranker id created at the end of the Custom Scorer notebook.
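The effect of these two properties can be sketched as follows. This is only an illustration: how server.py actually reads the .env file is not shown here, and the function names and the placeholder ranker id are hypothetical.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def select_ranker(settings):
    """Return (label, ranker_id) for the ranker that is compared against Solr."""
    show_default = settings.get("SHOW_DEFAULT_RANKER", "").upper() == "TRUE"
    label = "default ranker" if show_default else "custom ranker"
    return label, settings.get("RANKER_ID", "")


# Placeholder values; use the ranker id produced by your own notebook run.
example = 'SHOW_DEFAULT_RANKER="TRUE"\nRANKER_ID=your-default-ranker-id\n'
print(select_ranker(parse_env(example)))
```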
If you want to train rankers with data from other StackExchange sites, you first need to download the dumps. Once you have chosen a dump, you can use bin/python/extract_stackexchange_dump.py to convert it into an R&R-compatible format. If you wish to use another data source, consult the Retrieve and Rank documentation, which explains how R&R expects incoming data to be formatted.
It is often necessary to look for additional features in your dataset that help the ranker identify which results are more relevant than others. These are implemented as a set of custom features known as custom scorers. In the case of the Stack Exchange community support scenario that was used to collect the sample data for this SK, you have access to metadata about each answer. Examples of this sort of metadata include the following:
- User Reputation: a rough measurement of how much the community trusts an expert author. Community users gain reputation points when a question that they have asked is voted up, an answer that they have made is voted up, an answer that they have made is accepted, and other criteria.
- UpVotes: the number of times that someone other than the expert has accepted an expert's answer as a pertinent response.
- DownVotes: the number of times that someone other than the expert has rejected an expert's answer as a pertinent response.
- Number of Views: the overall popularity of the related topic/question
- Answer Accepted: a Boolean that identifies whether the answer provided by the expert was accepted by the original author
Metadata such as the items in the preceding list can be used to create custom features that provide information to the ranker which it can use to enhance learning about the problem domain. If the metadata has a strong correlation to predicting relevance, you should see improvements to overall relevance metrics.
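As a minimal sketch of the idea, the function below turns the metadata listed above into a small numeric feature vector. The dictionary keys and the particular transformations (log damping, a net-approval ratio) are assumptions chosen for illustration, not the features used by this SK.

```python
import math


def metadata_features(answer):
    """Map one answer's metadata (a dict) to a list of numeric features.

    All key names here are hypothetical; adapt them to your own data.
    """
    up = answer.get("up_votes", 0)
    down = answer.get("down_votes", 0)
    return [
        math.log1p(answer.get("user_reputation", 0)),   # dampen large reputations
        math.log1p(answer.get("views", 0)),             # dampen very popular questions
        (up - down) / (up + down + 1),                  # net approval, strictly between -1 and 1
        1.0 if answer.get("accepted", False) else 0.0,  # accepted-answer flag
    ]
```

Whether any of these features helps depends on how strongly it correlates with relevance in your data, so it is worth comparing relevance metrics with and without each one.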
The custom features that you can create for a Retrieve and Rank implementation typically fall into 3 categories:
- Document
- Query
- Query and Document
You can write custom scorers using any of the following:
- DocumentScorer - A document scorer is a class whose input to the score method is a field or fields for a single Solr document. Consider a class called DocumentViews, which creates a score based on the number of views for a given topic, where the view count is represented as a field in the Solr document.
- QueryScorer - A query scorer is a class whose input to the score method is a set of query params for a Solr query. Consider a class called IsQueryOnTopicScorer, which scores queries based on whether it thinks the underlying query text is on topic for the application domain.
- QueryDocumentScorer - A query-document scorer is a class whose input to the score method is 1) a set of query params for a Solr query, and 2) a field or fields for a Solr document. Consider a class that scores the extent to which the "text" of a Solr document answers definitional questions. More specifically, the scorer will 1) identify if a query is asking for a definition and 2) if so, identify whether the document contains a likely definition or not.
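The three scorer shapes described above can be sketched as plain Python classes. These standalone classes only illustrate the inputs to each score method; they do not subclass the base classes of the scoring framework in this SK, and the "views" field name, the keyword list, and the definition cues are hypothetical.

```python
class DocumentViews:
    """Document scorer: scores one Solr document from its view-count field."""

    def __init__(self, field="views"):
        self.field = field

    def score(self, document):
        return float(document.get(self.field, 0))


class IsQueryOnTopicScorer:
    """Query scorer: 1.0 if the query text mentions any on-topic keyword."""

    def __init__(self, keywords):
        self.keywords = {word.lower() for word in keywords}

    def score(self, query_params):
        words = set(query_params.get("q", "").lower().split())
        return 1.0 if words & self.keywords else 0.0


class DefinitionScorer:
    """Query-document scorer: rewards documents that look like definitions
    when the query appears to ask for one."""

    QUERY_CUES = ("what is", "what are", "define")
    TEXT_CUES = (" is a ", " is an ", " refers to ", " means ")

    def score(self, query_params, document):
        query = query_params.get("q", "").lower()
        if not query.startswith(self.QUERY_CUES):
            return 0.0
        text = document.get("text", "").lower()
        return 1.0 if any(cue in text for cue in self.TEXT_CUES) else 0.0
```

A query scorer sees only the query parameters, a document scorer sees only the document fields, and a query-document scorer receives both, which is what lets it condition the document signal on the kind of question asked.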
The custom scorer notebook provided here as part of this starter kit provides access to a custom scoring framework allowing you to extract new features for the purposes of training a ranker.
This project enables the use of custom features within the Retrieve & Rank service on Bluemix. It is written in Python and uses the Flask micro-framework. The Python script server.py exposes two endpoints to be consumed by the application that uses it.
The Flask server that is created within the script server.py is intended to run as a "sidecar" within the rest of the application; that is, the parent application will make REST API calls to this "sidecar" service rather than making direct calls to the deployed Retrieve & Rank service. The principal difference is that the Flask server will handle the integration/injection of custom features that have been registered.
There are 4 steps to set up the server to integrate custom features:
- Configure your environment. The Python Flask server is already set up to run and extract features from custom scorers. The custom scorers are packaged as a wheel package, which needs to be installed whenever a new custom scorer is created.
- Identify the custom scorers for your application. A custom scorer is a Python class that extracts a signal that is to be used for the ranker. There are 3 types of scorers supported:
  - "QueryDocument" Scorer - Extracts a signal/score based on both the contents of a query and the contents of a Solr Document
  - "Document" Scorer - Extracts a signal based on the contents of the Solr Document alone
  - "Query" Scorer - Extracts a signal based on the contents of the query alone

  To make sure that the server has the most up to date scorers, go to the 'Custom Scorer' notebook and follow instructions to build and install the wheels package.
- Create a configuration file (see config/features.json as an example) to configure your application to consume these scorers. The configuration file must be a json file and must contain a "scorers" field, which is a list of individual scorer configurations. For each of the scorers that you identified in the previous step, the scorer configuration is a JSON object that must define the following:
  - an 'init_args' json object, whose fields are the arguments to the constructor for the scorer
  - a 'type' field, which should be either 'query', 'document' or 'query_document', depending on the type of scorer. The type is used to identify the package within the 'retrieve_and_rank_scorer' project that contains the appropriate scorer
  - a 'module' field, which is the name of the python module which contains the scorer
  - a 'class' field, which is the name of the scorer class

  For example, config/features.json contains a single Document scorer, in the module document, with the class UpVoteScorer. This scorer extracts a feature based on the positive votes that a post has received.
- Start the Flask server by running the command
  python server.py
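To make the configuration format in the third step concrete, the sketch below checks a configuration against the rules described there. The example entry mirrors the single Document scorer mentioned above, but its empty init_args is an assumption, and the validator itself is a hypothetical helper rather than part of server.py.

```python
import json

REQUIRED_FIELDS = ("init_args", "type", "module", "class")
SCORER_TYPES = ("query", "document", "query_document")


def validate_features_config(text):
    """Parse a features configuration and return its list of scorer entries."""
    config = json.loads(text)
    scorers = config.get("scorers")
    if not isinstance(scorers, list):
        raise ValueError('configuration must contain a "scorers" list')
    for index, entry in enumerate(scorers):
        missing = [name for name in REQUIRED_FIELDS if name not in entry]
        if missing:
            raise ValueError("scorer %d is missing: %s" % (index, ", ".join(missing)))
        if entry["type"] not in SCORER_TYPES:
            raise ValueError("scorer %d has unknown type %r" % (index, entry["type"]))
    return scorers


# Assumed shape of a one-scorer configuration; init_args is left empty here.
example = """
{"scorers": [{"type": "document", "module": "document",
              "class": "UpVoteScorer", "init_args": {}}]}
"""
print(validate_features_config(example))
```

Running a check like this before starting the Flask server catches a misspelled type or a missing field before any query is scored.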
Sample web applications that include the cf_deployment_tracker package provided here may be configured to track deployments to IBM Bluemix and other Cloud Foundry platforms. The following information is sent to a Deployment Tracker service on each deployment:
- Python package version
- Python repository URL
- Application Name (application_name)
- Space ID (space_id)
- Application Version (application_version)
- Application URIs (application_uris)
- Labels of bound services
- Number of instances for each bound service and associated plan information
This data is collected from the server.py file in the sample application and the VCAP_APPLICATION and VCAP_SERVICES environment variables in IBM Bluemix and other Cloud Foundry platforms. This data is used by IBM to track metrics around deployments of sample applications to IBM Bluemix to measure the usefulness of our examples, so that we can continuously improve the content we offer to you. Only deployments of sample applications that include code to ping the Deployment Tracker service will be tracked.