Correlation of text content across documents using Watson Natural Language Understanding, Python NLTK and IBM Data Science experience
Data Science Experience is now Watson Studio. Although some images in this code pattern may show the service as Data Science Experience, the steps and processes will still work.
In this code pattern, we will use Jupyter notebooks in IBM Data Science Experience (Watson Studio) to correlate text content across documents with the Python NLTK toolkit and IBM Watson Natural Language Understanding. The correlation algorithm is driven by an input configuration JSON that contains the rules and grammar for building the relations. The configuration JSON document can be modified to obtain better correlation results between text content across documents.
When the reader has completed this code pattern, they will understand how to:
- Create and run a Jupyter notebook in Watson Studio.
- Use Object Storage to access data and configuration files.
- Use IBM Watson Natural Language Understanding API to extract metadata from documents in Jupyter notebooks.
- Extract and format unstructured data using simplified Python functions.
- Use a configuration file to specify the co-reference and relations grammar.
- Store the processed output JSON in Object Storage.
The intended audience for this code pattern is developers who want to learn a method for correlation of text content across documents. The distinguishing factor of this code pattern is that it allows a configurable mechanism of text correlation.
- IBM Watson Studio: Analyze data using RStudio, Jupyter, and Python in a configured, collaborative environment that includes IBM value-adds, such as managed Spark.
- IBM Cloud Object Storage: An IBM Cloud service that provides an unstructured cloud data store to build and deliver cost-effective apps and services with high reliability and fast speed to market.
- Watson Natural Language Understanding: An IBM Cloud service that can analyze text to extract metadata from content, such as concepts, entities, keywords, categories, sentiment, emotion, relations, and semantic roles, using natural language understanding.
- Jupyter Notebooks: An open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.
Follow these steps to set up and run this code pattern. The steps are described in detail below.
- Sign up for Watson Studio
- Create IBM Cloud services
- Create the notebook
- Add the data and configuration file
- Update the notebook with service credentials
- Run the notebook
- Analyze the results
Sign up for IBM's Watson Studio. When you create a project in Watson Studio, a free tier Object Storage service is created in your IBM Cloud account. Take note of your service names as you will need to select them in the following steps.
Note: When creating your Object Storage service, select the Free storage type in order to avoid having to pay an upgrade fee.
Create the following IBM Cloud service and name it wdc-NLU-service:
- Watson Natural Language Understanding
- In Watson Studio, click on Create notebook to create a notebook.
- Create a project if necessary, provisioning an object storage service if required.
- In the Assets tab, select the Create notebook option.
- Select the From URL tab.
- Enter a name for the notebook.
- Optionally, enter a description for the notebook.
- Enter this Notebook URL: https://github.com/IBM/watson-document-co-relation/blob/master/notebooks/watson_correlate_documents.ipynb
- Select the free Anaconda runtime.
- Click the Create button.
- From the My Projects > Default page, use Find and Add Data (look for the 10/01 icon) and its Files tab.
- Click browse and navigate to watson-document-co-relation/data/sample_text_1.txt in this repo.
- Click browse and navigate to watson-document-co-relation/data/sample_text_2.txt in this repo.
- Click browse and navigate to watson-document-co-relation/configuration/sample_config.txt in this repo.
Note: It is possible to use your own data and configuration files. If you use a configuration file from your computer, make sure it conforms to the JSON structure given in configuration/sample_config.txt.
If you use your own data and configuration files, you will need to update the variables that refer to the data and configuration files in the Jupyter Notebook.
In the notebook, update the global variables in the cell following the 2.3 Global Variables section.
Replace sampleTextFileName1 and sampleTextFileName2 with the names of your data files, and sampleConfigFileName with your configuration file name.
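As a rough illustration, the global variables cell might end up looking like the snippet below once the sample files are uploaded. The variable names come from the notebook section described above and the file names from the sample files in this repo; the final sanity check is an addition made for this example, and the actual cell may define other variables as well.

```python
# Names of the files uploaded to the project (replace with your own if needed).
sampleTextFileName1 = "sample_text_1.txt"
sampleTextFileName2 = "sample_text_2.txt"
sampleConfigFileName = "sample_config.txt"

# Illustrative check (not part of the notebook): the three names should be
# non-empty and distinct, otherwise a later cell would read the wrong file.
file_names = [sampleTextFileName1, sampleTextFileName2, sampleConfigFileName]
names_look_valid = all(file_names) and len(set(file_names)) == len(file_names)
print("file names look valid:", names_look_valid)
```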
Select the cell below the 2.1 Add your service credentials from IBM Cloud for the Watson services section in the notebook to update the credentials for Watson Natural Language Understanding.
Open the Watson Natural Language Understanding service in your IBM Cloud Dashboard and click on your service, which you should have named wdc-NLU-service.
Once the service is open, click the Service Credentials menu on the left.
In the Service Credentials panel that opens, select the credentials you would like to use in the notebook from the KEY NAME column. Click View credentials and copy the username and password key values that appear in the UI in JSON format.
Update the username and password key values in the cell below the 2.1 Add your service credentials from IBM Cloud for the Watson services section.
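For orientation, here is a minimal sketch of how the copied values might be held in a cell, together with the general shape of a request body that asks Natural Language Understanding for several feature groups. The `nlu_credentials` dictionary, the placeholder values, and the `credentials_are_filled` and `build_analyze_payload` helpers are assumptions made for this example, not the notebook's actual code. The sketch only builds the JSON body and does not call the service.

```python
import json

# Placeholders: paste the values copied from "View credentials".
# The dictionary name and placeholder strings are illustrative only.
nlu_credentials = {
    "username": "YOUR_NLU_USERNAME",
    "password": "YOUR_NLU_PASSWORD",
}


def credentials_are_filled(creds):
    """Return True when both keys are present and no longer placeholders."""
    return all(
        bool(creds.get(key)) and not creds[key].startswith("YOUR_")
        for key in ("username", "password")
    )


def build_analyze_payload(
    text, features=("entities", "keywords", "relations", "semantic_roles")
):
    """Build a request body asking for each feature group with default options."""
    return {"text": text, "features": {name: {} for name in features}}


payload = build_analyze_payload("Alice works for Acme in Boston.")
print(json.dumps(payload, indent=2))
print("credentials filled in:", credentials_are_filled(nlu_credentials))
```

Checking for leftover placeholders before running the rest of the notebook can save a confusing authentication error later on.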
- Select the cell below the 2.2 Add your service credentials for Object Storage section in the notebook to update the credentials for Object Storage.
- Delete the contents of the cell.
- Use Find and Add Data (look for the 10/01 icon) and its Files tab. You should see the file names uploaded earlier. Make sure your active cell is the empty one below 2.2 Add...
- Select Insert to code (below your sample_text.txt).
- Click Insert Credentials from the drop-down menu.
- Make sure the credentials are saved as credentials_1.
When a notebook is executed, each code cell in the notebook is executed in order, from top to bottom.
IMPORTANT: The first time you run your notebook, you will need to install the necessary packages in section 1.1 and then Restart the kernel.
Each code cell is selectable and is preceded by a tag in the left margin. The tag format is In [x]:. Depending on the state of the notebook, the x can be:
- A blank, which indicates that the cell has never been executed.
- A number, which represents the relative order in which this code step was executed.
- A *, which indicates that the cell is currently executing.
There are several ways to execute the code cells in your notebook:
- One cell at a time.
  - Select the cell, and then press the Play button in the toolbar.
- Batch mode, in sequential order.
  - From the Cell menu bar, there are several options available. For example, you can Run All cells in your notebook, or you can Run All Below, which will start executing from the first cell under the currently selected cell, and then continue executing all cells that follow.
- At a scheduled time.
  - Press the Schedule button located in the top right section of your notebook panel. Here you can schedule your notebook to be executed once at some future time, or repeatedly at your specified interval.
After running each cell of the notebook under Correlate text, the results will display.
The document similarity score is computed using the cosine distance function in the NLTK module. The document similarity results can be enhanced by adding to the stop words or text tags. Words added to the stop words are ignored in the comparison. Word tags from the Watson text classifier, and any custom tags that are added, are taken into account in the comparison.
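As a rough illustration of the idea, the sketch below computes a cosine similarity between two bag-of-words vectors using only the Python standard library. It is not the notebook's code: the notebook relies on NLTK, whereas the tokenizer, the tiny `STOP_WORDS` set, and the `term_vector` and `cosine_similarity` helpers here are assumptions made for the example. It does show how the stop-word list and extra tags feed into the score.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the notebook's list will differ).
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def term_vector(text, stop_words=STOP_WORDS, extra_tags=()):
    """Count the non-stop-word tokens in text, plus any extra tags."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(t for t in tokens if t not in stop_words)
    counts.update(tag.lower() for tag in extra_tags)
    return counts


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc1 = "The bank approved the loan"
doc2 = "The bank rejected the loan"

# With stop words removed, only "bank" and "loan" are shared.
with_stop_words = cosine_similarity(term_vector(doc1), term_vector(doc2))
# With no stop-word list, the repeated "the" inflates the overlap.
without_stop_words = cosine_similarity(
    term_vector(doc1, stop_words=set()), term_vector(doc2, stop_words=set())
)
print(with_stop_words, without_stop_words)
```

A cosine distance, such as the helper NLTK provides in `nltk.cluster.util`, is one minus the similarity computed here, so a smaller distance corresponds to a higher similarity score.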
The configuration JSON controls the way the text is correlated. The correlation involves two aspects: co-referencing and relation determination. The configuration JSON contains the rules and grammar for both. The output from Watson Natural Language Understanding and the Python NLTK toolkit is processed according to these rules and grammar to produce the correlation of content across documents.
We can modify the configuration JSON to add more rules and grammar for co-referencing and determining relations. This means the text content correlation results can be enhanced without changes to the code.
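To make the idea of configuration-driven correlation concrete, here is a small sketch under stated assumptions. The configuration schema, the `ENTITIES` table standing in for Natural Language Understanding output, and the `resolve_coreferences` and `extract_relations` helpers are all invented for this example; they do not reflect the structure of configuration/sample_config.txt or the notebook's implementation, and plain regular expressions are swapped in for the grammar the pattern uses.

```python
import json
import re

# Hypothetical configuration, invented for this sketch. It does NOT follow the
# schema of configuration/sample_config.txt.
CONFIG_JSON = """
{
  "coreference": {"she": "Person", "he": "Person", "it": "Company"},
  "relations": [
    {"name": "worksFor",
     "pattern": "(?P<subject>[A-Za-z]+) works for (?P<object>[A-Za-z]+)"}
  ]
}
"""

# Stand-in for entities that a service such as Watson NLU might return.
ENTITIES = {"Alice": "Person", "Acme": "Company"}


def resolve_coreferences(sentences, entities, pronoun_types):
    """Replace each configured pronoun with the latest entity of the matching type."""
    last_seen = {}
    resolved = []
    for sentence in sentences:
        words = []
        for word in sentence.split():
            bare = word.strip(".,;:!?")
            entity_type = pronoun_types.get(bare.lower())
            if bare in entities:
                last_seen[entities[bare]] = bare
            elif entity_type in last_seen:
                word = word.replace(bare, last_seen[entity_type])
            words.append(word)
        resolved.append(" ".join(words))
    return resolved


def extract_relations(sentences, rules):
    """Apply every regex rule to every sentence; collect (subject, name, object)."""
    relations = []
    for sentence in sentences:
        for rule in rules:
            for match in re.finditer(rule["pattern"], sentence):
                relations.append(
                    (match.group("subject"), rule["name"], match.group("object"))
                )
    return relations


config = json.loads(CONFIG_JSON)
sentences = ["Alice joined Acme.", "She works for Acme."]
resolved = resolve_coreferences(sentences, ENTITIES, config["coreference"])
relations = extract_relations(resolved, config["relations"])
print(resolved)
print(relations)
```

In this toy setup, adding a new entry to the `relations` list is enough to pick up a new kind of relation, which mirrors the point above: the rules live in the configuration, not in the code.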
We can see from section 6. Visualize correlated text in the notebook the correlations between the text in the two sample documents that we provided. The output seen below is the augmented output from Watson Natural Language Understanding with the relationships extracted from the rules methodology explained in this pattern.
In addition, the similarity between the two sample texts that we provided is computed in the notebook section 5. Correlate text. The similarity score between the two sample texts is 0.790569415042.
Mine insights from software development artifacts
Get insights on personal finance data
This code pattern is licensed under the Apache Software License, Version 2. Separate third party code objects invoked within this code pattern are licensed by their respective providers pursuant to their own separate licenses. Contributions are subject to the Developer Certificate of Origin, Version 1.1 (DCO) and the Apache Software License, Version 2.