This readme outlines the steps, in Python, for applying topic modeling to US patents from 3M and seven of its competitors.
The code has five sections:
- Modules & Working Directory
- Load Dataset, Set Column Names and Sample (Explore) Data
- Data Wrangling (Tokenize, Clean, TF-IDF)
- Topic Modeling (K-Means, LDA, Topic Word Cloud)
- K-Means Clustering on the Topic Probabilities
This code was assembled from several online references. Each reference is labelled with a [number] tag that is used throughout this document for citations.
Somewhat technical:
- [1] Introduction to Bag-of-Words modeling in Python
- [2] Applications and Challenges of Text Mining with Patents
- [3] Topic Modeling Visualizations
- [4] Stemming & Lemmatization
Very technical:
- [5] NLTK Homepage
- [6] gensim Homepage
- [7] LDA Modeling for Python
- [8] Topic Modeling (& K-Means) with Gensim
- [9] Constructing a broad-coverage lexicon for text mining in the patent domain
- [10] Document Clustering with Similarity
- [11] DBSCAN
- [12] Identifying Bigrams using gensim
First, import the Python modules used throughout this script.
# from nltk
from nltk.corpus import stopwords # Import the stop word list
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer # Lemmatization
# modules that are used
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import re
import csv
import logging
from scipy.spatial.distance import cdist, pdist
from wordcloud import WordCloud
# import gensim
import gensim
from gensim import corpora, models, similarities
Set the working directory.
# working directory - modify to your own home directory
os.chdir("/home/ryanceros/Dropbox/Project - Big Data Analytics/WordCloud")
The dataset is named "updatedCompanies.csv". It includes nearly 33,000 patents for eight companies (3M and seven competitors).
# sample dataset
dataset = 'updatedCompanies.csv'
# Load dataset
exampleData = pd.read_csv(dataset, header=None)
# Rename the columns as the CSV does not contain headers
exampleData.columns = ["PatentNumber","CompanyName","Company","PatentAssignee",
"YearGranted","YearApplied","Year","PatentClass1","PatentClass2","PatentClassClean",
"ClassName","PatentTitle","PatentAbstract"]
# Check shape
exampleData.shape
# (32405, 13)
exampleData.columns.values
# array(['PatentNumber', 'CompanyName', 'Company', 'PatentAssignee',
#        'YearGranted', 'YearApplied', 'Year', 'PatentClass1', 'PatentClass2',
#        'PatentClassClean', 'ClassName', 'PatentTitle', 'PatentAbstract'], dtype=object)
Reference [4] was adapted to create the script StemLemma.py, which is loaded here:
exec(open("StemLemma.py").read())
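StemLemma.py itself is not reproduced in this readme. As a rough, minimal sketch of the kind of cleaning function it is assumed to define (the name patent_to_words comes from its use below; the exact steps of lowercasing, stop-word removal and lemmatization per [4] are assumptions):
# Minimal sketch of a patent_to_words() cleaner (assumed behavior, not the
# verbatim StemLemma.py). Requires nltk.download('stopwords') and
# nltk.download('wordnet') to have been run once.
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stops = set(stopwords.words("english"))

def patent_to_words(raw_abstract):
    letters_only = re.sub("[^a-zA-Z]", " ", str(raw_abstract))  # keep letters only
    words = letters_only.lower().split()                        # lowercase and tokenize
    meaningful = [lemmatizer.lemmatize(w) for w in words if w not in stops]
    return " ".join(meaningful)                                 # space-joined clean string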
Using an example patent (num = 560), explore its fields, including the abstract's bag-of-words representation.
num=560
print "Company Name: %s" % (exampleData["Company"][num])
print("")
print("Patent Title: " + exampleData["PatentTitle"][num])
print("")
print("Class Name: " + exampleData["ClassName"][num])
print("")
print("Class Number (Left 3): %s " % exampleData["PatentClassClean"][num])
print("")
print("Abstract: " + exampleData["PatentAbstract"][num])
print("")
print("Abstract Bag of Words: " + patent_to_words(exampleData["PatentAbstract"][num]))
This section first cleans and tokenizes the patent abstracts into unigrams. Next, using gensim's Phrases function, additional bigrams are created and included in the topic modeling; [12] was used as a reference for the bigrams.
# Get the number of patents based on the dataframe column size
num_patents = exampleData["PatentAbstract"].size
# Initialize an empty list to hold the clean abstracts
clean_abstracts = []
# Loop over each patent abstract; create an index i that goes from 0 to the
# length of the patent list
for i in range(0, num_patents):
    # Call our cleaning function for each abstract and add the tokenized
    # result to the list of clean abstracts
    patent = patent_to_words(exampleData["PatentAbstract"][i])
    array = patent.split()
    clean_abstracts.append(array)
# Identify bigrams using gensim's Phrases function
bigram = models.Phrases(clean_abstracts)
final_abstracts = []
for i in range(0, num_patents):
    sent = clean_abstracts[i]
    temp_bigram = bigram[sent]
    final_abstracts.append(temp_bigram)
# turn our tokenized documents into an id <-> term dictionary
dictionary = corpora.Dictionary(final_abstracts)
# convert tokenized documents into a document-term matrix (bag-of-words)
corpus = [dictionary.doc2bow(text) for text in final_abstracts]
#TF IDF
tfidf = models.TfidfModel(corpus, normalize=True)
corpus_tfidf = tfidf[corpus]
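As a quick, optional sanity check (illustrative only, not part of the original pipeline), the vocabulary and the first document's weights can be inspected:
# Optional: inspect the vocabulary and the first bag-of-words document.
print(dictionary)                  # vocabulary summary, e.g. number of unique tokens
print(corpus[0][:10])              # first 10 (token_id, count) pairs of document 0
print([(dictionary[t], round(w, 3)) for t, w in tfidf[corpus[0]][:10]])  # TF-IDF weights for the same terms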
Run KMeans.py to make the KMeans function available; it is used later to cluster the documents and to determine the number of clusters. This section uses the method laid out in [8].
exec(open("KMeans.py").read())
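KMeans.py is not included in this readme; at a minimum it is assumed to import scikit-learn's KMeans, which the clustering code below relies on:
# Assumed minimal content of KMeans.py (not shown in this readme)
from sklearn.cluster import KMeans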
gensim is a topic modeling and text mining module for Python. [6] is the official gensim website and provides several introductory tutorials. [7] and [8] were used to create this section.
# generate LDA model
NUM_TOPICS = 5
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
level=logging.INFO)
# Project to LDA space (%time is an IPython magic; drop it if running as a plain Python script)
%time ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=NUM_TOPICS, id2word=dictionary, passes=100)
ldamodel.print_topics(NUM_TOPICS)
docTopicProbMat = ldamodel.get_document_topics(corpus,minimum_probability=0)
listDocProb = list(docTopicProbMat)
This step cleans up the output of LDA (topic probabilities for each document) and converts it to a pandas dataframe to make analysis easier.
probMatrix = np.zeros(shape=(num_patents, NUM_TOPICS))
for i, x in enumerate(listDocProb):   # each document i
    for t in x:                       # each (topic_id, probability) pair
        probMatrix[i, t[0]] = t[1]
df = pd.DataFrame(probMatrix)
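As an optional check (illustrative only), each row of df holds a probability distribution over the topics and should sum to roughly 1:
# Optional: each document's topic probabilities should sum to ~1.
assert np.allclose(df.sum(axis=1), 1.0, atol=1e-3)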
This step creates a word cloud for each topic, referencing [8].
final_topics = ldamodel.show_topics(num_words=20)
curr_topic = 0
for line in final_topics:
    line = line.strip()
    scores = [float(x.split("*")[0]) for x in line.split(" + ")]
    words = [x.split("*")[1] for x in line.split(" + ")]
    freqs = []
    for word, score in zip(words, scores):
        freqs.append((word, score))
    wordcloud = WordCloud(max_font_size=40).generate_from_frequencies(freqs)
    plt.figure()
    plt.imshow(wordcloud)
    plt.axis("off")
    curr_topic += 1
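The parsing above assumes an older gensim where show_topics() returns plain strings, and an older wordcloud where generate_from_frequencies() accepts a list of (word, weight) tuples. With newer versions of both libraries (an assumption about the installed versions), an equivalent loop might look like:
# Variant for newer gensim/wordcloud APIs: show_topics() returns
# (topic_id, 'weight*"word" + ...') tuples and generate_from_frequencies()
# expects a dict mapping word -> weight.
for topic_id, line in ldamodel.show_topics(num_topics=NUM_TOPICS, num_words=20):
    freqs = {}
    for term in line.split(" + "):
        score, word = term.split("*")
        freqs[word.strip().strip('"')] = float(score)
    wordcloud = WordCloud(max_font_size=40).generate_from_frequencies(freqs)
    plt.figure()
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.title("Topic %d" % topic_id)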
"Synthetic Materials":
"Data & Information":
"Chemistry":
"Electrical":
"Energy":
This step creates a topic probability heatmap for all documents. The documents are unordered (they follow the order of the dataset), so the probabilities appear scattered. [3] was referenced to create this part of the code.
plt.pcolor(df.transpose(), norm=None, cmap='Blues')
topic_labels = ['Synthetic_Material',
'Data_Information',
'Chemistry',
'Electrical',
'Energy_Turbine'
]
plt.yticks(np.arange(df.shape[1])+0.5, topic_labels)
plt.colorbar(cmap='Blues')
This section evaluates a range of candidate numbers of clusters (k = 1 to 10) on the topic probabilities. The within-cluster sum of squares (WCSS) is calculated for each k, and the "elbow" rule is used to choose k = 5. The WCSS values are exported to a CSV file.
k_range = range(1,11)
k_means_var = [KMeans(n_clusters=k).fit(df) for k in k_range]
centroids = [X.cluster_centers_ for X in k_means_var]
k_euclid = [cdist(df, cent, 'euclidean') for cent in centroids]
dist = [np.min(ke,axis=1) for ke in k_euclid]
wcss = [sum(d**2) for d in dist]
dfwcss = pd.DataFrame(wcss)
dataset = 'WCSS.csv'
dfwcss.to_csv(dataset, quotechar='"', quoting=csv.QUOTE_NONNUMERIC, sep=',')
#tss = sum(pdist(df)**2)/df.shape[0]
#bss = tss - wcss
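To apply the elbow rule visually (a minimal optional sketch; the original exports the WCSS to CSV and inspects it elsewhere), the WCSS can be plotted against k:
# Optional elbow plot: WCSS versus number of clusters k.
plt.figure()
plt.plot(list(k_range), wcss, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Within-cluster sum of squares')
plt.title('Elbow plot for choosing k')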
k = 5
kmeans = KMeans(n_clusters=k).fit(df)
clusters = kmeans.labels_
dfclusters = pd.DataFrame(clusters)
# Append patent metadata (number, company, class, title, assignee, year) and the cluster label
df.columns = ['Synthetic_Material',
'Data_Information',
'Chemistry',
'Electrical',
'Energy_Turbine']
df["PatentNumber"] = exampleData["PatentNumber"]
df["Company"] = exampleData["Company"]
df["ClassName"] = exampleData["ClassName"]
df["PatentTitle"] = exampleData["PatentTitle"]
df["PatentAssignee"] = exampleData["PatentAssignee"]
df["Year"] = exampleData["Year"]
df["Cluster"] = dfclusters
This step saves the patent dataset (with the LDA topic probabilities and K-Means cluster labels). It also re-draws the probability heatmap, this time with documents ordered by cluster.
# Save in a new directory
os.chdir("/home/ryanceros/Dropbox/Project - Big Data Analytics/WordCloud/CompetitorLDA")
dataset = 'ProbDocUpdated.csv'
df.to_csv(dataset, quotechar='"', quoting=csv.QUOTE_NONNUMERIC, sep=',')
with open("topicProb.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(final_topics)
newPlot = df.sort_values(['Cluster'], ascending=[True])
newPlot2 = newPlot[topic_labels]
plt.pcolor(newPlot2.transpose(), norm=None, cmap='Blues')
plt.yticks(np.arange(5)+0.5, topic_labels)
plt.colorbar(cmap='Blues')