/textclusteringDBSCAN

Document clustering using Density Based Spatial Clustering (DBSCAN) [undergrad NLP class project 2015@TU]

Primary LanguageJupyter Notebook

textClusteringDBSCAN : Clustering text using Density Based Spatial Clustering (DBSCAN) using TF-IDF, FastText, GloVe word vectors

This is a library for performing unsupervised lingustic functionalities based on textual fields on your data. An API will also be released for real-time inference. This is a small part of project fling, which is an opensource linguistic library designed for easy integration to applications.

Feature exploration and visualization

  • Textual feature visualization

Feature engineering:

  • Add features based on transformers based models. (unsupersized)
  • Add features based on tf-idf as soft features and combine with pretrained word vector based features. (supervised)
  • Compare different features.

Usage

Basic usage instructions. As the code is in development, it might not be stable. More details will be added by 12/31/2020 for proper usage of the library.

Reading data

from textclustering import utilities as ut
from textclustering import tfidfModule as tfm

For now, operations are performed in Pandas dataframes, and the file format we read is csv.

#change operating folder      
os.chdir("/Users/arnabborah/Documents/repositories/textclusteringDBSCAN/scripts/")

#read the .csv data file using the dataProcessor class
rp = tfm.dataProcessor("../datasets/DataAnalyst.csv")

using the generic TF-IDF module (unsupervised)

#create a flingTFIDF object around the pre-processed daa
ftf = tfm.flingTFIDF(rp.dataInitialSmall,'Job Description')

# tokenization, customizable
ftf.smartTokenizeColumn()

# get Term Frequency of each document, and store add it as an object, in a new column
ftf.getTF()

# compute Inverse Document Frequencies across the entire vocabulary
ftf.computeIDFmatrix()

# get TFIDF, and store it as a new column in data, tf-idf
ftf.getTFIDF()

# compute sum of all tf-idf values and add it as a new column
ftf.createDistanceMetadata()
ftf.writeToFile()

using the categeorical TF-IDF module (semi-supervised)

from textclustering import categoricalCharacteristicModule as ccm

rp = dataProcessor("../datasets/DataAnalyst.csv")

# performing custom categorical operations on the data-frame
rp.customProcessData()

fcat = flingCategoricalTFIDF()
allfnames = ft.getallfilenames("/Users/arnabborah/Documents/repositories/textclusteringDBSCAN/processFiles/trainCatFiles")
ft.computeTFIDFallfiles(allfnames)

Notebook for showing usage

import os
import warnings
warnings.filterwarnings('ignore')
from textclustering import utilities as ut
from textclustering import tfidfModule as tfm

#change operating folder      
os.chdir("/Users/arnabborah/Documents/repositories/textclusteringDBSCAN/scripts/")
#read the .csv data file using the dataProcessor class
rp = tfm.dataProcessor("../datasets/DataAnalyst.csv")
                                  Job Description  Company Name
Industry                                                       
-1                                            353           352
IT Services                                   325           325
Staffing & Outsourcing                        323           323
Health Care Services & Hospitals              151           151
Consulting                                    111           111
...                                           ...           ...
Chemical Manufacturing                          1             1
Pet & Pet Supplies Stores                       1             1
Consumer Product Rental                         1             1
Metals Brokers                                  1             1
News Outlet                                     1             1

[89 rows x 2 columns]
#create a flingTFIDF object around the pre-processed daa
ftf = tfm.flingTFIDF(rp.dataInitialSmall,'Job Description')

# tokenization, customizable
ftf.smartTokenizeColumn()

# get Term Frequency of each document, and store add it as an object, in a new column
ftf.getTF()

# compute Inverse Document Frequencies across the entire vocabulary
ftf.computeIDFmatrix()

# get TFIDF, and store it as a new column in data, tf-idf
ftf.getTFIDF()

# compute sum of all tf-idf values and add it as a new column
ftf.createDistanceMetadata()
[ ================================================== ] 100.00%
Adding term frequency column based on stopsRemoved
[ ================================================== ] 100.00%
Computing list of words for IDF...

Created list of terms for IDF matrix with 27075  terms.

Computing global IDF matrix...

[ ================================================== ] 100.00%
Computing and adding TF-IDF column based on stopsRemoved
[ ================================================== ] 100.00%
os.chdir("/Users/arnabborah/Documents/repositories/textclusteringDBSCAN/scripts/")
ftf.data.to_pickle('../processFiles/data_tfidf_processed.pkl')
os.chdir("/Users/arnabborah/Documents/repositories/textclusteringDBSCAN/")
# load dataset with tf-idf vectors and load pretrained GloVe word vectors
from textclustering import flingPretrained as pre
import pandas as pd

dataProcessed = pd.read_pickle('processFiles/data_tfidf_processed.pkl')
fdb = pre.flingPretrained(dataProcessed)
fdb.loadPretrainedWordVectors('glove')

# adding glove vectors for every document
fdb.addDocumentGloveVector()
DBSCAN initialized!

Loading Glove Model

400000  words loaded!

GloVe Vectors Loaded!
# use DBSCAN clustering on the glove vectors loaded in the previos
from textclustering import flingDBSCAN as fdbscan

fdbscan1 = fdbscan.flingDBSCAN(fdb.data,None,25,'glove')
fdbscan1.dbscanCompute()
fdbscan1.addClusterLabel('glove-clusterID')
fdbscan1.printClusterInfo()
flingDBSCAN initialized!

computing best distance
[ ================================================== ] 100.00%

png

Best epsilon computed on GLOVE = 0.6544420699360174 


initiating DBSCAN Clustering with glove vectors

[                                                    ] 0.04%
 ----  cluster_1_ assigned to 565 points! ----
[                                                    ] 0.09%
 ----  cluster_2_ assigned to 855 points! ----
[                                                    ] 0.18%
 ----  cluster_3_ assigned to 58 points! ----
[                                                    ] 0.31%
 ----  cluster_4_ assigned to 119 points! ----
[                                                    ] 0.53%
 ----  cluster_5_ assigned to 109 points! ----
[                                                    ] 1.07%
 ----  cluster_6_ assigned to 53 points! ----
[                                                    ] 1.91%
 ----  cluster_7_ assigned to 37 points! ----
[ =                                                  ] 2.26%
 ----  cluster_8_ assigned to 55 points! ----
[ ===                                                ] 6.79%
 ----  cluster_9_ assigned to 35 points! ----
[ =======                                            ] 15.85%
 ----  cluster_10_ assigned to 32 points! ----
[ ====================                               ] 41.59%
 ----  cluster_11_ assigned to 27 points! ----
[ ================================================== ] 100.00%
 11 clusters formed!
Cluster characteristics:
 -- vectors: glove
 -- minPts: 25
 -- EstimatedBestDistance 0.6544420699360174
 -- 11 clusters formed!
 -- 1945 points assigned to clusters!
 -- 308 noise points!

 -- 13.670661340434975 % noise!
# converting tf-idf into vectors
fdb.tfidf2vec('tf-only')
fdb.tfidf2vec('tf-idf')

# clustering documents based on 
fdbscan2 = fdbscan.flingDBSCAN(fdb.data,None,25,'tfidf')
fdbscan2.dbscanCompute()
fdbscan2.addClusterLabel('tfidf-clusterID')
fdbscan2.printClusterInfo() 
flingDBSCAN initialized!

computing best distance
[ ================================================== ] 100.00%

png

Best epsilon computed on GLOVE-TFIDF = 1.4628292329952732 


initiating DBSCAN Clustering with tfidf vectors

[                                                    ] 0.04%
 ----  cluster_1_ assigned to 810 points! ----
[                                                    ] 0.09%
 ----  cluster_2_ assigned to 695 points! ----
[                                                    ] 0.31%
 ----  cluster_3_ assigned to 61 points! ----
[                                                    ] 0.93%
 ----  cluster_4_ assigned to 347 points! ----
[ =                                                  ] 3.86%
 ----  cluster_5_ assigned to 26 points! ----
[ =============                                      ] 26.14%
 ----  cluster_6_ assigned to 44 points! ----
[ ================                                   ] 32.45%
 ----  cluster_7_ assigned to 27 points! ----
[ ================================================== ] 100.00%
 7 clusters formed!
Cluster characteristics:
 -- vectors: tfidf
 -- minPts: 25
 -- EstimatedBestDistance 1.4628292329952732
 -- 7 clusters formed!
 -- 1995 points assigned to clusters!
 -- 258 noise points!

 -- 11.451398135818907 % noise!
fdb.data
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
Job Description Company Name Industry stopsRemoved tfMatrix sumTFIDF glove-vector glove-clusterID tfidf2vec-tf tfidf2vec-tfidf tfidf-clusterID
0 Are you eager to roll up your sleeves and harn... Vera Institute of Justice\n3.2 Social Assistance eager roll sleeves harness data drive policy c... word tf tf-idf 0 data... 811.569328 [0.20507256798029552, 0.05984949950738914, 0.0... cluster_0_ [0.2986073091133004, 0.05040200935960588, 0.09... [0.26263354824176166, -0.023444644206149418, -... cluster_0_
1 Overview\n\nProvides analytical and technical ... Visiting Nurse Service of New York\n3.8 Health Care Services & Hospitals overview provides analytical technical support... word tf tf-idf 0 dat... 415.287583 [0.23643422682926837, -0.055056957317073156, 0... cluster_1_ [0.4055475764227641, -0.07285501829268287, 0.1... [0.35240058786555273, -0.1412004425681622, 0.0... cluster_1_
2 We�re looking for a Senior Data Analyst who ... Squarespace\n3.4 Internet we�re looking senior data analyst love mento... word tf tf-idf 0 data ... 439.815932 [0.155861351576923, 0.11735425461538473, -0.05... cluster_2_ [0.283220747730769, 0.14354892653846157, 0.044... [0.2563749918506738, 0.17575736117618113, -0.0... cluster_2_
3 Requisition NumberRR-0001939\nRemote:Yes\nWe c... Celerity\n4.1 IT Services requisition numberrr remoteyes collaborate cre... word tf tf-idf 0 � ... 569.217931 [0.2306739880813952, 0.09347254534883724, -0.0... cluster_2_ [0.29634610203488354, 0.10983982558139535, 0.0... [0.2966705423736133, 0.028126685382837024, -0.... cluster_2_
4 ABOUT FANDUEL GROUP\n\nFanDuel Group is a worl... FanDuel\n3.9 Sports & Recreation fanduel group fanduel group worldclass team br... word tf tf-idf 0 fanduel... 420.106719 [0.12914707201834857, 0.11582829587155963, 0.0... cluster_3_ [0.17368260871559627, 0.10919291513761473, 0.0... [0.021771101166884813, 0.16355587986765768, -0... None
... ... ... ... ... ... ... ... ... ... ... ...
2248 Maintains systems to protect data from unautho... Avacend, Inc.\n2.5 Staffing & Outsourcing maintains systems protect data unauthorized us... word tf tf-idf 0 ... 43.940807 [0.2738081315789473, -0.001255321052631562, 0.... None [0.2949110263157894, 0.029555310526315794, 0.0... [0.23112386279259817, -0.08318866123802247, -0... cluster_4_
2249 Position:\nSenior Data Analyst (Corporate Audi... Arrow Electronics\n2.9 Wholesale position senior data analyst corporate audit j... word tf tf-idf 0 ... 439.042957 [0.2200468355481728, 0.10710706677740867, 0.04... cluster_1_ [0.3396034966777404, 0.09931764750830561, 0.09... [0.3077493047461843, 0.06387599003189207, 0.06... cluster_1_
2250 Title: Technical Business Analyst (SQL, Data a... Spiceorb -1 title technical business analyst sql data anal... word tf tf-idf 0 busin... 205.978695 [0.36188271052631577, 0.05400915065789475, 0.0... cluster_2_ [0.5060029144736842, 0.04490494473684211, 0.11... [0.45506833532863533, 5.3025424212786644e-05, ... cluster_2_
2251 Summary\n\nResponsible for working cross-funct... Contingent Network Services\n3.1 Enterprise Software & Network Solutions summary responsible working crossfunctionally ... word tf tf-idf 0 ... 364.177527 [0.25247974618181807, 0.07676844581818185, -0.... cluster_2_ [0.34654995709090924, 0.07137524545454547, 0.0... [0.27937433353352015, 0.08437047685035409, -0.... cluster_1_
2252 You.\n\nYou bring your body, mind, heart and s... SCL Health\n3.4 Health Care Services & Hospitals bring body mind heart spirit work senior quali... word tf tf-idf 0 data ... 366.509859 [0.23890638028806577, 0.1815799016460906, -0.0... cluster_2_ [0.3220337218518514, 0.22893831193415645, 0.07... [0.2850343471866271, 0.2451438898926933, -0.08... cluster_2_

2253 rows × 11 columns