This project uses spaCy to extract company names and dates from resumes in
different formats (here, .pdf and .docx files) without additional training.
It uses spaCy's basic pretrained model without tuning, so the accuracy is not
satisfactory. However, it demonstrates an elementary approach to natural
language processing.

This is my first NLP project, and it gave me an overview of the field. Given
a set of resumes written in different formats, what is the most effective way
to extract the company names and the corresponding dates? This differs from
named-entity recognition (NER) on ordinary text, because company names in
resumes do not appear in complete sentences.

spaCy's basic model is a multi-task CNN trained on OntoNotes (blogs, news, and
comments), with an NER accuracy of 85%. Since the model is general-purpose, it
might not perform well on NER tasks for resumes. In fact, when I tested it on
an example resume, I saw many entities misclassified. It is therefore critical
to improve the results without access to a large number of training samples.

I set up PyCharm as my IDE and Conda as the Python interpreter, because I could
not install numpy from a wheel in a virtualenv environment on an M1 chip. I
used Python 3.8, and you can install the dependencies as follows.

    pip3 install -r requirements.txt

Here is a list of packages I used.

    import fitz
    import glob
    import os
    import docx2txt
    import textract
    from spacy.pipeline import EntityRuler
    import spacy

I then looped through the resume folder and read the files into strings.
For PDF files, I used PyMuPDF to convert them to strings. Of my 21 sample PDF
files, it failed to read only one. For docx files, I used docx2txt. The
conversion is not perfect for any of the files, so I also tried textract as a
general approach, since it works for many file types. If I had files with
various extensions, I would prefer textract. It reads a file into a byte
string, which you need to decode as UTF-8. Since I only have two types of
files, I kept PyMuPDF for the PDFs, where I think it works better.

    filepath = "resume/*"
    paths = glob.glob(filepath)
    for filename in paths:
        text = ""
        # Pick a reader based on the lowercased file extension.
        ext = os.path.splitext(filename)[-1].lower()
        if ext == '.pdf':
            with fitz.open(filename) as doc:
                for page in doc:
                    # Newer PyMuPDF releases name this method get_text().
                    text += page.getText()
        elif ext == '.docx':
            text = docx2txt.process(filename)
        else:
            print(filename, "is not a pdf or docx file!")

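If textract were used as the general fallback mentioned above, its output
would arrive as bytes and would need decoding. Here is a minimal sketch of that
step. The helper name decode_extracted is hypothetical and not part of
main.py, and a literal byte string stands in for what an extractor such as
textract would return.

```python
def decode_extracted(raw: bytes) -> str:
    """Decode extractor output as UTF-8 (hypothetical helper, not in main.py)."""
    # errors="replace" substitutes U+FFFD for malformed bytes instead of raising.
    return raw.decode("utf-8", errors="replace")


# Stand-in for the bytes an extractor would return for one resume.
sample = "Acme Corp\nJan 2019 \u2013 Present".encode("utf-8")
text = decode_extracted(sample)
```

Replacing malformed bytes rather than raising is a design choice for this
sketch. It keeps the loop going over a folder of imperfect files, at the cost
of a few placeholder characters in the text.
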
After reading the files into strings, I needed a way to extract the company
names efficiently. The first idea that came to mind was to extract bold text,
since company names are usually highlighted. However, when I checked the
resumes, many of them did not bold company names, so I had to find another
approach. I then realized that all resumes have a section titled
"Work Experience", written either in all upper case or with only the first
letters capitalized. So I searched each string for the keyword "Experience" or
"EXPERIENCE" and kept only the text from that point on. This way, if the NER
recognizes company names in an education or skills section that comes before
it, we can safely ignore them without additional work.

    # parse to experience to get company names
    if 'Experience' in text:
        i = text.index('Experience')
        text = text[i:]
    else:
        print("Keyword Experience not found in " + filename + ", search for keyword EXPERIENCE")
        if 'EXPERIENCE' in text:
            i = text.index('EXPERIENCE')
            text = text[i:]
        else:
            print("EXPERIENCE not found either in " + filename + ", move to next resume")

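The two-step keyword search above could also be written as a single
case-insensitive match. This is a minimal sketch, not the code in main.py. The
names EXPERIENCE_RE and slice_from_experience are hypothetical. Unlike the
original, it also matches an all-lowercase "experience", and the word
boundaries skip words such as "Experienced".

```python
import re

# Match "experience" in any casing, as a whole word only.
EXPERIENCE_RE = re.compile(r"\bexperience\b", re.IGNORECASE)


def slice_from_experience(text):
    """Return the text from the first 'experience' keyword onward, or None."""
    match = EXPERIENCE_RE.search(text)
    if match is None:
        return None
    return text[match.start():]


resume = "EDUCATION\nSome University\nWork experience\nAcme Corp 2019 - 2021"
section = slice_from_experience(resume)
```

Returning None when the keyword is missing lets the caller decide whether to
skip the resume or fall back to the full text.
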
I then moved on to the NER part. After reading several articles, I decided to
use spaCy instead of NLTK. NLTK is essentially a string processing library,
where each function takes strings as input and returns strings as output. For
a beginner, I think spaCy's object-oriented approach is friendlier.

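    nlp = spacy.load("en_core_web_lg")
    ruler = EntityRuler(nlp)
    # Leftover regexes from the ORG suffix experiment; not used in the patterns below.
    capitalized_word = "([A-Z][a-z]+)"
    corporate_stopwords = "([Ii]nc|[Cc]orp|[Cc]o|[Ll]lc|[Ll]td)"
    # Treat "present", "now", and "current" (in any casing) as DATE entities.
    patterns = [
        {"label": "DATE", "pattern": [{"LOWER": "present"}]},
        {"label": "DATE", "pattern": [{"LOWER": "now"}]},
        {"label": "DATE", "pattern": [{"LOWER": "current"}]}]
    ruler.add_patterns(patterns)
    # spaCy v2 API; in spaCy v3 the component is added by name instead:
    # nlp.add_pipe("entity_ruler", before="ner")
    nlp.add_pipe(ruler, before='ner')
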
Here I used spaCy's largest pretrained English model (742 MB), because it
recognized more entities than the two smaller models. I then used the
EntityRuler to provide some patterns for the nlp object to match. In this case,
present/current/now can denote working time, so I added them to the pattern
list. I added the ruler before the NER component so that the NER would respect
the existing entity spans and adjust its predictions around them. I also tried
to add company suffixes such as LTD./LLC./CORP./CO. as ORG patterns, but it did
not work well. The NER misidentified many more entities with my regex, so I
commented it out.

    doc = nlp(text)
    ents = doc.ents
    for i in range(len(ents) - 1):
        string = ""
        ent = ents[i]
        j = i + 1
        if ent.label_ == "ORG":
            # Scan the entities that start within 50 characters after the ORG.
            while j < len(ents) and ents[j].start_char - ent.end_char <= 50:
                ent_next = ents[j]
                if ent_next.label_ == "DATE":
                    company = ent.text.strip().rstrip('-').rstrip()
                    # Guard against indexing past the last entity.
                    ent_next_date = ents[j + 1] if j + 1 < len(ents) else None
                    if (ent_next_date is not None and ent_next_date.label_ == "DATE"
                            and ent_next_date.start_char - ent_next.end_char <= 10):
                        # Two adjacent DATE entities: join them into one range.
                        string += "Company: " + company + ", Date: " + \
                                  ent_next.text.strip().rstrip('-') + " - " + ent_next_date.text.strip() + "\n"
                    else:
                        string += "Company: " + company + ", Date: " + ent_next.text + "\n"
                    # output_file is assumed to be an open, writable text file.
                    output_file.write(string)
                    break
                j += 1

I then applied a simple filter rule based on the relative positions of ORG and
DATE entities. I noticed that after each company listed in a resume, people
also give the corresponding time range. So I kept every ORG that is followed
by a DATE within 50 characters. I chose 50 characters because people sometimes
list their job titles after the company names, and there may be trailing
whitespace as well. This way, I can pick out the real company names along with
their time ranges without worrying too much about companies mentioned in the
work experience descriptions, because those are less likely to be followed by
a date.

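The proximity rule described above can be illustrated on its own, without
loading a model. The sketch below uses a hypothetical Span tuple as a stand-in
for spaCy entities and assumes the entities arrive in document order. The names
Span and pair_orgs_with_dates are illustrative and do not appear in main.py.

```python
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    """Stand-in for a spaCy entity: a label plus character offsets."""
    label: str
    start_char: int
    end_char: int
    text: str


def pair_orgs_with_dates(ents: List[Span], max_gap: int = 50) -> List[Tuple[str, str]]:
    """Pair each ORG with the first DATE that starts within max_gap characters."""
    pairs = []
    for i, ent in enumerate(ents):
        if ent.label != "ORG":
            continue
        for nxt in ents[i + 1:]:
            if nxt.start_char - ent.end_char > max_gap:
                break  # entities are in order, so later ones are farther away
            if nxt.label == "DATE":
                pairs.append((ent.text, nxt.text))
                break
    return pairs


ents = [
    Span("ORG", 0, 9, "Acme Corp"),
    Span("DATE", 30, 39, "2019-2021"),
    Span("ORG", 200, 206, "Python"),  # mentioned in a description, no date nearby
    Span("DATE", 400, 404, "2015"),
]
pairs = pair_orgs_with_dates(ents)
```

In this made-up input, only "Acme Corp" has a DATE within 50 characters, so the
stray ORG from the description is dropped.
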
After some testing and inspection, I also realized that spaCy does not always
identify a DATE range correctly: sometimes it does not recognize the - between
the two dates and instead identifies them as two separate DATE entities. One
way to handle this is to check the entity right after the first DATE. If it is
also a DATE, we can reasonably assume it continues the previous one. This
solution is not perfectly accurate, because some resumes place the second date
at the end of the next line, so it is not the next entity. However, my solution
already has O(n^2) time complexity, so it seemed best to check only the next
entity instead of searching for the next DATE and calculating its relative
position.

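A possible refinement of the adjacent-DATE check, which main.py does not do, is
to look at the characters between the two spans and merge them only when that
gap is a range separator. The sketch below assumes the dash sits in the gap
rather than inside either entity. The names RANGE_GAP_RE and merge_date_range
are hypothetical.

```python
import re

# A gap that signals a range: optional whitespace around a hyphen,
# an en dash, an em dash, or the word "to".
RANGE_GAP_RE = re.compile(r"^\s*(?:-|\u2013|\u2014|to)\s*$", re.IGNORECASE)


def merge_date_range(text, first, second):
    """Return merged (start, end) offsets if the spans form a range, else None.

    first and second are (start_char, end_char) tuples in document order.
    """
    gap = text[first[1]:second[0]]
    if RANGE_GAP_RE.match(gap):
        return (first[0], second[1])
    return None


line = "Acme Corp  Jan 2019 - Mar 2021"
# Compute offsets from the text so they cannot drift out of sync with it.
start_a = line.index("Jan 2019")
start_b = line.index("Mar 2021")
first = (start_a, start_a + len("Jan 2019"))
second = (start_b, start_b + len("Mar 2021"))
merged = merge_date_range(line, first, second)
merged_text = line[merged[0]:merged[1]] if merged else None
```

Because the check reads only the text between two known spans, it adds constant
work per pair and leaves the overall loop structure unchanged.
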
I wrote the outputs to a text file listing each file name, the company names
(ORG) found in that resume, and the corresponding dates (DATE). Some resumes
give better results than others, and some do poorly, with nonsense outputs. I
assume that part of the reason the program outputs so many false company names
is that I used a large pretrained model without tuning. When I switched to the
smallest model, I got significantly fewer outputs, but it also dropped some
correct results. I finally decided to use the medium size, since it gave the
most correct results.

The results are in output.txt, and the source code is in main.py.

With this filtering method, I was able to pull out company names and time
ranges. However, many of my outputs include false company names, which either
belong to other sections of the resumes or are job titles or technology terms.
I think that with enough training data, the results would improve significantly.