This is a machine learning project that screens resumes by job category based on their content, using NLP techniques.
A Machine Learning Project for Screening Resumes
Exploratory Data Analysis (EDA)
Importing the necessary libraries, reading the data, and performing basic checks
# Importing the required libraries
import numpy as np
import pandas as pd
pd.set_option("display.precision", 2)

import seaborn as sns
sns.set_style('whitegrid')

import matplotlib.pyplot as plt
import re
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
# Importing and reading the .csv file
df = pd.read_csv('ResumeDataSet.csv')
print("The number of rows are", df.shape[0], "and the number of columns are", df.shape[1])
df.head()
The number of rows are 962 and the number of columns are 2
   Category      Resume
0  Data Science  Skills * Programming Languages: Python (pandas...
1  Data Science  Education Details \r\nMay 2013 to May 2017 B.E...
2  Data Science  Areas of Interest Deep Learning, Control Syste...
3  Data Science  Skills • R • Python • SAP HANA • Table...
4  Data Science  Education Details \r\n MCA YMCAUST, Faridab...
# Checking the information of the dataframe (i.e. the dataset)
df.info()

# Checking the number of unique values in each column
df.nunique()
Category 25
Resume 166
dtype: int64
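Note that only 166 of the 962 resumes are unique, so the dataset contains many duplicate resume texts. A quick, optional sketch to quantify the duplication (assuming df is loaded as above; this check is not part of the original run):

# Counting duplicated resume texts (optional sanity check)
print("Duplicate resume rows:", df['Resume'].duplicated().sum())
print("Duplicate (Category, Resume) pairs:", df.duplicated().sum())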
Plotting the share of each Category as a count plot and a pie plot
# Plotting the distribution of categories as a count plot
plt.figure(figsize=(15, 15))
sns.countplot(y="Category", data=df)
df["Category"].value_counts()
# Plotting the distribution of categories as a pie plot
plt.figure(figsize=(18, 18))
counts = df['Category'].value_counts()
plt.title("Category-wise Distribution", fontsize=20)
plt.pie(counts.values, labels=counts.index, autopct='%1.2f%%', shadow=True)
df["Category"].value_counts() * 100 / df.shape[0]  # Share of each category in percent
Cleaning out all the unnecessary content from the Resume column
# Function to clean the resume text
def clean(data):
    data = re.sub(r'http\S+\s*', ' ', data)  # Removing links
    data = re.sub(r'RT|cc', ' ', data)       # Removing RT and cc
    data = re.sub(r'#\S+', ' ', data)        # Removing hashtags
    data = re.sub(r'@\S+', ' ', data)        # Removing mentions
    data = data.lower()                      # Changing the text to lowercase
    data = ''.join([i if 32 < ord(i) < 128 else ' ' for i in data])  # Removing non-ASCII characters
    data = re.sub(r'\s+', ' ', data)         # Collapsing extra whitespace
    data = re.sub('[%s]' % re.escape("""!"#$%&'()*+,-./:;<=>?@[]^_`{|}~"""), ' ', data)  # Removing punctuation
    return data

cleaned_df = df['Category'].to_frame()
cleaned_df['Resume'] = df['Resume'].apply(lambda x: clean(x))  # Applying the clean function
cleaned_df
     Category      Resume
0    Data Science  skills programming languages python pandas...
1    Data Science  education details may 2013 to may 2017 b e ...
2    Data Science  areas of interest deep learning control syste...
3    Data Science  skills r python sap hana table...
4    Data Science  education details mca ymcaust faridabad...
...  ...           ...
957  Testing       computer skills proficient in ms office ...
958  Testing       willingnes to a ept the challenges po...
959  Testing       personal skills quick learner eagerne...
960  Testing       computer skills software knowledge ms power ...
961  Testing       skill set os windows xp 7 8 8 1 10 database my...

962 rows × 2 columns
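As a quick sanity check, the clean function can also be applied to a short made-up string to see what survives (the string below is purely illustrative, not from the dataset):

# Trying clean() on an illustrative string
sample = "RT Check out https://example.com #hiring @recruiter Skills: Python, SQL!"
print(clean(sample))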
Encoding the Category data
# Encoding the Category column using LabelEncoder
encoder = LabelEncoder()
cleaned_df['Category'] = encoder.fit_transform(cleaned_df['Category'])
cleaned_df
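LabelEncoder assigns each category name an integer in alphabetical order, and the mapping can be inspected or reversed via the fitted encoder. A minimal sketch, assuming the encoder fitted above:

# Inspecting the label mapping of the fitted encoder
for code, name in enumerate(encoder.classes_):
    print(code, '->', name)
# encoder.inverse_transform([6]) maps an integer label back to its category name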
# Creating a word vectorizer and transforming the resume texts
Resume = cleaned_df['Resume'].values
Category = cleaned_df['Category'].values
word_vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words='english', max_features=1000)
word_vectorizer.fit(Resume)
WordFeatures = word_vectorizer.transform(Resume)
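With max_features=1000, every resume becomes a 1000-dimensional sparse TF-IDF vector. A small sketch to confirm the shape and peek at the learned vocabulary (get_feature_names_out requires scikit-learn >= 1.0; older versions use get_feature_names):

# Checking the feature matrix and a slice of the vocabulary
print(WordFeatures.shape)                            # (962, 1000)
print(word_vectorizer.get_feature_names_out()[:10])  # first 10 vocabulary terms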
Training our Machine Learning Model
Splitting the dataset into train and test data
# Splitting the data into train and test sets, printing the shape of each,
# and fitting a KNeighborsClassifier wrapped in OneVsRestClassifier
X_train, X_test, y_train, y_test = train_test_split(WordFeatures, Category, random_state=2, test_size=0.2)
print(f'The shape of the training data {X_train.shape}')
print(f'The shape of the test data {X_test.shape}')
clf = OneVsRestClassifier(KNeighborsClassifier())
clf.fit(X_train, y_train)
The shape of the training data (769, 1000)
The shape of the test data (193, 1000)
OneVsRestClassifier(estimator=KNeighborsClassifier())
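Since the category counts are imbalanced, a stratified split would keep each category's share the same in both sets. This is an optional variation on the split above, not used for the results reported below:

# Optional: stratified train/test split (a variation, not part of the original run)
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    WordFeatures, Category, random_state=2, test_size=0.2, stratify=Category)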
Computing the accuracy metrics and classification report
# Predicting labels for the test data using the trained model and checking the appropriate metrics
prediction = clf.predict(X_test)
print(f'Accuracy of KNeighbors Classifier on test set: {clf.score(X_test, y_test):.2f}\n')
print(f'The classification report\n{metrics.classification_report(y_test, prediction)}\n\n')
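To score an unseen resume with the trained model, the same pipeline has to be applied end to end: clean the text, transform it with the already-fitted vectorizer, predict, and map the integer label back to a category name. A minimal sketch (the resume string below is made up for illustration):

# Predicting the category of a new, made-up resume
new_resume = "Experienced in Python, pandas, scikit-learn and deep learning projects"
features = word_vectorizer.transform([clean(new_resume)])
print(encoder.inverse_transform(clf.predict(features)))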