Fake Job Postings Detection

This repo contains code for detecting fake job postings using machine learning techniques (app_1_ml.ipynb) and neural network techniques (app_2_nn.ipynb). The code is written in Python and uses the following libraries: Pandas, NumPy, Seaborn, Matplotlib, NLTK, Scikit-learn, BeautifulSoup, spaCy, WordCloud, LightGBM, XGBoost, CatBoost, Keras, and TensorFlow.

Overview for approach-1

The code performs the following tasks:

Import necessary modules.
Import the dataset (fake.csv).
Data exploration and preprocessing:
- Handling missing values.
- Checking for outliers and removing them.
- Exploring the distribution of categorical and numerical features.
Visualization of target variables.
Feature engineering and preprocessing:
- Combining text features.
- Text preprocessing (removing HTML tags, URLs, special characters, stopwords, lemmatization, etc.).
- Vectorizing text data using CountVectorizer.
- Splitting the dataset into training and testing sets.
Model building and evaluation:
- Logistic Regression, Multinomial Naive Bayes, Support Vector Machine, and Decision Tree Classifier are trained and evaluated.
- Evaluation metrics include accuracy, precision, recall, F1 score, and confusion matrix.
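
The steps above can be sketched in a few lines. This is a minimal illustration, not the notebook's code: the sample texts and labels are made up, `clean_text` is a hypothetical helper, and scikit-learn's built-in English stop-word list stands in for the NLTK stopword removal and lemmatization used in the notebook.

```python
import html
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split


def clean_text(text):
    """Strip HTML tags, URLs, and non-letter characters, then lowercase."""
    text = html.unescape(text)
    text = re.sub(r"<[^>]+>", " ", text)                # HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"[^a-zA-Z\s]", " ", text)            # digits, punctuation
    return re.sub(r"\s+", " ", text).strip().lower()


# Made-up postings for illustration only: 0 = real, 1 = fake.
texts = [
    "Software engineer needed, five years of Python experience required",
    "Accountant position at an established firm, CPA preferred",
    "Registered nurse for night shifts at the city hospital",
    "Marketing manager to lead a regional campaign team",
    "Mechanical engineer for automotive design projects",
    "Data analyst with SQL and reporting experience",
    "<b>Earn $5000 weekly</b> from home, no experience, apply now",
    "Work from home and get rich quick, send your bank details",
    "Easy money! Click http://example.com to start earning today",
    "No interview required, pay a small fee to secure your job",
    "Unlimited income typing at home, wire a registration fee",
    "Get paid daily for simple tasks, send your card details",
]
labels = [0] * 6 + [1] * 6

cleaned = [clean_text(t) for t in texts]

# Bag-of-words counts; stop_words="english" is an assumed setting.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(cleaned)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, stratify=labels, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print("accuracy:", accuracy)
print(cm)
```

With a dataset this small the scores carry no meaning. The point is only the order of the steps: clean, vectorize, split, fit, evaluate.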

Overview for approach-2

The code performs the following tasks:

Import necessary modules and libraries.
Import the dataset (fake.csv) containing job postings.
Data cleaning and preprocessing:
- Standardize text fields by removing special characters, URLs, and non-alphanumeric characters.
- Tokenize and lemmatize text data using NLTK.
- Split the dataset into features (X) and target variable (y).
Data balancing:
- Use the NearMiss undersampling technique to balance the dataset.
Model building and evaluation:
- Train various classification models including Logistic Regression, SGD Classifier, Decision Tree Classifier, Random Forest Classifier, AdaBoost Classifier, Gradient Boosting Classifier, HistGradientBoosting Classifier, LightGBM Classifier, XGBoost Classifier, CatBoost Classifier, and LSTM.
- Evaluate models using classification metrics such as accuracy, precision, recall, F1 score, ROC-AUC score, and confusion matrix.
- Compare model performance on balanced and unbalanced data.
Use of TF-IDF Vectorizer and Word Embeddings for text representation.
Plotting training and validation loss/accuracy curves for deep learning models.

Usage

To run the code:

Clone this repository:

git clone https://github.com/RAHULFROST7/Fake-Job-post-Detection.git

Navigate to the cloned directory:

cd Fake-Job-post-Detection

Install the required dependencies.
Execute app_1_ml.ipynb for the machine learning approach or app_2_nn.ipynb for the neural network approach.

Requirements

Ensure you have Python installed on your system. Additionally, the following Python packages are required:

numpy
pandas
seaborn
matplotlib
nltk
scikit-learn
wordcloud
beautifulsoup4
spacy
lightgbm
xgboost
keras
catboost
tensorflow
imbalanced-learn
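
To check which of these packages are already installed before opening the notebooks, a small script along the following lines may help. It is a sketch and not part of the repository; `missing_packages` is a hypothetical helper. Note that a few pip package names differ from their import names (scikit-learn imports as sklearn, beautifulsoup4 as bs4, imbalanced-learn as imblearn).

```python
import importlib.util

# Maps each pip package name to the module name used in `import`.
REQUIRED = {
    "numpy": "numpy",
    "pandas": "pandas",
    "seaborn": "seaborn",
    "matplotlib": "matplotlib",
    "nltk": "nltk",
    "scikit-learn": "sklearn",
    "wordcloud": "wordcloud",
    "beautifulsoup4": "bs4",
    "spacy": "spacy",
    "lightgbm": "lightgbm",
    "xgboost": "xgboost",
    "keras": "keras",
    "catboost": "catboost",
    "tensorflow": "tensorflow",
    "imbalanced-learn": "imblearn",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```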