This repo contains code for detecting fake job postings using machine learning techniques (app_1_ml.ipynb) and various neural network techniques (app_2_nn.ipynb). The code is written in Python and uses libraries such as Pandas, NumPy, Seaborn, Matplotlib, NLTK, Scikit-learn, BeautifulSoup, spaCy, WordCloud, LightGBM, XGBoost, Keras, CatBoost, and TensorFlow.
The machine learning notebook (app_1_ml.ipynb) performs the following tasks:
- Import necessary modules.
- Import the dataset (fake.csv).
- Data exploration and preprocessing:
- Handling missing values.
- Checking for outliers and removing them.
- Exploring the distribution of categorical and numerical features.
- Visualization of target variables.
- Feature engineering and preprocessing:
- Combining text features.
- Text preprocessing (removing HTML tags, URLs, special characters, stopwords, lemmatization, etc.).
- Vectorizing text data using CountVectorizer.
- Splitting the dataset into training and testing sets.
- Model building and evaluation:
- Logistic Regression, Multinomial Naive Bayes, Support Vector Machine, and Decision Tree Classifier are trained and evaluated.
- Evaluation metrics include accuracy, precision, recall, F1 score, and confusion matrix.
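The text preprocessing step listed above could look roughly like the following. This is a minimal sketch using only the Python standard library; the notebook itself may rely on BeautifulSoup, NLTK, or spaCy instead. The function name `clean_text`, the regular expressions, and the tiny stopword set are illustrative assumptions, and lemmatization is left out.

```python
import html
import re

# Tiny illustrative stopword set (an assumption); a real run would
# more likely use a full list such as the one shipped with NLTK.
STOPWORDS = {"a", "an", "and", "the", "to", "of", "in", "for", "is", "at"}

TAG_RE = re.compile(r"<[^>]+>")                 # HTML tags
URL_RE = re.compile(r"https?://\S+|www\.\S+")   # URLs
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")       # special characters


def clean_text(raw: str) -> str:
    """Strip HTML tags, URLs, special characters, and stopwords."""
    text = TAG_RE.sub(" ", raw)         # drop HTML tags
    text = html.unescape(text)          # decode entities such as &amp;
    text = URL_RE.sub(" ", text)        # drop URLs
    text = text.lower()
    text = NON_ALNUM_RE.sub(" ", text)  # keep only letters, digits, whitespace
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


sample = "<p>Earn $5000 a week! Apply at https://example.com/apply &amp; start now</p>"
print(clean_text(sample))  # earn 5000 week apply start now
```

The cleaned strings can then be passed to a vectorizer such as CountVectorizer before splitting the data and training the classifiers.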
The neural network notebook (app_2_nn.ipynb) performs the following tasks:
- Import necessary modules and libraries.
- Import the dataset (fake.csv) containing job postings.
- Data cleaning and preprocessing:
- Standardize text fields by removing special characters, URLs, and non-alphanumeric characters.
- Tokenize and lemmatize text data using NLTK.
- Split the dataset into features (X) and target variable (y).
- Data balancing:
- Use the NearMiss undersampling technique to balance the dataset.
- Model building and evaluation:
- Train various classification models including Logistic Regression, SGD Classifier, Decision Tree Classifier, Random Forest Classifier, AdaBoost Classifier, Gradient Boosting Classifier, HistGradientBoosting Classifier, LightGBM Classifier, XGBoost Classifier, CatBoost Classifier, and LSTM.
- Evaluate models using classification metrics such as accuracy, precision, recall, F1 score, ROC-AUC score, and confusion matrix.
- Compare models' performance with balanced and unbalanced data.
- Represent text with a TF-IDF vectorizer and word embeddings.
- Plot training and validation loss/accuracy curves for the deep learning models.
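To illustrate why the evaluation above reports precision, recall, and F1 alongside accuracy, and why results are compared on balanced and unbalanced data, here is a small pure-Python sketch. The helper name `classification_metrics` and the label counts (95 real postings, 5 fake) are made up for illustration and are not taken from fake.csv; the notebooks presumably use the Scikit-learn metric functions instead.

```python
def classification_metrics(y_true, y_pred):
    """Confusion-matrix counts and derived metrics for binary labels (1 = fake)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical imbalanced test set: 95 real postings, 5 fake ones.
y_true = [0] * 95 + [1] * 5

# A model that always predicts "real" looks accurate but finds no fakes.
always_real = classification_metrics(y_true, [0] * 100)
print(always_real["accuracy"], always_real["recall"])  # 0.95 0.0

# A model that flags 10 real postings by mistake but catches 4 of 5 fakes.
y_pred = [1] * 10 + [0] * 85 + [1] * 4 + [0] * 1
catches_fakes = classification_metrics(y_true, y_pred)
print(catches_fakes["accuracy"], catches_fakes["recall"])  # 0.89 0.8
```

In this made-up case the second model has lower accuracy but is far more useful for finding fake postings, which is the kind of trade-off that recall, F1, and ROC-AUC make visible.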
To run the code:
- Clone this repository:
git clone https://github.com/RAHULFROST7/Fake-Job-post-Detection.git
- Navigate to the cloned directory:
cd Fake-Job-post-Detection
- Install the required dependencies.
- Execute app_1_ml.ipynb for the machine learning approach or app_2_nn.ipynb for the neural network approach.
Ensure you have Python installed on your system. Additionally, the following Python packages are required:
- numpy
- pandas
- seaborn
- matplotlib
- nltk
- scikit-learn
- wordcloud
- beautifulsoup4
- spacy
- lightgbm
- xgboost
- keras
- catboost
- tensorflow
- imbalanced-learn
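Several of the packages listed above are imported under a different name than the one used to install them (for example, scikit-learn is imported as `sklearn`). The following sketch, which is not part of the repository, checks which of the listed packages are importable in the current environment. The helper name `missing_packages` is an assumption made for this example.

```python
import importlib.util

# pip package name (as listed above) -> module name used in import statements
PACKAGE_TO_MODULE = {
    "numpy": "numpy",
    "pandas": "pandas",
    "seaborn": "seaborn",
    "matplotlib": "matplotlib",
    "nltk": "nltk",
    "scikit-learn": "sklearn",
    "wordcloud": "wordcloud",
    "beautifulsoup4": "bs4",
    "spacy": "spacy",
    "lightgbm": "lightgbm",
    "xgboost": "xgboost",
    "keras": "keras",
    "catboost": "catboost",
    "tensorflow": "tensorflow",
    "imbalanced-learn": "imblearn",
}


def missing_packages(mapping=PACKAGE_TO_MODULE):
    """Return the pip names of packages whose module cannot be found."""
    return [pkg for pkg, module in mapping.items()
            if importlib.util.find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages; install with: pip install " + " ".join(missing))
    else:
        print("All listed packages are importable.")
```

Running this before opening the notebooks gives a quick list of anything that still needs to be installed.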