This project aims to develop a machine learning model to classify news articles as either real or fake. We use a dataset of labeled news articles and implement various natural language processing and deep learning techniques to create an effective classifier.
We use the ISOT Fake News Dataset, which contains two types of articles: fake and real news.
- Source: The dataset was collected from real-world sources.
- Content:
  - True.csv: Contains over 12,600 true articles from Reuters.com
  - Fake.csv: Contains over 12,600 fake articles from various unreliable websites
- Time Period: Focused on articles from 2016 to 2017
- Features: Each article contains the title, text, type, and publication date
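One way the two files might be combined into a single labeled table is sketched below with pandas. The function name `load_isot`, the added `label` column, and the 1 = real / 0 = fake encoding are assumptions made for illustration; they are not taken from the project code.

```python
import pandas as pd


def load_isot(true_path="True.csv", fake_path="Fake.csv", seed=42):
    """Load both CSV files and return one shuffled, labeled DataFrame."""
    true_df = pd.read_csv(true_path)
    fake_df = pd.read_csv(fake_path)

    # Hypothetical label encoding: 1 = real (Reuters), 0 = fake.
    true_df["label"] = 1
    fake_df["label"] = 0

    # Stack both sets, then shuffle so the two classes are interleaved.
    combined = pd.concat([true_df, fake_df], ignore_index=True)
    return combined.sample(frac=1.0, random_state=seed).reset_index(drop=True)
```

Calling `load_isot()` from the directory that holds both files returns every article with its label, ready for cleaning and splitting.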
- URL removal
- Special character and number removal
- Text lowercasing
- Stopword removal
- Stemming
- Text length calculation
- Title length calculation
- Readability score computation (using Flesch-Kincaid Grade Level)
- Sentiment analysis for title and text
- TF-IDF (Term Frequency-Inverse Document Frequency) vectorization
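The cleaning and vectorization steps listed above could look roughly like the following sketch. It makes several assumptions: the stopword list comes from scikit-learn's `ENGLISH_STOP_WORDS` rather than nltk (to avoid a corpus download, so the removed words may differ from the project's), stemming uses nltk's `PorterStemmer`, text length is measured in characters, and `max_features=5000` is an arbitrary cap. The helper names `clean_text` and `build_features` are hypothetical. Readability and sentiment scoring are left out to keep the example short.

```python
import re

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_LETTER_RE = re.compile(r"[^a-zA-Z\s]")
stemmer = PorterStemmer()


def clean_text(text):
    """Apply the cleaning steps in order and return a space-joined string."""
    text = URL_RE.sub(" ", text)          # URL removal
    text = NON_LETTER_RE.sub(" ", text)   # special character and number removal
    tokens = text.lower().split()         # lowercasing, whitespace tokenizing
    kept = [t for t in tokens if t not in ENGLISH_STOP_WORDS]  # stopword removal
    return " ".join(stemmer.stem(t) for t in kept)             # stemming


def build_features(texts, max_features=5000):
    """Return (character lengths, TF-IDF matrix, fitted vectorizer)."""
    lengths = [len(t) for t in texts]     # assumed: length of the raw text in characters
    cleaned = [clean_text(t) for t in texts]
    vectorizer = TfidfVectorizer(max_features=max_features)
    tfidf = vectorizer.fit_transform(cleaned)  # sparse matrix, one row per article
    return lengths, tfidf, vectorizer
```

In practice the vectorizer should be fitted on the training split only and then applied to the test split with `vectorizer.transform`, so that test vocabulary does not leak into training.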
We implement two neural network models:
- Simple Model
- Optimized Model
- L1/L2 Regularization
- Dropout
- Batch Normalization
- Early Stopping
- Learning Rate Adjustment
- Hyperparameter Tuning
- pandas, numpy: Data manipulation
- matplotlib, seaborn: Data visualization
- nltk: Natural Language Processing
- scikit-learn: Machine Learning utilities
- tensorflow: Deep Learning framework
- textstat: Readability scoring
- textblob: Sentiment analysis
- Simple Model:
  - Input Layer
  - 3 Dense Layers with ReLU activation
  - Output Layer with Sigmoid activation
- Optimized Model:
  - Input Layer
  - 3 Dense Layers with ReLU activation and L1/L2 regularization
  - Batch Normalization after each Dense Layer
  - Dropout Layers
  - Output Layer with Sigmoid activation
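The two architectures can be sketched with a single Keras builder, shown below. Only the layer types come from the description above; the layer widths, regularization strengths, dropout rate and placement, optimizer, and the use of `ReduceLROnPlateau` for learning rate adjustment are all assumptions, and the names `build_model` and `training_callbacks` are hypothetical.

```python
import tensorflow as tf


def build_model(input_dim, optimized=False, first_units=128,
                l1=1e-5, l2=1e-4, dropout_rate=0.3, learning_rate=1e-3):
    """Build the simple (default) or the regularized variant of the classifier."""
    reg = tf.keras.regularizers.l1_l2(l1=l1, l2=l2) if optimized else None
    layers = [tf.keras.Input(shape=(input_dim,))]

    # Three hidden Dense layers; halving the width each time is an assumption.
    for units in (first_units, first_units // 2, first_units // 4):
        layers.append(tf.keras.layers.Dense(units, activation="relu",
                                            kernel_regularizer=reg))
        if optimized:
            layers.append(tf.keras.layers.BatchNormalization())
            layers.append(tf.keras.layers.Dropout(dropout_rate))

    # Single sigmoid unit for binary classification.
    layers.append(tf.keras.layers.Dense(1, activation="sigmoid"))

    model = tf.keras.Sequential(layers)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model


def training_callbacks():
    """Early stopping plus one possible form of learning rate adjustment."""
    return [
        tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                         restore_best_weights=True),
        tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                                             patience=2),
    ]
```

`build_model(n_features)` gives the simple variant and `build_model(n_features, optimized=True)` the regularized one; the callbacks would be passed to `model.fit(..., validation_split=0.2, callbacks=training_callbacks())`.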
We use a manual random search to tune the following hyperparameters:
- Batch size
- Number of epochs
- Learning rate
- Number of neurons in the first layer
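A manual random search over these four hyperparameters can be written in a few lines of plain Python. The sketch below is illustrative only: the candidate values in `SEARCH_SPACE`, the number of trials, and the `train_and_score` callback (which is expected to train a model for a given configuration and return its validation accuracy) are assumptions, since the project's actual search ranges are not listed here.

```python
import random

# Hypothetical search space; the project's actual candidate values are not given.
SEARCH_SPACE = {
    "batch_size": [32, 64, 128],
    "epochs": [5, 10, 20],
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "first_units": [64, 128, 256],
}


def random_search(train_and_score, space=SEARCH_SPACE, n_trials=10, seed=0):
    """Sample n_trials random configurations and return the best one.

    train_and_score(config) must train a model with the given configuration
    and return a score where higher is better (e.g. validation accuracy).
    """
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        # Draw one value per hyperparameter, independently and uniformly.
        config = {name: rng.choice(values) for name, values in space.items()}
        score = train_and_score(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

Unlike a grid search, this evaluates only `n_trials` configurations, which keeps the cost fixed no matter how many candidate values each hyperparameter has.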
- Simple Model Accuracy: 0.9915
- Optimized Model Accuracy: 0.9812
- Accuracy Difference (Optimized minus Simple): -0.0104
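For reference, accuracy figures like these are typically obtained by thresholding the sigmoid outputs and comparing them with the true labels. The sketch below assumes a 0.5 threshold and uses toy values, not the project's predictions; the helper name `accuracy` is hypothetical.

```python
def accuracy(probabilities, labels, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Toy data only: four sigmoid outputs per model and their true labels.
labels = [1, 0, 1, 0]
simple_acc = accuracy([0.9, 0.2, 0.7, 0.4], labels)
optimized_acc = accuracy([0.9, 0.6, 0.7, 0.4], labels)

# A negative difference means the optimized model scored lower.
difference = optimized_acc - simple_acc
```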
In this project, we developed and compared two neural network models for classifying news articles as fake or real. The simple model achieved higher accuracy (0.9915) than the optimized model (0.9812), even though the latter added regularization and hyperparameter tuning. This suggests that further tuning or different architectures may be necessary for significant improvements.
- Experiment with more advanced architectures (e.g., LSTM, transformer-based models)
- Incorporate external knowledge bases for fact-checking
- Implement an ensemble of different models
- Explore more features related to writing style and source credibility
- Clone this repository
- Install the required packages: `pip install -r requirements.txt`
- Run the Jupyter notebook or Python script
- The trained models will be saved in the 'saved_models' directory
- Ahmed H, Traore I, Saad S. "Detecting opinion spams and fake news using text classification", Journal of Security and Privacy, Volume 1, Issue 1, Wiley, January/February 2018.
- Ahmed H, Traore I, Saad S. (2017) "Detection of Online Fake News Using N-Gram Analysis and Machine Learning Techniques". In: Traore I., Woungang I., Awad A. (eds) Intelligent, Secure, and Dependable Systems in Distributed and Cloud Environments. ISDDC 2017. Lecture Notes in Computer Science, vol 10618. Springer, Cham (pp. 127-138).