Required Libraries
- numpy 1.20.1
- scipy 1.6.2
- pandas 1.2.4
- nltk 3.6.2
- gensim 4.0.1
- scikit-learn 0.24.1
- torch 1.9.0
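
If you want to verify that your environment matches these versions, a quick sanity check (not part of this repo) is:

```python
# Print the installed version of each required library
# (sklearn is scikit-learn's import name).
import numpy, scipy, pandas, nltk, gensim, sklearn, torch

for mod in (numpy, scipy, pandas, nltk, gensim, sklearn, torch):
    print(f"{mod.__name__:8s} {mod.__version__}")
```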
Load Dataset Two datasets and one pre-trained language model need to be downloaded and placed in the "fake-and-real-news-dataset" folder: (1) fake news data (23,538 articles), (2) real news data (21,418 articles), and (3) the Google pre-trained word2vec model (3 million words, each with a 300-dim vector). Backup online storage is also available: fake news data here, real news data here, and the pre-trained word2vec model here.
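
Below is a minimal sketch of loading these files with pandas and gensim. The file names `Fake.csv`, `True.csv`, and `GoogleNews-vectors-negative300.bin.gz` are assumptions (the usual names for these downloads); adjust them to match your copies:

```python
# Minimal loading sketch; file names are assumptions, adjust as needed.
import pandas as pd
from gensim.models import KeyedVectors

fake_df = pd.read_csv("fake-and-real-news-dataset/Fake.csv")   # 23,538 fake articles
real_df = pd.read_csv("fake-and-real-news-dataset/True.csv")   # 21,418 real articles
w2v = KeyedVectors.load_word2vec_format(
    "fake-and-real-news-dataset/GoogleNews-vectors-negative300.bin.gz",
    binary=True,
)
print(len(fake_df), len(real_df), w2v.vector_size)             # vector_size == 300
```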
Article Embedding
- Word Graph Construction. We construct an undirected word graph for each input news article. Briefly, if two words co-occur within a sliding window of a specified length, an edge connects them. For example, for "I eat an apple" with a window of length 3, the edges are {I-eat, I-an, eat-an, eat-apple, an-apple} (with stop words kept). More details on constructing a word graph can be found in TextRank.
- Geometric Feature Extraction. We use the idea of the SDG model to obtain node embeddings. Briefly, a node's representation is aggregated from its neighbors' features, weighted by its personalized PageRank vector. We then apply a pooling function (such as sum or mean pooling) to aggregate the node embeddings into a graph-level representation vector for each constructed word graph.
- You can run sdg_model.py to extract a vector representation for each news article; the script then stores a feature matrix and a label matrix for all observed fake/real news articles. Alternatively, you can download the extracted feature_matrix.pkl and label_matrix.pkl and put them in the root directory of this repository. A simplified sketch of the two steps above follows this list.
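
The following is a simplified, self-contained sketch of both steps, assuming `w2v` maps each word to its 300-dim vector (see Load Dataset). The function names, the restart probability `alpha`, and the power-iteration PPR approximation are illustrative, not the exact implementation in sdg_model.py:

```python
# Illustrative sketch of sdg_model.py's two steps; details may differ.
import numpy as np
from itertools import combinations

def build_word_graph(tokens, window=3):
    """Undirected adjacency over words co-occurring in a sliding window."""
    words = sorted(set(tokens))
    index = {w: i for i, w in enumerate(words)}
    adj = np.zeros((len(words), len(words)))
    for start in range(len(tokens) - 1):
        for u, v in combinations(tokens[start:start + window], 2):
            if u != v:
                adj[index[u], index[v]] = adj[index[v], index[u]] = 1.0
    return words, adj

def ppr_matrix(adj, alpha=0.15, iters=50):
    """Row i approximates node i's personalized PageRank vector."""
    deg = adj.sum(axis=1, keepdims=True)
    trans = adj / np.maximum(deg, 1.0)            # row-normalized transitions
    ppr = np.eye(len(adj))
    for _ in range(iters):                        # power iteration with restart
        ppr = alpha * np.eye(len(adj)) + (1 - alpha) * ppr @ trans
    return ppr

def article_embedding(tokens, w2v, window=3):
    tokens = [t for t in tokens if t in w2v]      # drop out-of-vocabulary words
    words, adj = build_word_graph(tokens, window)
    feats = np.stack([w2v[w] for w in words])     # node features from word2vec
    node_emb = ppr_matrix(adj) @ feats            # PPR-weighted neighbor aggregation
    return node_emb.mean(axis=0)                  # mean pooling -> 300-dim article vector
```

Mean pooling produces the graph-level vector here; sum pooling (`node_emb.sum(axis=0)`) is a drop-in alternative.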
- 1. Detection Effectiveness
- 1.1 [Accuracy] Run classification_acc.py, which trains a classifier with acceptable accuracy (e.g., the MuFasa model or an MLP) and saves the classification model.
- 1.2 [Precision, Recall, and F1-score] Run other_metrics.py to test the classifier in terms of other metrics, i.e., precision, recall, and F1-score. A minimal sketch covering both steps follows.
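
A minimal sketch of 1.1 and 1.2 together, using a scikit-learn MLP as the classifier; the repo scripts may use the MuFasa model and a different data split:

```python
# A minimal sketch, assuming feature_matrix.pkl / label_matrix.pkl from the
# previous step; classification_acc.py and other_metrics.py may differ.
import pickle
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

with open("feature_matrix.pkl", "rb") as f:
    X = pickle.load(f)
with open("label_matrix.pkl", "rb") as f:
    y = np.ravel(pickle.load(f))                 # 1 = fake, 0 = real (assumed)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=300).fit(X_tr, y_tr)
pred = clf.predict(X_te)

print("accuracy :", accuracy_score(y_te, pred))
print("precision:", precision_score(y_te, pred))
print("recall   :", recall_score(y_te, pred))
print("f1-score :", f1_score(y_te, pred))
```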
- 2. Detection Explanation
- 2.1 [Misleading Degree] Run top_n_words.py to find the top n misleading words of any news article. The misleading degree of each word is the difference in the ground-truth prediction probability after masking the corresponding node. For example, when word w is present in fake news article a, the article is detected as fake with probability p; when we mask w in a, the article is detected as fake with probability q. The probability gain (i.e., misleading degree) of w for article a is therefore (q - p). If (q - p) is greater than 0, word w helps camouflage the fake article a to bypass detection. A sketch of this computation follows.
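
A sketch of this computation, reusing `clf` and `article_embedding` from the sketches above. The helper names are illustrative, and top_n_words.py may mask nodes in the graph rather than removing tokens:

```python
# Masking-based misleading degree; a simplified illustration only.
def misleading_degree(tokens, word, clf, w2v):
    """Return q - p: change in fake probability after masking `word`."""
    fake = list(clf.classes_).index(1)           # column of the "fake" class
    p = clf.predict_proba([article_embedding(tokens, w2v)])[0, fake]
    masked = [t for t in tokens if t != word]    # mask by removing the word
    q = clf.predict_proba([article_embedding(masked, w2v)])[0, fake]
    return q - p                                 # > 0: word camouflages the article

def top_n_misleading(tokens, clf, w2v, n=5):
    scores = {w: misleading_degree(tokens, w, clf, w2v) for w in set(tokens)}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]
```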
- 3. Detection Robustness
- 3.1 [Varying Feature Dimensions] Run dimension_redution.py to reduce the article embedding dimension with PCA and train a new classifier (e.g., the MuFasa model or an MLP) on the truncated feature matrix.
- 3.2 [Label Noise Injection] Run noisy_label_injection.py to flip a certain fraction (e.g., 5%) of the training labels, then examine robustness in terms of classification performance (accuracy, precision, recall, and F1-score). Sketches of both experiments follow.
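
Sketches of both robustness experiments, reusing `X_tr`, `X_te`, `y_tr`, `y_te` from the effectiveness sketch; the target dimension and noise ratio are example values, and the repo scripts may differ in detail:

```python
# Simplified versions of dimension_redution.py and noisy_label_injection.py.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

# 3.1 Varying feature dimensions: truncate embeddings with PCA, retrain.
pca = PCA(n_components=50).fit(X_tr)             # 50 dims is an example choice
clf_low = MLPClassifier(hidden_layer_sizes=(128,), max_iter=300)
clf_low.fit(pca.transform(X_tr), y_tr)
print("PCA accuracy:", clf_low.score(pca.transform(X_te), y_te))

# 3.2 Label noise injection: flip e.g. 5% of training labels, retrain.
rng = np.random.default_rng(42)
flip = rng.choice(len(y_tr), size=int(0.05 * len(y_tr)), replace=False)
y_noisy = y_tr.copy()
y_noisy[flip] = 1 - y_noisy[flip]                # binary 0/1 labels assumed
clf_noisy = MLPClassifier(hidden_layer_sizes=(128,), max_iter=300)
clf_noisy.fit(X_tr, y_noisy)
print("noisy-label accuracy:", clf_noisy.score(X_te, y_te))
```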