srmocher/fake-science

Analyzing Common Crawl data (specifically) to classify fake/real based on trained deep learning models (LSTM, CNN)

Python

Real/Fake Science

Steps

Task: Download Common Crawl Index Client tool from https://github.com/ikreymer/cdx-index-client

Run queries for url patterns like ".com/climate/",".com/science/" etc to get the matching URLs which will be stored in a file

Task: Get the HTML from the URLs and store in Solr

Run the script cc_solr.py with parameters modified accordingly.

TODO - compare existing fake training data to data pushed to Solr from CC to extract larger fake training set.
Task: Added code to train LSTM model on Climate data using Google News word2vec model. Download Google News word2vec model from below model https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing

TODO - add code for training CNN model.